Best arrays questions in September 2010

std::vector is so much slower than plain arrays?

41 votes

I've always thought it's the general wisdom that std::vector is "implemented as an array," blah blah blah. Today I went down and tested it, seems to be not so:

Here's some test results:

UseArray completed in 2.619 seconds
UseVector completed in 9.284 seconds
UseVectorPushBack completed in 14.669 seconds
The whole thing completed in 26.591 seconds

That's about 3 - 4 times slower! Doesn't really justify for the "vector may be slower for a few nanosecs" comments.

And the code I used:

#include <cstdlib>
#include <vector>

#include <iostream>
#include <string>

#include <boost/date_time/posix_time/ptime.hpp>
#include <boost/date_time/microsec_time_clock.hpp>

class TestTimer
{
public:
    TestTimer(const std::string & name) : name(name),
        start(boost::date_time::microsec_clock<boost::posix_time::ptime>::local_time())
    {
    }

    ~TestTimer()
    {
        using namespace std;
        using namespace boost;

        posix_time::ptime now(date_time::microsec_clock<posix_time::ptime>::local_time());
        posix_time::time_duration d = now - start;

        cout << name << " completed in " << d.total_milliseconds() / 1000.0 <<
            " seconds" << endl;
    }

private:
    std::string name;
    boost::posix_time::ptime start;
};

struct Pixel
{
    Pixel()
    {
    }

    Pixel(unsigned char r, unsigned char g, unsigned char b) : r(r), g(g), b(b)
    {
    }

    unsigned char r, g, b;
};

void UseVector()
{
    TestTimer t("UseVector");

    for(int i = 0; i < 1000; ++i)
    {
        int dimension = 999;

        std::vector<Pixel> pixels;
        pixels.resize(dimension * dimension);

        for(int i = 0; i < dimension * dimension; ++i)
        {
            pixels[i].r = 255;
            pixels[i].g = 0;
            pixels[i].b = 0;
        }
    }
}

void UseVectorPushBack()
{
    TestTimer t("UseVectorPushBack");

    for(int i = 0; i < 1000; ++i)
    {
        int dimension = 999;

        std::vector<Pixel> pixels;
            pixels.reserve(dimension * dimension);

        for(int i = 0; i < dimension * dimension; ++i)
            pixels.push_back(Pixel(255, 0, 0));
    }
}

void UseArray()
{
    TestTimer t("UseArray");

    for(int i = 0; i < 1000; ++i)
    {
        int dimension = 999;

        Pixel * pixels = (Pixel *)malloc(sizeof(Pixel) * dimension * dimension);

        for(int i = 0 ; i < dimension * dimension; ++i)
        {
            pixels[i].r = 255;
            pixels[i].g = 0;
            pixels[i].b = 0;
        }

        free(pixels);
    }
}

int main()
{
    TestTimer t1("The whole thing");

    UseArray();
    UseVector();
    UseVectorPushBack();

    return 0;
}

Am I doing it wrong or something? Or have I just busted this performance myth?

Edit: I'm using Release mode in MSVS2005.


In MSVC, #define _SECURE_SCL 0 reduces UseVector by half (bringing it down to 4 seconds). This is really huge IMO. Someone please upvote me a couple more times so this gets known by more folks :p

Using the following:

g++ -O3 Time.cpp -I <MyBoost>
./a.out
UseArray completed in 2.196 seconds
UseVector completed in 4.412 seconds
UseVectorPushBack completed in 8.017 seconds
The whole thing completed in 14.626 seconds

So array is twice as quick as vector.

But after looking at the code in more detail this is expected; as you run across the vector twice and the array only once. Note: when you resize() the vector you are not only allocating the memory but also running through the vector and calling the constructor on each member.

Re-Arranging the code slightly so that the vector only initializes each object once:

 std::vector<Pixel>  pixels(dimensions * dimensions, Pixel(255,0,0));

Now doing the same timing again:

g++ -O3 Time.cpp -I <MyBoost>
./a.out
UseVector completed in 2.216 seconds

The vector now performance only slightly worse than the array. IMO this difference is insignificant and could be caused by a whole bunch of things not associated with the test.

I would also take into account that you are not correctly initializing/Destroying the Pixel object in the UseArrray() method as neither constructor/destructor is not called (this may not be an issue for this simple class but anything slightly more complex (ie with pointers or members with pointers) will cause problems.

Have Arrays in .NET lost their significance?

27 votes

For every situation that warrants the use of an array ... there is an awesome collection with benefits. Is there any specific use case for Arrays any more in .NET?

Sending/Receiving data with a specific length comes to mind, ie. Serial Port, Web Request, FTP Request. Basically stuff that works on a lower level in the system. Also, most Collections are using an array for storage (Noteable exception: LinkedList<T>). Collections are just another abstraction layer.

What does sorting mean in double-byte languages?

21 votes

I have some code that sorts table columns by object properties. It occurred to me that in Japanese or Chinese (non-alphabetical languages), the strings that are sent to the sort function would be compared the way an alphabetical language would.

Take for example a list of Japanese surnames:

寿拘
松坂
松井
山田
藤本

In English, these would be Suzuki, Matsuzaka, Matsui, Yamada, Fujimoto.

When I sort the above list via Javascript, the result is:

寿拘
山田
松井
松坂
藤本

(Suzuki, Yamada, Matsui, Matsuzaka, Fujimoto) This is different from the ordering of the Japanese syllabary, which would order the list (phonetically) as Suzuki, Fujimoto, Matsui, Matsuzaka, Yamada.

What I want to know is:

  1. Does one double-byte character really get compared against the other in a sort function?
  2. What really goes on in such a sort?
  3. (Extra credit) Does the result of such a sort mean anything at all? Does the concept of sorting really work in Asian (and other) languages? If so, what does it mean and what should one strive for in creating a compare function for those languages?

ADDENDUM TO SUMMARIZE ANSWERS AND DRAW CONCLUSIONS:

First, thanks to all who contributed to the discussion. This has been very informative and helpful. Special shout-outs to bobince, Lie Ryan, Gumbo, Jeffrey Zheng, and Larry K, for their in-depth and thoughtful analyses. I awarded the check mark to Larry K for pointing me toward a solution my question failed to foresee, but I up-ticked every answer I found useful.

The consensus appears to be that:

  1. Chinese and Japanese character strings are sorted by Unicode code points, and their ordering may be predicated on a rationale that may be in some way intelligible to knowledgeable readers but is not likely to be of much practical value in helping users to find the information they're seeking.

  2. The kind of compare function that would be required to make a sort semantically or phonetically useful is far too cumbersome to consider pursuing, especially since the results would probably be less than satisfactory, and in any case the comparison algorithms would have to be changed for each language. Best just to allow the sort to proceed without even attempting a compare function.

  3. I was probably asking the wrong question here. That is, I was thinking too much "inside the box" without considering that the real question is not how do I make sorting useful in these languages, but how do I provide the user with a useful way of finding items in a list. Westerners automatically think of sorting for this purpose, and I was guilty of that. Larry K pointed me to a Wikipedia article that suggests a filtering function might be more useful for Asian readers. This is what I plan to pursue, as it's at least as fast as sorting, client-side. I will keep the column sorting because it's well understood in Western languages, and because speakers of any language would find the sorting of dates and other numerical-based data types useful. But I will also add that filtering mechanism, which would be useful in long lists for any language.

You could implement the Unicode Collation Algorithm in Javascript if you want something better than the default JS sort for strings. Might improve some things. Though as the Unicode doc states:

Collation is not uniform; it varies according to language and culture: Germans, French and Swedes sort the same characters differently. It may also vary by specific application: even within the same language, dictionaries may sort differently than phonebooks or book indices. For non-alphabetic scripts such as East Asian ideographs, collation can be either phonetic or based on the appearance of the character.

The Wikipedia article points out that since collation is so tough in non-alphabetic scripts, now a days the answer is to make it very easy to look up information by entering characters, rather than by looking through a list.

I suggest that you talk to truly knowledgeable end users of your application to see how they would best like it to behave. The problem of ordering Chinese characters is not unique to your application.

Also, if you don't want to implement the collation in your system, another solution would for you to create a Ajax service that stores the names in a MySql or other database, then looks up the data with an order statement.

How to find repeating sequence of characters in given array

13 votes

my problem is to find the repeating sequence of characters in the given array. simply, to identify the pattern in which the characters are appearing.

example: for the examples in the above image the output for the

First array should be "JAMESON"

Second array should be "RON"

Third array should be "SHAMIL"

Fourth array should be "CARPENTER"

how to deal with this problem efficiently?

For your examples, my first approach would be to

  1. get the first character of the array (for your last example, that would be C)
  2. get the index of the next appearance of that character in the array (e.g. 9)
  3. if it is found, search for the next appearance of the substring between the two appearances of the character (in this case CARPENTER)
  4. if it is found, you're done (and the result is this substring).

Of course, this works only for a very limited subset of possible arrays, where the same word is repeated over and over again, starting from the beginning, without stray characters in between, and its first character is not repeated within the word. But all your examples fall into this category - and I prefer the simplest solution which could possibly work :-)

If the repeated word contains the first character multiple times (e.g. CACTUS), the algorithm can be extended to look for subsequent occurrences of that character too, not only the first one (so that it finds the whole repeated word, not only a substring of it).

Note that this extended algorithm would give a different result for your second example, namely RONRON instead of RON.

How to Use String.Filter with Multi-Value Parameters

5 votes

In the Expression builder window in SQL Server Reporting Services 2008 R2 under Common Functions -> Text -> Item, there is an expression called Filter. This appears to correspond with the Strings.Filter method in the .NET framework. The description of Filter is as follows:

Returns a zero-based array containing a subset of a String array based on specified filter criteria.

The Example is as follows:

=Filter(Parameters!MultivalueParameter.Value, "3", True, CompareMethod.Binary)

The example and description imply that you can inspect a multi-value parameter to see if at least one of the selected values is equal to the Match parameter. I haven't been able to get this to return anything other than #Error which implies the multi-value parameter is not a one-dimensional array. Parameters!MultivalueParameter.Value.GetType().ToString() returns System.Object[].

Does anyone know how to get this to work? I'm using the following work around to check if values were selected in the multi-value parameter:

=IIF(InStr(" " + JOIN(Parameters!MultivalueParameter.Value, " ") + " ", " 3 ", CompareMethod.Text), false, true)

The above code works, but it is pretty ugly. I would prefer to use the Filter function if it supports this kind of check. Can anyone give an example of code that works?

The problem is occurring because the example from MSDN is somewhat lacking for this discussion. It is true that =Filter(Parameters!MultivalueParameter.Value, "3", True, CompareMethod.Binary) returns an array but, in terms of SSRS, you can't simply output an array to a report. That is part of the reason why you're seeing the error.

Also, SSRS seems to have problems handling the two optional parameters of the Filter function. Leave those out and you'll be good to go.

You can immediately test this out by outputting the length of the array to a textboc.

=Filter(Parameters!MultivalueParameter.Value, "3").Length

The above should result in a textbox with the integer length of the filtered parameter.

So, pullling this all together, you can achieve your desired result with the following:

=IIF(Filter(Parameters!MultivalueParameter.Value, " 3 ").Length > 0, "false", "true")