
I'm seeing some strange behavior with a std::map (or std::set; they seem to behave the same in this scenario). It might be that I have a grave misunderstanding about how this should work. I'm using VS2010 SP1.

Take for example this function:

extern time_t g_nElapsed;
UINT Thread(LPVOID _param)
{
    UINT nRuns = (UINT)_param;

    for(UINT i=0; i<nRuns; ++i)
    {
        time_t _1 = time(NULL);
        std::set<UINT> cRandomSet;
        cRandomSet.insert(1);
        cRandomSet.insert(2);
        cRandomSet.insert(3);
        cRandomSet.insert(4);
        g_nElapsed += (time(NULL) - _1);
    }


    return 0;
}

Now, if I run 8 threads with 100,000 iterations each, g_nElapsed will be roughly 40 seconds. If I run 1 thread with 800,000 iterations, g_nElapsed will be about 5 seconds. I am under the impression that g_nElapsed should be about the same for any reasonable number of threads. So to speak... the processor usage increases with the number of threads, even though the work stays the same. However, it seems that some sort of resource contention with the set causes the runtime to increase. But why? It's thread-local...

I'm sure it's a simple misconception and a simple fix, but I am not quite sure what the problem is here.

The following code does not exhibit this behavior:

extern time_t g_nElapsed;
UINT Thread(LPVOID _param)
{
    UINT nRuns = (UINT)_param;

    for(UINT i=0; i<nRuns; ++i)
    {
        time_t _1 = time(NULL);
        UINT n[4];
        n[0] = 1;
        n[1] = 1;
        n[2] = 1;
        n[3] = 1;
        g_nElapsed += (time(NULL) - _1);
    }


    return 0;
}
namezero

1 Answer

3

You are creating and destroying many containers, and each node they hold is allocated with operator new. On many systems the default heap is protected by a lock, so typical small allocations like yours are serialized across threads. You are probably incurring quite a lot of inter-thread contention there, even though each container itself is thread-local.

You might try a different allocator, such as tcmalloc (http://goog-perftools.sourceforge.net/doc/tcmalloc.html). It is specifically designed to deal with this.

Another approach would be to use an object pool or other allocation strategy to avoid using the standard allocation mechanism completely. That would require some code changes, whereas using tcmalloc does not.

John Zwinck
    An object pool implementation of an allocator would require very few code changes, beyond writing the allocator ofc. – Alex Chamberlain Feb 26 '13 at 12:58
    Sure, but very few is a lot more than none! I'm not saying don't use object pools or whatever, they're a good idea as well. – John Zwinck Feb 26 '13 at 12:59
  • You are right. Implementing the second variant with UINT *nnn = new UINT[5] and deleting results in the same behavior as the maps and vectors. I can't believe I never considered new/delete to be the contended resource here. I will check into the tcmalloc library. – namezero Feb 26 '13 at 13:19
  • Object pools seem to be the wrong solution here because in the real program the map is created as a union of two maps (to eliminate duplicates), so with object pools there would still be plenty of contention I believe. – namezero Feb 26 '13 at 13:20
  • I'd just like to add I've settled for the scalable_allocator in Intel's TBB library. This seems to result in a _dramatic_ performance improvement! – namezero Feb 27 '13 at 18:37