
I have a program with the general structure shown below. Basically, I have a vector of objects. Each object has member vectors, and one of those is a vector of structs that contain more vectors. Using multiple threads, the objects are operated on in parallel, doing computation that involves a lot of accessing and modifying of member vector elements. One object is accessed by only one thread at a time, and is copied to that thread's stack for processing.

The problem is that the program fails to scale up to 16 cores. I suspect, and have been advised, that the issue may be false sharing and/or cache invalidation. If this is true, it seems the cause must be the vectors allocating memory too close to each other, since my understanding is that both problems are (in simple terms) caused by nearby memory addresses being accessed simultaneously by different processors. Does this reasoning make sense? Is it likely that this could happen? If so, it seems I could solve the problem by padding the member vectors with .reserve() to add extra capacity, leaving large gaps of unused memory between the vectors' arrays. So, does all this make any sense? Am I totally out to lunch here?

#include <pthread.h>
#include <vector>
using std::vector;

struct str{
    vector<float> a;
    vector<int>   b;
    vector<bool>  c;
};

class object{
public:
    vector<str>   a;
    vector<int>   b;
    vector<float> c;
    //more vectors, etc ...
    void DoWork();               //heavy use of the member vectors
};

int queued = 0;                  //imagine this is a thread-safe global counter
void* Consumer(void* argument);

int main(){
    vector<object> objs;
    vector<object>* p_objs = &objs;   //pointer to the shared vector, passed to every thread

    //...fill objs, make `thread_list` and `attr`
    for(int q=0; q<NUM_THREADS; q++)
        pthread_create(&thread_list[q], &attr, Consumer, p_objs);
    //...join the threads
}

void* Consumer(void* argument){
    vector<object>* p_objs = (vector<object>*) argument;
    while(1){
        int index = queued++;             //grab the next object index
        object obj = (*p_objs)[index];    //copy the object onto this thread's stack
        obj.DoWork();
        (*p_objs)[index] = obj;           //copy the result back
    }
    return NULL;
}
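For illustration, here is a minimal sketch of the effect I'm asking about, separate from my actual program (the struct names, the iteration count, and the 64-byte line size are assumptions): two threads incrementing counters that sit in the same cache line, versus counters padded onto separate lines.

#include <pthread.h>
#include <time.h>
#include <cstdio>

struct Unpadded { volatile long a; volatile long b; };                //a and b share a cache line
struct Padded   { volatile long a; char pad[64]; volatile long b; };  //a and b forced onto separate lines

Unpadded u = {0, 0};
Padded   p = {0, {0}, 0};
const long ITERS = 200000000L;

void* bump_ua(void*){ for(long i=0; i<ITERS; ++i) u.a++; return NULL; }
void* bump_ub(void*){ for(long i=0; i<ITERS; ++i) u.b++; return NULL; }
void* bump_pa(void*){ for(long i=0; i<ITERS; ++i) p.a++; return NULL; }
void* bump_pb(void*){ for(long i=0; i<ITERS; ++i) p.b++; return NULL; }

double seconds(){
    timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);   //older glibc may need -lrt for this
    return ts.tv_sec + ts.tv_nsec*1e-9;
}

double run_pair(void* (*f)(void*), void* (*g)(void*)){
    pthread_t t1, t2;
    double start = seconds();
    pthread_create(&t1, NULL, f, NULL);
    pthread_create(&t2, NULL, g, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return seconds() - start;
}

int main(){
    std::printf("same line:      %.2f s\n", run_pair(bump_ua, bump_ub));
    std::printf("separate lines: %.2f s\n", run_pair(bump_pa, bump_pb));
    return 0;
}

Compiled with -pthread, the "same line" run should take noticeably longer than the padded one if false sharing is in play on the machine.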
Matt Munson
  • @Alexander I will, but I'm also interested in the idea from a theoretical perspective, because I'm pretty shaky on concepts like false sharing, cache invalidation, and memory in general. In other words, I'm trying to check my conception of the situation. – Matt Munson Dec 16 '11 at 08:57
  • The code you provide is not valid C++. How exactly does it not scale up to 16 cores? Is it simply slower? If so, how much slower? Is it specifically 16 cores or does it not scale to 8, 4, or 2 cores either? – David Brown Dec 16 '11 at 09:38
  • @David The above is intended only as pseudocode. By failing to scale I mean that it does not run *near* 16X as fast with 16 cores. Scaling gets progressively worse from ~4 cores on (at least as measurable). To clarify, I'm more interested in the concept, and not in my code specifically. – Matt Munson Dec 16 '11 at 10:08
  • Are we talking about a NUMA machine here? – Sebastian Dec 16 '11 at 10:33
  • @macs yes. Specifically, two octo core Intel Xeon processors. More specifically, http://aws.amazon.com/hpc-applications/ – Matt Munson Dec 16 '11 at 10:52

1 Answer


Well, the last vector copied in thread 0 is objs[0].c. The first vector copied in thread 1 is objs[1].a[0].a. So if their two blocks of allocated data happen to both occupy the same cache line (64 bytes, or whatever it actually is for that CPU), you'd have false sharing.

And of course the same is true of any two vectors involved, but just for the sake of a concrete example I have pretended that thread 0 runs first and does its allocation before thread 1 starts allocating, and that the allocator tends to make consecutive allocations adjacent.
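If you want to check whether your allocator actually does that, here's a quick sketch (the 64-byte figure is an assumption, and it's the element addresses like `&v[0]` that matter, not the addresses of the vector objects themselves):

#include <cstddef>
#include <stdint.h>
#include <cstdio>
#include <vector>

// Rough check: do two heap blocks start in the same cache line?
// 64 bytes is an assumption; on Linux you can query the real value with
// sysconf(_SC_LEVEL1_DCACHE_LINESIZE).
bool same_cache_line(const void* p, const void* q, std::size_t line = 64){
    return reinterpret_cast<uintptr_t>(p)/line == reinterpret_cast<uintptr_t>(q)/line;
}

int main(){
    std::vector<float> v1(3);   // small vectors created back to back are the
    std::vector<int>   v2(3);   // likeliest candidates for adjacent blocks
    std::printf("v1 block: %p\nv2 block: %p\nsame line: %d\n",
                (void*)&v1[0], (void*)&v2[0],
                (int)same_cache_line(&v1[0], &v2[0]));
    return 0;
}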

reserve() might keep the parts of the block that you're actually acting on from occupying the same cache line. Another option would be per-thread memory allocation -- if those vectors' blocks are allocated from different pools, then they can't possibly occupy the same line unless the pools do.
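As a sketch of what that reserve() padding could look like (the helper name is mine, and one cache line of slack is an assumption; how much you actually need depends on the allocator):

#include <cstddef>
#include <vector>

// Leave at least one (assumed 64-byte) cache line of spare capacity at the
// end of a vector's block, so the elements a thread actually touches are less
// likely to share a line with the next allocation. Note the slack is only at
// the tail: the head of the block can still share a line with the tail of the
// previous allocation.
template <typename T>
void pad_capacity(std::vector<T>& v, std::size_t line_bytes = 64){
    std::size_t extra = (line_bytes + sizeof(T) - 1) / sizeof(T);  // elements per line, rounded up
    v.reserve(v.size() + extra);
}

You'd call it on each member vector before the worker threads start; whether it actually helps depends on where the allocator places the enlarged blocks.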

If you don't have per-thread allocators, the problem could be contention on the memory allocator, if DoWork reallocates the vectors a lot. Or it could be contention on any other shared resource used by DoWork. Basically, imagine that each thread spends 1/K of its time doing something that requires global exclusive access. Then it might appear to parallelize reasonably well up to a certain number J <= K, at which point acquiring the exclusive access starts to eat into the speed-up because cores spend a significant proportion of their time idle. Beyond K cores there's approximately no improvement at all from extra cores, because the shared resource cannot work any faster.
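To put rough numbers on that (this is just Amdahl-style arithmetic with an assumed serial fraction, not a measurement of DoWork): if a fraction s of each task needs exclusive access, N cores give at most a speed-up of 1 / (s + (1 - s)/N).

#include <cstdio>

// Best-case speed-up when a fraction s of the work is serialized.
double max_speedup(double s, int cores){
    return 1.0 / (s + (1.0 - s)/cores);
}

int main(){
    const double s = 1.0/16;   // assumed: ~6% of each task holds a global resource
    for(int n = 1; n <= 32; n *= 2)
        std::printf("%2d cores -> at most %.1fx\n", n, max_speedup(s, n));
    return 0;
}

Even with s = 1/16 the curve flattens to roughly 8x at 16 cores, which is roughly the shape of fall-off from ~4 cores described in the question.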

At the absurd end of this, imagine some work that spends 1/K of its time holding a global lock, and (K-1)/K of its time waiting on I/O. Then the problem appears to be embarrassingly parallel almost up to K threads (irrespective of the number of cores), at which point it stops dead.

So, don't focus on false sharing until you've ruled out true sharing ;-)

Steve Jessop
  • Ok, so the allocator *doesn't* tend to make consecutive allocations adjacent, right? So then is it possible to estimate the probability of two vectors being allocated adjacently? Also, how can I do per-thread memory allocation? – Matt Munson Dec 16 '11 at 11:15
  • Possibly it does; you can test that just by looking at the addresses `&v[0]` of two non-empty vectors created consecutively, and estimate the proportion of nearby vector blocks for your actual code the same way: just look at the addresses. Configuring the memory allocator is implementation-specific. I don't know whether Amazon's environment has per-thread pools by default, and if not, why not and what to do about it, but typically you link against a library with a different `malloc`/`free` implementation. – Steve Jessop Dec 16 '11 at 11:23
  • Would I have to modify the code, or would it just be a matter of having the library and setting the g++ command to link to it? Previously, I tried implementing hoard and tcmalloc with no luck whatsoever; are those basically what you are referring to? I'm just running Ubuntu on Amazon right now, so would I be able to set it in the environment, possibly? – Matt Munson Dec 16 '11 at 11:30
  • @Matt: Yes, tcmalloc is an example of what I'm talking about, although I'm not sure what it considers "large" objects and whether your vectors would therefore count as "small" enough to benefit. But I'd hope that the larger the vectors, the less activity `DoWork` does at the very ends of the blocks, and thus the less false sharing if blocks are adjacent. So hopefully the large ones don't need help. If you can't get that to link then I'm out of ideas, sorry. – Steve Jessop Dec 16 '11 at 12:01
  • Alright, cool. It seems very backwards to me that implementing per-thread heaps is not more straightforward. Presumably it's not a serious technical challenge, relatively speaking. It should really be a feature of pthreads. – Matt Munson Dec 16 '11 at 12:48