
I wrote the following code, which constructs a distance matrix between each point and all the other points in the map dat[]. Although the code works correctly, its running time does not improve: it takes the same time whether I set the number of threads to 1 or to 10 on an 8-core machine. I'd appreciate it if anyone could help me find what is wrong in my code, and any suggestions for making it run faster would be very helpful too. Here is the code:

map<int, string>::iterator datIt;
map<int, map<int, double> > dist;
int mycont = 0;
datIt = dat.begin();
int size = dat.size();
omp_lock_t lock;
omp_init_lock(&lock);
#pragma omp parallel    // construct the distance matrix
{
    map<int, string>::iterator datItLocal = datIt;
    int lastIdx = 0;
    #pragma omp for
    for (int i = 0; i < size; i++)
    {
        // move the thread-local iterator forward to position i
        std::advance(datItLocal, i - lastIdx);
        lastIdx = i;
        map<int, string>::iterator datIt2 = datItLocal;
        datIt2++;
        while (datIt2 != dat.end())
        {
            double ecl = 0;
            int c = count((*datItLocal).second.begin(), (*datItLocal).second.end(), delm);
            string line1 = (*datItLocal).second;
            string line2 = (*datIt2).second;
            for (int j = 0; j < c; j++)   // j, not i: don't shadow the outer loop index
            {
                double num1 = atof(line1.substr(0, line1.find_first_of(delm)).c_str());
                line1 = line1.substr(line1.find_first_of(delm) + 1);
                double num2 = atof(line2.substr(0, line2.find_first_of(delm)).c_str());
                line2 = line2.substr(line2.find_first_of(delm) + 1);
                ecl += (num1 - num2) * (num1 - num2);
            }
            ecl = sqrt(ecl);
            omp_set_lock(&lock);
            dist[(*datItLocal).first][(*datIt2).first] = ecl;
            dist[(*datIt2).first][(*datItLocal).first] = ecl;
            omp_unset_lock(&lock);
            datIt2++;
        }
    }
}
omp_destroy_lock(&lock);
ildjarn
DOSMarter
  • How have you measured the running time? Please indent the code. – devil Jan 25 '12 at 09:56
  • It may seem a trivial question, but have you enabled OpenMP in compile time? If you don't enable it, all omp pragmas will be ignored. – saeedn Jan 25 '12 at 10:04
  • How long did the code run? Note that there is a minimum running time you need to get any kind of improvement from openmp – Grizzly Jan 25 '12 at 21:57
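Regarding the comment about enabling OpenMP at compile time: one hedged way to verify this (helper names here are illustrative, not from the original post) is to test the `_OPENMP` macro, which is defined only when the compiler's OpenMP switch (e.g. `g++ -fopenmp`) is actually on:

```cpp
#ifdef _OPENMP
#include <omp.h>
#endif

// If OpenMP was not enabled at compile time, every "#pragma omp" in
// the program is silently ignored and the code runs on one thread.
bool openmp_enabled()
{
#ifdef _OPENMP
    return true;
#else
    return false;
#endif
}

int max_threads()
{
#ifdef _OPENMP
    return omp_get_max_threads();   // thread count OpenMP will use
#else
    return 1;                       // pragmas ignored: single thread
#endif
}
```

If `openmp_enabled()` returns false, identical timings at 1 and 10 threads are exactly what you would expect.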

2 Answers


My guess is that using a single lock to protect 'dist' serializes your program. Option 1: consider a fine-grained locking strategy. You typically benefit from this when dist.size() is much larger than the number of threads.

map<int, omp_lock_t> locks;  // must be fully populated and each lock initialized before the parallel region
...
// always acquire the two locks in a fixed (ascending-key) order to avoid deadlock
int key1 = min((*datItLocal).first, (*datIt2).first);
int key2 = max((*datItLocal).first, (*datIt2).first);
omp_set_lock(&locks[key1]);
omp_set_lock(&locks[key2]);
dist[(*datItLocal).first][(*datIt2).first]=ecl;
dist[(*datIt2).first][(*datItLocal).first]=ecl;
omp_unset_lock(&locks[key2]);
omp_unset_lock(&locks[key1]);

Option 2: Your compiler might already implement the optimization mentioned in option 1, so you can try dropping your lock and using the built-in critical section:

   #pragma omp critical
   {
      dist[(*datItLocal).first][(*datIt2).first]=ecl;
      dist[(*datIt2).first][(*datItLocal).first]=ecl;
   }
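A third possibility, sketched here editorially (this function is illustrative and not code from the original post): if the map keys are first remapped to dense indices 0..n-1, the distance matrix can be a plain 2-D array, and each (i, j) cell then has exactly one writing thread, so no lock or critical section is needed at all.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Sketch under the assumption that points were pre-parsed into dense
// numeric rows (pts[i] holds the coordinates of point i). Every cell
// dist[i][j] is written by exactly one thread, so no locking is needed.
std::vector<std::vector<double> >
build_dist(const std::vector<std::vector<double> >& pts)
{
    const int n = static_cast<int>(pts.size());
    std::vector<std::vector<double> > dist(n, std::vector<double>(n, 0.0));

    // dynamic scheduling balances the shrinking inner loop across threads
    #pragma omp parallel for schedule(dynamic)
    for (int i = 0; i < n; i++)
    {
        for (int j = i + 1; j < n; j++)
        {
            double ecl = 0.0;
            for (std::size_t k = 0; k < pts[i].size(); k++)
            {
                const double d = pts[i][k] - pts[j][k];
                ecl += d * d;
            }
            ecl = std::sqrt(ecl);
            dist[i][j] = ecl;   // unique writer per cell: no lock
            dist[j][i] = ecl;
        }
    }
    return dist;
}
```

This also removes the repeated string parsing from the hot loop, which may matter as much as the locking.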
zr.
  • Your code doesn't work as written, since `operator[]` is a modifying operation on a map. Therefore it only works if you lock on the whole datastructure. Besides, is the compiler really allowed to do such an optimization for critical sections? I would assume it isn't. Even so, I don't think it's likely that the compiler would do such a thing (since it has no way of knowing how fine-grained it should lock). – Grizzly Jan 25 '12 at 22:01

I'm a bit unsure of exactly what you're trying to do with your loops, which look rather like they're going to do a quadratic nested loop over the map. Assuming that's intended, though, I think the following line will perform poorly when parallelised:

std::advance(datItLocal, i - lastIdx);

If OpenMP were disabled, that's advancing by one step each time, which is fine. But with OpenMP, there are going to be multiple threads doing chunks of that loop at random. So one of them might start at i=100000, so it has to advance 100000 steps through the map to begin. That might happen quite a lot if there are lots of threads being given relatively small chunks of the loop at a time. It might even be that you end up being memory/cache constrained since you're constantly having to walk over all of this presumably large map. It seems like this might be (part of) your culprit since it may get worse as more threads are available.

Fundamentally I guess I'm a bit suspicious of trying to parallelise iteration over a sequential data structure. You may know more about which parts of it really are slow or not though if you've profiled it.
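One workaround, sketched here as an editorial illustration (the helper name is hypothetical, not from the original post), is to take a single O(n) pass over the map up front and store pointers to its entries in a vector, so each thread reaches element i in O(1) instead of walking the map with std::advance:

```cpp
#include <map>
#include <string>
#include <utility>
#include <vector>

// Snapshot the map into a random-access vector of pointers to its
// entries. std::map nodes are stable, so the pointers remain valid
// as long as dat is not modified.
std::vector<const std::pair<const int, std::string>*>
snapshot(const std::map<int, std::string>& dat)
{
    std::vector<const std::pair<const int, std::string>*> v;
    v.reserve(dat.size());
    for (std::map<int, std::string>::const_iterator it = dat.begin();
         it != dat.end(); ++it)
        v.push_back(&*it);
    return v;
}
```

A `#pragma omp parallel for` over `v` then gives every thread constant-time access to `v[i]->first` and `v[i]->second`, provided `dat` is not mutated while the threads run.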

Peter
  • yes, I think you might be right, but in that case is there any play around? thanks – DOSMarter Jan 25 '12 at 10:45
  • @marioabdelmessih: I assume that part of the code comes [from this answer to another question of yours](http://stackoverflow.com/a/8897636/201270). Note that I explicitly mentioned there that it might have to much overhead and where to look for alternative ways to iterate over a map in case that it does. – Grizzly Jan 25 '12 at 21:53
  • One obvious alternative is to use a vector instead of a map, which would allow parallel iteration without all the overhead. There might be more overhead if you have to copy it in the first place though... I wonder if there is more to be gained here by avoiding a quadratic algorithm than by parallelisation, but I don't understand what it's doing well enough to say for sure that's possible or how it might be done. – Peter Jan 26 '12 at 10:46