What can prevent multiprocessing from improving speed - OpenMP?

Question

I am scanning through every permutation of some vectors and I would like to multithread this process (each thread would scan all the permutations of some of the vectors). I managed to extract the code that does not speed up (I know it does not do anything useful, but it reproduces my problem).

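#include <algorithm> // std::sort, std::next_permutation
#include <string>
#include <vector>
#include <omp.h>

int main(int argc, char *argv[]){

    std::vector<std::string *> myVector;
    for(int i = 0 ; i < 8 ; ++i){
        myVector.push_back(new std::string("myString" + std::to_string(i)));
    }
    // sorts the pointers (by address) so that next_permutation starts from the first ordering
    std::sort(myVector.begin(), myVector.end());

    omp_set_dynamic(0);     // do not let the runtime adjust the number of threads
    omp_set_num_threads(8);
#pragma omp parallel for shared(myVector)
    for(int i = 0 ; i < 100 ; ++i){

        std::vector<std::string*> test(myVector); // each iteration works on its own copy of the 8 pointers
        do{ //here is a permutation
        } while(std::next_permutation(test.begin(), test.end())); // tests all the permutations of this combination

    }
    return 0;
}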

The results are:

1 thread : 15 seconds
2 threads : 8 seconds
4 threads : 15 seconds
8 threads : 18 seconds
16 threads : 20 seconds

I am working with an i7 processor with 8 cores. I can't understand how it could be slower with 8 threads than with 1. I don't think the cost of creating new threads is higher than the cost of going through 40320 permutations, so what is happening?

Post real code, or profile it. You do nothing in the loop, and setting up a thread team and running the loop in parallel has its overhead. Also, do you compile with optimizations on? You also do a very, very weird thing. Why do you make copies of pointers in each loop?! — luk32, Jun 03 '15 at 09:22
I know there is nothing in the loop but the time measurements that I provide have been calculated with this code below. The fact that there is 40320 permutations to go through for each iteration is enough to see differences between the number of threads. (And yes, optimization is on, otherwise it would not go faster with two threads) — Arcyno, Jun 03 '15 at 09:25
"And yes, optimization is on, otherwise it would not go faster with two threads" - that is a false presumption, there is no logical link there. " The fact that there is 40320 permutations to go through for each iteration is enough to see differences between the number of threads." - prove it. I say 40320 times nothing to do takes 0 time. — luk32, Jun 03 '15 at 09:27
@luk32 While I agree with your reasoning, if we look at the timings, we can see that there is a fair amount of work being done. Each iteration should take ~150 ms. Be that as it may, Arcyno, it would help others that are trying to reproduce the exact problem if you would post the code that does the work. — Avi Ginsburg, Jun 03 '15 at 09:30
@luk32 You might be right about that... Maybe the amount of work in the loop is not enough to compensate for the creation of threads. I will try with some real work inside the `do` — Arcyno, Jun 03 '15 at 09:32
You could check if timings change when you change the size of vector, if they do, then obviously amount of work inside the loop changes. @Avi is also right. That is why I asked about optimizations. I think compiler should be able to deduce that the loop is safe to be removed, but I don't think that copying 8 pointers should take 150ms. I also thought that you might have removed some code, that could potentially influence the result, like some memory accessing, but you say you didn't. — luk32, Jun 03 '15 at 09:34
I think you might want to check your optimization settings, since with optimizations off, I get times comparable to yours. With them on, I get times close to zero (I think my optimizer is eliminating a lot of that symmetrical logic that isn't causing side effects and isn't really doing anything except permuting temp data that is only going to be discarded right away). — , Jun 03 '15 at 09:39
You are right, optimization is off. I thought he meant `\openMp`... Sorry for the confusion — Arcyno, Jun 03 '15 at 09:40
With compiler/linker optimization settings on, perhaps you'll see OMP making a beneficial difference with more threads if you have a test case that the optimizer doesn't obliterate to shreds. — , Jun 03 '15 at 09:41
You should make the loop do some work, and profile with optimizations on. The C++ standard library is quite slow without them, so there is no real point in analysing performance without optimizations. There is a good chance you will get the expected speed-ups. — luk32, Jun 03 '15 at 09:42
I ran this with a profiler and it appears that most of the time is spent in `std::_Lockit`. Any idea what this is? — Arcyno, Jun 03 '15 at 09:45
@Arcyno Look [here](http://stackoverflow.com/questions/16770179/what-std-lockit-does). Note Kerrek SB's comment. Don't waste your time profiling debug/non-optimized code. — Avi Ginsburg, Jun 03 '15 at 09:47
Also when profiling superficial code, you have to be careful to make sure the optimizer isn't actually optimizing away the logic. I've gotten bitten by this a couple of times where I was like, "OMG, I sped it up 10,000x!" --"Doh, no wait, the disassembly shows that the optimizer just skipped the work outright since it noticed it wasn't actually doing anything" (not causing any side effects). — , Jun 03 '15 at 09:58
Thanks all for your help. It seems the real problem was `std::_Lockit`, because now the speed-up works. And with optimization on, it's always better! ;) — Arcyno, Jun 03 '15 at 10:01
I should write an answer for this post if it could help someone.. — Arcyno, Jun 03 '15 at 10:02
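
The point raised in the comments, that an optimizer may discard a loop with no side effects, can be illustrated with a variant of the test that produces a result. This is only a sketch: the helper names `count_permutations` and `run_benchmark` are made up for illustration, plain `int` values stand in for the string pointers, and the pragma is simply ignored when OpenMP is not enabled.

```cpp
#include <algorithm>
#include <cassert>
#include <numeric>
#include <vector>

// Hypothetical helper: walks every permutation of n distinct values and
// returns how many were visited, so the loop produces an observable result.
long long count_permutations(int n) {
    assert(n >= 0 && n <= 12);                 // keep the sketch cheap to run
    std::vector<int> items(n);
    std::iota(items.begin(), items.end(), 0);  // sorted start: 0, 1, ..., n-1
    long long count = 0;
    do {
        ++count;                               // stand-in for real per-permutation work
    } while (std::next_permutation(items.begin(), items.end()));
    return count;
}

// Hypothetical benchmark body: the same shape as the loop in the question,
// but each iteration contributes to a reduction that the caller can inspect.
long long run_benchmark(int iterations, int n) {
    long long total = 0;
#pragma omp parallel for reduction(+ : total)
    for (int i = 0; i < iterations; ++i) {
        total += count_permutations(n);
    }
    return total;
}
```

Provided the caller prints or otherwise uses the return value of `run_benchmark(100, 8)`, the work is less likely to be removed in an optimized build, and the call can be timed with `omp_get_wtime()` or `std::chrono` for each thread count.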

score 0 · Accepted Answer · answered Jun 03 '15 at 10:05

Thanks to everyone's help, I finally managed to find the answer:

There were two problems:

1. A quick profiling run showed that most of the time was spent in `std::_Lockit`, which Visual Studio uses for iterator debugging. To prevent that, add these compiler options: /D "_HAS_ITERATOR_DEBUGGING=0" /D "_SECURE_SCL=0". That was why adding more threads cost time: the threads were most likely contending for that lock instead of running in parallel.
2. Optimization was switched off. Switching it on improved the performance further.
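
To check whether a given build still has iterator debugging enabled, something like the following sketch may help. It assumes a Visual Studio version that defines `_ITERATOR_DEBUG_LEVEL` (the newer macro that supersedes `_HAS_ITERATOR_DEBUGGING` and `_SECURE_SCL`); the function names are hypothetical, and other standard libraries simply report that the macro is not defined.

```cpp
#include <cassert>
#include <vector>  // any standard header, so the library's configuration macros are set

// Hypothetical helper: returns the MSVC iterator debug level for this build,
// or -1 when the macro is not defined (for example with GCC or Clang).
int iterator_debug_level() {
#ifdef _ITERATOR_DEBUG_LEVEL
    return _ITERATOR_DEBUG_LEVEL;
#else
    return -1;
#endif
}

// Hypothetical helper: turns the level into a short description.
const char* describe_iterator_debugging() {
    const int level = iterator_debug_level();
    assert(level >= -1 && level <= 2);
    switch (level) {
        case 0:  return "iterator checks disabled";
        case 1:  return "checked iterators enabled";
        case 2:  return "iterator debugging enabled (debug default)";
        default: return "_ITERATOR_DEBUG_LEVEL not defined";
    }
}
```

Printing `describe_iterator_debugging()` at the start of a benchmark is one way to confirm that the timings come from a build without the debugging machinery.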
