Multithreading mystery in C++ with Boost

Question

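#include <ctime>
#include <iostream>
#include <vector>
#include <boost/thread.hpp>

using namespace std;

static void testlock()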
{
    for(int i=0;i<10000000;i++)
    {
        float f=2.0/i;
    }
}

static void TEST()
{
    cout<<"Start testing" <<endl;
    unsigned int startClock;

    for(int i=1;i<=10;i++)
    {
        startClock = clock();
        vector<boost::thread*> threads;
        for(int j=0;j<i;j++)
            threads.push_back(new boost::thread(&testlock));
        for(int j=0;j<i;j++)
        {
            threads[j]->join();
            delete threads[j];
        }
        cout << i << " threads: "<< clock()-startClock << endl;
    }
}

Output:

Start testing
1 threads: 180000
2 threads: 350000
3 threads: 540000
4 threads: 730000
5 threads: 900000
6 threads: 1080000
7 threads: 1260000
8 threads: 1510000
9 threads: 1660000
10 threads: 1810000

I'm running this code on a quad core PC (Core2Quad, 4 cores no hyperthreading) so I expected 1-4 threads to take about the same time. Instead it seems as if only one core is being used. What am I missing here?

Thanks

Update:

-I'm using Eclipse CDT under Ubuntu Linux

-I tried the same with Pthread and I get the same result

I could be wrong, but construction `threads[j]->join(); delete threads[j];` means that we have to wait until thread will be finished — Vitaly Dyatlov, Jun 19 '12 at 08:46
@VitalyDyatlov yes I start i threads at once then wait for them all to finish. The problem is that 4 threads should be executing at once but the output suggests otherwise. — Erwin J., Jun 19 '12 at 08:50
FWIW, on my 6-core Win7 the output is as expected: Start testing 1 threads: 62 2 threads: 63 3 threads: 62 4 threads: 63 5 threads: 62 6 threads: 62 7 threads: 125 8 threads: 125 9 threads: 124 10 threads: 125 — Igor R., Jun 19 '12 at 08:51
@ErwinJ problem is that `delete` wait until thread will be completed. So in a loop you wait for every single thread instead of running them simultaneously — Vitaly Dyatlov, Jun 19 '12 at 08:52
Check your process affinity: maybe it's forced to use one core? — Igor R., Jun 19 '12 at 08:53
this could be helpful: http://stackoverflow.com/questions/3344028/how-to-make-boostthread-group-execute-a-fixed-number-of-parallel-threads — Vitaly Dyatlov, Jun 19 '12 at 09:05
What operating system?? As Igor says, process affinity could be an issue. — Roddy, Jun 19 '12 at 09:06
i would like to remark that indeed this is mystery. i tried using a thread_group + join_all instead of a vector, that did not solve it. also i tried using int's instead of floats. — Willem Hengeveld, Jun 19 '12 at 09:06
@VitalyDyatlov, He starts all the threads, and then waits for all of them to finish. So they will be executing in parallel. Your linked question isn't really relevant. — Roddy, Jun 19 '12 at 09:08
@ErwinJ. In task manager, right-click on the task and select "Set Affinity..." You should see all of your cores, hopefully all selected. You aren't by any chance actually running in a single-core VM are you? I see (with 12 cores) pretty much constant times output for the ten test cases. — RobH, Jun 19 '12 at 09:16
@ErwinJ. Boost.Thread currently doesn't have such capability, so you have to use OS API. On Windows it looks like this: 1) http://msdn.microsoft.com/en-us/library/windows/desktop/ms686223(v=vs.85).aspx 2) http://msdn.microsoft.com/en-us/library/windows/desktop/ms686247(v=vs.85).aspx — Igor R., Jun 19 '12 at 09:19
@IgorR.: is this a comment because you're not sure? Otherwise, please make this an answer, so readers don't have to wade through comments to find the answer. — stefaanv, Jun 19 '12 at 09:31
@RobH I probably should have mentioned I'm using Ubuntu Linux — Erwin J., Jun 19 '12 at 09:32
@stefaanv it's just a guess - hopefully, ErwinJ will check whether it's right or wrong. — Igor R., Jun 19 '12 at 09:44
@ErwinJ. `taskset -p ` shows the processor affinity as a bit mask. — RobH, Jun 19 '12 at 10:01
@RobH Thanks. it says current affinity mask: f so that means all four processors i think. — Erwin J., Jun 19 '12 at 10:19
@ErwinJ. Yes, I think so. That's what I see in my Ubuntu VM. — RobH, Jun 19 '12 at 10:22
@ErwinJ also check that all the cores are active: grep MHz /proc/cpuinfo — Igor R., Jun 19 '12 at 10:29
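The affinity check suggested in the comments can also be done from inside the program. The following is only a sketch, assuming Linux with glibc (`sched_getaffinity`, `CPU_ZERO` and `CPU_COUNT` from `<sched.h>`); the helper names `allowed_cpu_count` and `affinity_allows_parallelism` are made up for illustration and are not part of Boost or of the original code.

```cpp
#include <cassert>
#include <sched.h>   // sched_getaffinity, cpu_set_t, CPU_* macros (Linux/glibc)

// Hypothetical helper: number of CPUs the calling thread is allowed to
// run on, or -1 if the affinity mask could not be read.
static int allowed_cpu_count()
{
    cpu_set_t mask;
    CPU_ZERO(&mask);
    // A pid of 0 means "the calling thread".
    if (sched_getaffinity(0, sizeof(mask), &mask) != 0)
        return -1;
    return CPU_COUNT(&mask);
}

// Hypothetical helper: true if at least `wanted` threads could run
// at the same time under the current affinity mask.
static bool affinity_allows_parallelism(int wanted)
{
    assert(wanted > 0);
    return allowed_cpu_count() >= wanted;
}
```

On the quad-core machine from the question, an affinity mask of `f` should correspond to `allowed_cpu_count()` returning 4, which would rule out affinity as the cause.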

score 3 · Accepted Answer · answered Jun 20 '12 at 03:36

A colleague of mine found the solution: clock() measures the CPU time used by the whole process, not elapsed time, so when two threads run at once it advances twice as fast as the wall clock. Timing with gettimeofday gave the expected result.
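To make the difference visible, here is a rough sketch that records both clocks around the same batch of threads. It is not the code from the question: it swaps in `std::thread` for `boost::thread` and `std::chrono::steady_clock` for `gettimeofday` so that it only needs the standard library (C++11), and the names `busy_work`, `Timing` and `time_threads` are made up for illustration.

```cpp
#include <cassert>
#include <chrono>
#include <ctime>
#include <thread>
#include <vector>

// Hypothetical stand-in for testlock(). The volatile sink keeps the
// compiler from discarding the loop.
static void busy_work()
{
    volatile float sink = 0.0f;
    for (int i = 1; i <= 10000000; i++)
        sink = 2.0f / i;
}

// Hypothetical result type holding both measurements.
struct Timing
{
    double cpu_seconds;   // from clock(): CPU time of the whole process
    double wall_seconds;  // from steady_clock: elapsed real time
};

// Starts thread_count threads, joins them all, and reports how much
// each clock advanced in the meantime.
static Timing time_threads(int thread_count)
{
    assert(thread_count > 0);
    const std::clock_t cpu_start = std::clock();
    const auto wall_start = std::chrono::steady_clock::now();

    std::vector<std::thread> threads;
    for (int j = 0; j < thread_count; j++)
        threads.emplace_back(busy_work);
    for (auto& t : threads)
        t.join();

    const std::clock_t cpu_end = std::clock();
    const auto wall_end = std::chrono::steady_clock::now();

    Timing result;
    result.cpu_seconds = static_cast<double>(cpu_end - cpu_start) / CLOCKS_PER_SEC;
    result.wall_seconds =
        std::chrono::duration<double>(wall_end - wall_start).count();
    return result;
}
```

Calling `time_threads(i)` for i from 1 to 4 on a four-core Linux machine should show `cpu_seconds` growing roughly in proportion to i while `wall_seconds` stays about the same. On older glibc versions the program may need to be linked with `-pthread`.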

Erwin J.

This also explains the Windows discrepancies, as clock() under Windows measures wall-clock time. – ergosys Jun 20 '12 at 03:49

Karoly Horvath · Answer 2 · 2012-06-19T16:11:42.283

1

~~First of all, with i=0 2.0/i is dividing by zero~~ (Sorry: as Igor correctly noted in the comments, this is valid in floating-point arithmetic and, in this case, results in +infinity.)
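A small sketch of that first iteration, assuming the platform uses IEEE 754 (IEC 559) floating point, which the `static_assert` checks. The C++ standard itself leaves division by zero undefined, so the result is a property of the IEEE arithmetic rather than a language guarantee. The function name `first_iteration_value` is made up for illustration.

```cpp
#include <cassert>
#include <cmath>
#include <limits>

// Hypothetical helper reproducing the i = 0 iteration of testlock().
static float first_iteration_value()
{
    static_assert(std::numeric_limits<double>::is_iec559,
                  "this sketch assumes IEEE 754 doubles");
    // volatile forces the division to happen at run time instead of
    // being folded (and possibly diagnosed) by the compiler.
    volatile int i = 0;
    float f = 2.0 / i;       // i is converted to double: 2.0 / 0.0
    assert(std::isinf(f));   // positive infinity under IEEE 754
    return f;
}
```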

Secondly, even if you fix that, your testlock function is probably going to be optimized to nothing, as the result is never used.

So at the moment you're just measuring the overhead of creating and joining threads; that's why the increase is linear.
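One way to take dead-code elimination out of the picture, whether or not a given compiler actually performs it here, is to make the loop's result observable. This is only a sketch: `testlock_observable` is a made-up variant of `testlock`, it starts at i = 1 to sidestep the division by zero, and writing through a `volatile` variable is one option among several (returning or printing an accumulated value would also work).

```cpp
#include <cassert>

// Hypothetical variant of testlock(): every store to the volatile sink
// is an observable side effect, so the loop cannot be removed.
static float testlock_observable(int iterations)
{
    assert(iterations > 0);
    volatile float sink = 0.0f;
    for (int i = 1; i <= iterations; i++)
        sink = 2.0f / i;
    return sink;   // value from the last iteration, 2 / iterations
}
```

If timings with a function like this still grow linearly with the number of threads, the cause is something other than the optimizer.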

edited Jun 19 '12 at 16:11

answered Jun 19 '12 at 09:39


Toy code with bugs is not a good way to measure anything and expect it to have real world applicability. – David Schwartz Jun 19 '12 at 09:56
You're mistaken. For IEEE floats, division of a finite nonzero float by 0 is well-defined and results in +infinity. Besides, testlock is not "optimized to nothing", at least not with MSVC10 (at any level of optimization). – Igor R. Jun 19 '12 at 09:59
Good point about the division by zero; not sure why that doesn't cause problems. But the code doesn't get optimized away: I already checked that it takes 10 times as long when I increase the loop to 100000000 repetitions. – Erwin J. Jun 19 '12 at 10:00
@Igor R.: thanks (&upvoted), I didn't know that. well, in that case the msvc compiler is really really dumb... gcc definitely doesn't emit any code for that if optimisation is enabled. – Karoly Horvath Jun 19 '12 at 10:04
@Karoly Horvath just out of curiosity, what gcc version have you used with this code? – Igor R. Jun 19 '12 at 10:12
@Igor R.: I knew that gcc does these kind of optimizations, I wrote my answer without trying it. But now I compiled it just for you... 4.4.6-3 (that's the oldest I had here), but older versions should do the same thing. – Karoly Horvath Jun 19 '12 at 10:33
