Last threads execute slower than first threads while writing to an array

Question

I'm trying to optimize a Mandelbrot set generator, the problem is that i am trying to make it multi threaded by using the _beginthread() function. The computing problem I'm solving is running a function on a 2D plane, I am trying to run about 8 threads at the same time, each of them computing a portion (row) of the 2D array, but I notice that the first threads that finish, finish a lot faster that the last threads that finish. this is the output:

Starting thread 0
Starting thread 1
Starting thread 2
Starting thread 3
Starting thread 4
Starting thread 5
Starting thread 6
Starting thread 7
Ending thread   0 - Time taken: 1062ms
Ending thread   7 - Time taken: 1031ms
Ending thread   1 - Time taken: 1610ms
Ending thread   6 - Time taken: 1563ms
Ending thread   2 - Time taken: 10265ms
Ending thread   5 - Time taken: 10219ms
Ending thread   4 - Time taken: 31609ms
Ending thread   3 - Time taken: 31641ms

Every thread has the same thing to do, but with different numbers, I don't understand why I get those times This is how I multithreaded this:

#define HEIGHT 4000
#define WIDTH 4000
#define MAX_THREADS 8
int const maxIterations = 150;

int bitmap[HEIGHT][WIDTH];
bool finishedThreads[MAX_THREADS];

void renderRow(void * arg) {
    int startTime = GetTickCount();
    int * threadNumPinter = (int*)arg;
    int threadNum = *threadNumPinter;
    int startRow = threadNum * (HEIGHT / MAX_THREADS);
    for (int y = startRow; y <= startRow+(HEIGHT / MAX_THREADS); y++) {
        for (int x = 0; x <= WIDTH; x++) {
            double xx = (((double)x / (double)WIDTH) * 4.0) - 2.0;
            double yy = (((double)y / (double)HEIGHT) * 4.0) - 2.0;
            bitmap[x][y] = isPartOfSet(xx, yy) * 10;
        }
    }
    threadNum = startRow / (HEIGHT / MAX_THREADS);
    finishedThreads[threadNum] = true;
    cout << "Ending thread " << threadNum << " - Time: " << GetTickCount() - startTime << "ms" << endl;
    _endthread();
}


int main() {
    int startTime = GetTickCount();
    HANDLE hThread;
    HANDLE ghEvents[2];
    DWORD dwThreadID;
    int rowsPerThread = HEIGHT / MAX_THREADS;
    int arg;
    int threadIds[MAX_THREADS];
    for (int i = 0; i < MAX_THREADS; i ++) {
        threadIds[i] = i;
        cout << "Starting thread " << i << endl;
        arg = i;
        _beginthread(renderRow, 0, &threadIds[i]);
        Sleep(10);
    }
    bool done = true;//Wait for all threads to finish
    while (1) {
        for (int i = 0; i < MAX_THREADS; i++){
            if (finishedThreads[i] == false)done = false;
        }
        if (done == true) break;
        else done = true;
        Sleep(20);
    }
    saveBitmap(WIDTH, HEIGHT);
    cout << endl << "Rendered in " << double(GetTickCount() - startTime) / 1000.0 << " seconds" << endl;
    cin.get();
    main();
}

There is obviously more code than that, but I don't think it has any effect on the issue. What am I doing wrong here? I had the same issue on CUDA, so I believe it's how I'm implementing mutithreading. Thanks.

You are accessing the 2D array in a cache-unfriendly manner. (It's also confusing width and height.) — Raymond Chen, Jan 06 '16 at 21:52
Christophe: It's an i7 920, and the hardware_concurrency() function returns 8. deviantfan: I tried to synchronize the threads, but I lack experience with multithreading and it went wrong Nick: Didn't notice that :P — Bruno, Jan 06 '16 at 21:56
As Raymond stated, you're indexing may be the culprit, try swapping x & y in your inner loop so see if the timing changes. — Nick, Jan 06 '16 at 22:02
I swapped the x and y for loops, I still have the same times and a similar order of exiting threads (0,7,1,6,2,5,3,4) — Bruno, Jan 06 '16 at 22:06
Can you fix you undefined behavior (`x <= WIDTH` means writing outside of array bounds) and see if it changes the result? — The Dark, Jan 06 '16 at 23:13
The i7-920 has only 4 physical cores. (It has 8 virtual cores.) — Raymond Chen, Jan 06 '16 at 23:52

ThorngardSO · Accepted Answer · 2016-01-07T00:27:18.620

In my answer, I'll neither address threading/synchronizing issues or thoughts on caching -- see the other answers/comments for that.

My point is a different one: you write that "Every thread has the same thing to do, but with different numbers". If my memory on the mandelbrot set serves me right, determining whether a point is a member of the set (IOW the implementation of your isPartOfSet function, which you did not provide) is an iterative process. Some points "bail out" quickly, some points dont and you have to keep on iterating until your predefined maximum-numer-of-iters.

So what I'm saying is: with your "one-large-block-per-thread" parallelization, it's probably natural that your threads take significantly different amounts of time.

The solution to this kind of issue is to split the problem (i.e. the image) into smaller portions, the size of which is not dependent on the number of threads, but should be chosen empirically to be a) not too large to prevent unequal work distribution (as in your example with the huge blocks) and b) not so small as to cause excessive organizational overhead.

So now, you have M threads and N chunks of work (with N>>M), and you need an implementation that lets each thread work in a loop like

while (worktodo) fetch_a_chunk_of_work_and_do_it ()

How this kind of producer/consumer pattern is implemented -- I'll leave that for others to describe (or for you to google :-))

THANK YOU! I cant believe I missed that! I noticed when I made a "row counter" that it started off very quickly and it got slower and slower (at mid point) and got faster and faster, and I Immediately thought, "of course, it's because of all the iterations it has to go through while it's in the middle!" But I didn't think of this when I multi threaded it! Thank you very much! :D and that's because why 4 and 3 take the most amount of time and 0 and 7 the least amount, in fact you can see how it's distributed in pairs, because it's symmetric... — Bruno, Jan 07 '16 at 01:48

score 1 · Answer 2 · answered Jan 06 '16 at 22:42

1

Classic example of incorret concurrent usage of global variable.

bool finishedThreads[MAX_THREADS];

is global, and is accessed from multiples threads (written/read). You can not expect this to work. For your case, you shouldn't even use this variable. Instead, you should be waiting on events of thread completion.

answered Jan 06 '16 at 22:42

SergeyA

61,605
5
78
137

I just wrote that to know when threads have finished and it works, it's just a 8 bool array, If writing to global from multiple threads is a problem, then the issue can be writing to bitmap[x][y], every thread is writing 4,000,000 bytes for with and height = 4000 and bitmap[x][y] is global... But I still don't think that is the issue, because if I comment out `bitmap[x][y] = isPartOfSet(xx, yy) * 10;` and make that line `isPartOfSet(xx, yy) * 10;` the exact same thing happens – Bruno Jan 06 '16 at 22:56
@Bruno `it works` No it does not, and if it works in *your case*, you're just lucky. – deviantfan Jan 07 '16 at 08:01

score 0 · Answer 3 · edited Jun 20 '20 at 09:12

Hard coding to 8 threads is terrible, what about some user's dual core laptop? std::thread::hardware_concurrency.

Sleep is awful. Your spin-loop is absolutely NOT the right way to do it. Sorry, just being honest.

Use std::thread and use join to wait for them to complete. Even better: do all but one work item on other threads, do one on the main thread, then join the other ones. If there are N CPUs then you should create N-1 threads and do one item on the main thread.

Why use Windows-only APIs, when there are drastically better standard C++ library classes?

Suggested way to avoid `Sleep`

If simply waiting for the thread to exit is not sufficient (using join mentioned above), in a more complex scenario, then you should use std::mutex, std::unique_lock, and std::condition_variable.

You should have a variable that is set to true when the notification happens. In the code that waits, you acquire the mutex, check that flag, and if it isn't set, you call wait on the condition variable.

In the thread that notifies the other thread, you acquire the mutex, set the flag variable I mentioned, use the notify_one or notify_all method on the condition variable.

Have a look over this reference on cppreference. The main ones you use are the ones I already mentioned, though.

Ok, i will try to use the std::thread library, and the hard coded 8 thread was just a temporal thing to try my code, I was going to make it dynamic. Also, what do you recommend instead of Sleep()? Yhea, i'm very new to mulithreading :/ I should try to learn more about a topic before asking a question about that on stackoverflow. — Bruno, Jan 06 '16 at 23:05
@Bruno I expanded my answer to be more specific on ways to avoid `Sleep` and spin loops. I hope it is more helpful. — doug65536, Jan 10 '16 at 06:13

Last threads execute slower than first threads while writing to an array

3 Answers3

Suggested way to avoid Sleep

Suggested way to avoid `Sleep`