
I'm just starting to use OpenMP and am writing a function that divides an array into numBlocks blocks and computes a histogram for every block (i.e., one histogram per block) by inspecting the blockSize elements of each block. In the code I'm providing, each block's histogram records how many of the block's elements are divisible by each of the integers 1 to numBuckets.

In the first version, the omp for loop parallelizes over blocks, so each block is processed by a single thread:

#pragma omp for schedule(static)
for(uint blockNum = 0; blockNum < numBlocks; blockNum++){
    for(uint blockSubIdx = 0; blockSubIdx < blockSize; blockSubIdx++){
        uint idx = blockNum * blockSize + blockSubIdx;
        // Compute histogram here by examining array[idx]
    }
}
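
To make the elided body concrete, it is essentially the following (a sketch only; `histograms` is a placeholder name for a numBlocks-by-numBuckets table of counts, zero-initialized before the loop):

// Sketch of the per-block divisibility counts; `histograms` is assumed to be
// a numBlocks x numBuckets array of counters, e.g.
// std::vector<std::vector<uint>> histograms(numBlocks, std::vector<uint>(numBuckets, 0));
for(uint d = 1; d <= numBuckets; d++){
    if(array[idx] % d == 0)
        histograms[blockNum][d - 1]++;  // array[idx] is divisible by d
}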

Implementing it another way, I parallelize over the elements and ask each thread to work on contiguous chunks of blockSize elements; I have previously asserted that array.size() == numBlocks * blockSize:

#pragma omp for schedule(static, blockSize)
for(uint idx = 0; idx < array.size(); idx++){
    uint blockNum = idx / blockSize;
    // Compute histogram here by examining array[idx]
}

This second method does not work correctly if I increase the number of threads (using export OMP_NUM_THREADS) above some threshold (78 on my compute node): the resulting histogram values do not match those from a serial computation.

According to https://computing.llnl.gov/tutorials/openMP/, the chunks of size blockSize handed out by schedule(static, blockSize) in the second method are contiguous, so each block should be touched by exactly one thread, and I'm not clear on why it's failing. It does seem as though multiple threads are writing to the same index in the histogram. Is there a subtlety that I'm missing?
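
To test the suspicion that more than one thread touches a given block, here is a small diagnostic I could add (my own sketch, not part of the gist; `blockOwner` is a made-up name, and <omp.h>, <cstdio>, and <vector> are assumed to be included):

std::vector<int> blockOwner(numBlocks, -1);
#pragma omp parallel for schedule(static, blockSize)
for(uint idx = 0; idx < array.size(); idx++){
    uint blockNum = idx / blockSize;
    int tid = omp_get_thread_num();
    // Remember the first thread seen for each block and flag any other thread
    // that later touches the same block. (Good enough as a diagnostic, though
    // the check itself is not synchronized.)
    if(blockOwner[blockNum] == -1)
        blockOwner[blockNum] = tid;
    else if(blockOwner[blockNum] != tid)
        printf("block %u touched by threads %d and %d\n",
               blockNum, blockOwner[blockNum], tid);
    // Compute histogram here by examining array[idx]
}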

Here is a gist of the whole source code: https://gist.github.com/anonymous/5391777.

UPDATE: The thread threshold does show some dependence on array.size(). It's not the particular value of the threshold that I'm trying to understand, but rather the existence of the threshold.

DMack

1 Answer


It looks to me like you are trying to fuse the loops. You could try the following; I don't know if it answers your question.

#pragma omp parallel for schedule(static)
for(uint n = 0; n < (numBlocks*blockSize); n++){
    uint blockNum = n/blockSize;
    uint blockSubIdx = n%blockSize;
    uint idx = blockNum * blockSize + blockSubIdx;
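    // Note: idx here is identical to n, since n = blockNum*blockSize + blockSubIdx;
    // it is kept only to mirror the index of the original nested loops.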
    // Compute histogram here by examining array[idx]
}
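
With a compiler that supports OpenMP 3.0, the same fusion could also be written with the collapse clause instead of doing it by hand (just a sketch; I fuse manually because Visual Studio only supports OpenMP 2.0, which lacks collapse):

#pragma omp parallel for schedule(static) collapse(2)
for(uint blockNum = 0; blockNum < numBlocks; blockNum++){
    for(uint blockSubIdx = 0; blockSubIdx < blockSize; blockSubIdx++){
        uint idx = blockNum * blockSize + blockSubIdx;
        // Compute histogram here by examining array[idx]
    }
}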

Also, why are you trying to increase the number of threads? Your CPU can only run a certain number of hardware threads at once, and I don't think increasing the number of OpenMP threads beyond that is going to help performance. Why not just leave it at the default?

Lastly, if computing the histogram takes a different amount of time depending on idx, you might want to try schedule(dynamic) instead.
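
For example (the chunk size of 64 is just an illustrative choice, not something I have tuned):

#pragma omp parallel for schedule(dynamic, 64)
for(uint n = 0; n < (numBlocks*blockSize); n++){
    // same fused body as above; chunks of 64 iterations are handed out
    // to threads on demand rather than assigned up front
}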

  • This code snippet exhibits the same bug as my second code snippet (the computed histogram is corrupted). In fact, since `schedule` isn't passed `blockSize` for its second argument, I would expect the computed histogram to be corrupted (multiple threads would be updating histogram counts for the same block). I've tried increasing the number of threads because I want to see the performance degrade when there's more than 1 thread per CPU, just to show that it actually happens. – DMack Apr 16 '13 at 16:56
  • I just ran your default code. It works fine for me. I don't know what you mean by a high thread count. Are you calling omp_set_num_threads(n)? I don't see that anywhere in your code. I don't see a bug. The code I listed above should work fine. It's a standard way to fuse loops as long as you use idx as defined. –  Apr 16 '13 at 17:23
  • Prior to running the code, I set the OMP_NUM_THREADS environment variable (as mentioned in my post). I guess it would have been more clear if I used omp_set_num_threads() in my code instead. When I set that too high (above 78 on my compute node), the code doesn't work (the computed histogram is corrupted). Essentially, my question is why the fused loop exhibits this thread count dependent behavior. – DMack Apr 16 '13 at 17:45
  • First, why are you changing the number of threads? My CPU can run at most eight hardware threads at once. Changing the number of threads beyond eight is not going to make my code any faster. I think you should use the default number of threads. As to why it gets corrupted when you set it to 79, I think it could be related to the fact that you're running over 8 blocks and 10 buckets, so 80 iterations, and then you set 79 threads. That seems suspicious to me. Maybe somebody else can explain why that matters, but if I were you I would just use the default threads (which work) and move on. –  Apr 16 '13 at 18:00
  • I increased the number of threads purely as an academic exercise (I was just curious...I don't plan on doing it regularly). Given that `blockSize` is `array.size() / 8`, I would have expected that only 8 threads would do actual work (assuming I use the `#pragma omp for schedule(static, blockSize)` directive), regardless of what I set OMP_NUM_THREADS to. But based on the corruption, it would appear that something fishy is going on. Also, I should have mentioned earlier that the "magic" number of 78 is dependent on `array.size()`; I'll update the post to make this clear. – DMack Apr 16 '13 at 18:31
  • So I ran your code again in Visual Studio. I can't reproduce your problem. I set the number of threads to 79, 80, 1000, ... and it always works fine for me. However, I did have to change your code for it to compile. Visual Studio says "error C3106: 'idx' index variables in OpenMP 'for' statement must have signed integral type." You're using unsigned int. I changed your definition of uint to be signed int, and it compiled and I never get a problem. –  Apr 17 '13 at 07:36
  • Very interesting. I tried changing to signed int and got a segfault (strange since `SIZE_TEST_VECTOR` is less than `INT_MAX`), but if I change to signed or unsigned long it works fine. According to http://stackoverflow.com/a/2822064/1647819 (and the OpenMP 3.0 spec), unsigned ints should be fine (I'm using OpenMP 3.0, libgomp). So I'll chalk this up to an implementation bug somewhere. Thanks @raxman for the sleuthing! – DMack Apr 18 '13 at 05:42
  • Yeah, Visual Studio only supports OpenMP 2.0, which probably does not support unsigned int in for loops. It also does not have the collapse clause, which is exactly why I fuse (collapse) the loops myself. –  Apr 18 '13 at 08:11