#include <iostream>
#include <vector>
#include <stdexcept>
#include <sstream>
#include <omp.h>

std::vector<int> col_sums(const std::vector<std::vector<short>>& data) {
    unsigned int height = data.size(), width = data[0].size();
    std::vector<int> totalSums(width, 0), threadSums(width, 0);

    #pragma omp parallel firstprivate(threadSums)
    {
        #pragma omp parallel for
        for (unsigned int i = 0; i < height; i++) {
            threadSums.data()[0:width] += data[i].data()[0:width];
        }
        #pragma omp critical
        {
            totalSums.data()[0:width] += threadSums.data()[0:width];
        }
    }
    return totalSums;
}

int main(int argc, char** argv) {
    if (argc < 3) {
        std::cout << "Run program as \"executable <rows> <columns>\"\n";
    } else {
        std::stringstream args;
        args << argv[1] << " " << argv[2];
        int rows, columns;
        args >> rows >> columns;
        std::vector<std::vector<short>> data(rows, std::vector<short>(columns));
        std::vector<int> columnSums = col_sums(data);
    }
}
  • export OMP_NUM_THREADS=4
  • icpc -Ofast -fopenmp -g dummy.cpp -o dummy
  • /usr/bin/time -v ./dummy 115000 20000
  • CPU% = 225% (should be 380%+)

I'm fairly experienced with OpenMP and CilkPlus, but the barrier to scaling here eludes me, and this is a fairly rudimentary program. I know it has to be something obvious, but I feel like I've eliminated all the data hazards and control hazards. I'm totally stumped.

  • Why not use an omp reduction here? Not compatible with CilkPlus array notation? – Jeff Hammond May 15 '15 at 02:11
  • I already tested. CilkPlus ran much faster. CilkPlus and OpenMP usually get along swimmingly, but somehow this just isn't working right. Primarily it's because the data hazards are numerous enough in this instance that OpenMP steps on itself. The behavior has been such that the partial sums aren't built up much, and totalSums gets hit very often by competing threads. Maybe icpc is screwing it up, but that's what the profilers point to. – patrickjp93 May 15 '15 at 02:12
  • I meant instead of what you're doing with omp critical. – Jeff Hammond May 15 '15 at 02:13
  • Oh...(feels like a ditz)...brb. Actually, I remember why. You get the wrong answer. omp reductions are not data-safe when you have multiple threads activating reductions on the same data set within a parallel block. I ran into this nasty mess earlier in the semester. – patrickjp93 May 15 '15 at 02:22
  • Also, in OpenMP 4.0 you can't do a reduction on an array (what genius thought it was a good idea to get rid of that feature?). – patrickjp93 May 15 '15 at 02:38
  • OpenMP has never supported array reduction in C and C++, only in Fortran. In any case, your data size of 4+ GiB greatly exceeds the CPU cache size. Have you tried to measure (e.g. using the hardware performance counters) the memory bandwidth utilisation? I guess two threads already saturate the memory channel and adding new threads won't speed up the calculation at all, hence the poor scaling. If your system has more than one CPU **socket**, then try to spread the threads among the sockets by setting `KMP_AFFINITY` accordingly. – Hristo Iliev May 15 '15 at 08:49
  • Actually under OpenMP 3.0 array reductions were supported, though they were buggy and removed from the standard. Furthermore, I'm aware we're exceeding cache here, though frankly with SSE 4.2 I would expect L3 to be holding a good size chunk of each row as I'm traversing and the TLB fetching the next 16KB for each core pretty much without missing a beat except at the end of each row. I tried to change the final reduction to use slices of each row by using a modulus operator, but that actually lost me some performance. – patrickjp93 May 15 '15 at 12:52
  • It's also been suggested the vector allocation is the choke point currently weighing down the %utilization average. While I can believe this in the 2-thread case, I would still expect more than 190% utilization for 4 threads. – patrickjp93 May 15 '15 at 12:57
  • By using #pragma omp parallel for inside the outer parallel region, you are doing all of the work once in each of the outer-level parallel threads. So with OMP_NUM_THREADS=4 you're doing four times too much work. That may push up the amount of CPU you use, but it seems unlikely to be what you intended. Depending on whether you have also set OMP_NESTED you may be over-subscribing too. (Using instantaneous CPU as your metric doesn't tell you if the CPUs are doing anything useful...) – Jim Cownie May 16 '15 at 15:53
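
For reference, here is a minimal sketch of the summation routine with the change Jim Cownie's comment points at: the inner #pragma omp parallel for becomes a plain #pragma omp for, so the team shares the rows once instead of every outer thread re-running the whole loop. It keeps the CilkPlus array notation from the question, so it assumes a compiler (such as icpc) that accepts that extension and the same includes and -fopenmp flag as above; the name col_sums_workshared is just for illustration, not the asker's code.

std::vector<int> col_sums_workshared(const std::vector<std::vector<short>>& data) {
    unsigned int height = data.size(), width = data[0].size();
    std::vector<int> totalSums(width, 0), threadSums(width, 0);

    #pragma omp parallel firstprivate(threadSums)
    {
        // Work-sharing construct: the rows are divided among the existing
        // team, rather than each thread executing the full loop itself.
        #pragma omp for
        for (unsigned int i = 0; i < height; i++) {
            threadSums.data()[0:width] += data[i].data()[0:width];
        }
        // One combine of the per-thread partial sums per thread.
        #pragma omp critical
        {
            totalSums.data()[0:width] += threadSums.data()[0:width];
        }
    }
    return totalSums;
}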
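
The reduction route Jeff Hammond raises also became practical later: OpenMP 4.0 (current when this was asked) has no C/C++ array reduction, but compilers implementing OpenMP 4.5, published after this question, accept a reduction over an array section. A hedged sketch assuming such a compiler, using plain inner loops instead of CilkPlus notation; col_sums_reduced is an illustrative name:

std::vector<int> col_sums_reduced(const std::vector<std::vector<short>>& data) {
    unsigned int height = data.size(), width = data[0].size();
    std::vector<int> totalSums(width, 0);
    int* sums = totalSums.data();

    // OpenMP 4.5 array-section reduction: each thread works on a private
    // copy of sums[0:width]; the runtime combines the copies at the end.
    #pragma omp parallel for reduction(+:sums[0:width])
    for (unsigned int i = 0; i < height; i++) {
        for (unsigned int j = 0; j < width; j++) {
            sums[j] += data[i][j];
        }
    }
    return totalSums;
}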
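
And to test Hristo Iliev's bandwidth hypothesis, one cheap experiment (assuming the Intel OpenMP runtime, since KMP_AFFINITY is Intel-specific) is to pin and spread the threads across sockets and rerun the same timing command:

  • export OMP_NUM_THREADS=4
  • export KMP_AFFINITY=verbose,scatter
  • /usr/bin/time -v ./dummy 115000 20000

If wall time barely improves between 2 and 4 threads even with the threads spread out, the memory channel is the likely bottleneck rather than the threading itself.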

0 Answers