2

I'm fairly new to C++, therefore please pardon if this is a stupid question, but I didn't find good example of what I'm looking for on the internet.

Basically I'm using a parallel_for cycle to find a maximum inside a 2D array (and a bunch of other operations in between). First of all I don't even know if this is the best approach, but given the length of this 2D array, I though splitting the calculations would be faster.

My code:

vector<vector<double>> InterpU(1801, vector<double>(3601, 0));
Concurrency::parallel_for(0, 1801, [&](int i) {

    long k = 0; long l = 0;
    pair<long, long> Normalized;
    double InterpPointsU[4][4];
    double jRes;
    double iRes = i * 0.1;
    double RelativeY, RelativeX;
    int p, q;

    while (iRes >= (k + 1) * DeltaTheta) k++;
    RelativeX = iRes / DeltaTheta - k;
    for (long j = 0; j < 3600; j++)
    {
        jRes = j * 0.1;
        while (jRes >= (l + 1) * DeltaPhi) l++;
        RelativeY = jRes / DeltaPhi - l;
        p = 0;
        for (long m = k - 1; m < k + 3; m++)
        {
            q = 0;
            for (long n = l - 1; n < l + 3; n++)
            {
                Normalized = Normalize(m, n, PointsTheta, PointsPhi);
                InterpPointsU[p][q] = U[Normalized.first][Normalized.second];
                q++;
            }
            p++;
        }
        InterpU[i][j] = bicubicInterpolate(InterpPointsU, RelativeX, RelativeY);
        if (InterpU[i][j] > MaxU)
        {
            SharedDataLock.lock();
            MaxU = InterpU[i][j];
            SharedDataLock.unlock();
        }
    }
    InterpU[i][3600] = InterpU[i][0];
});

You can see here that I'm using a mutex called SharedDataLock to protect multiple threads accessing the same resource. MaxU is a variable that should only containe the maximum of the InterpU vector. The code works well, but since I'm having speed performance problem, I began to look into atomic and some other stuff.

Is there any good example on how to modify a similar code to make it faster?

Uwe Keim
  • 39,551
  • 56
  • 175
  • 291
Noldor130884
  • 974
  • 2
  • 16
  • 40
  • 2
    There is no need to have shared `MaxU` variable. Each thread should find the maximum among it's vector range and then you need to find a maximum among thread results. – user7860670 Jul 18 '17 at 07:45
  • hava a look at `std::atomic`. It is another tool for synchronisation. – OutOfBound Jul 18 '17 at 07:54
  • You code is not complete, please post a full example describing your problem... Anyway, you could avoid the lock by using atomic operations. – Martin Hierholzer Jul 18 '17 at 07:55
  • 1
    Have you considered using a `combinable` for `MaxU`? i.e each thread has its own maximum value, which is later combined using `max`. – Hasturkun Jul 18 '17 at 08:15
  • @VTT Well in the end I need to have a single vector containing all the data AND a maximum to normalize it with. – Noldor130884 Jul 18 '17 at 11:44

1 Answers1

4

As mentioned by VTT, you can simply find the local maximum of each thread, and merge those afterwards With use of combinable:

Concurrency::combinable<double> CombinableMaxU;
Concurrency::parallel_for(0, 1801, [&](int i) {
    ...
        CombinableMaxU.local() = std::max(CombinableMaxU.local(), InterpU[i][j]);
}
MaxU = std::max(MaxU, CombinableMaxU.combine(std::max<double>));

Note that your current code is actually wrong (unless MaxU is atomic), you read MaxU outside of the lock, while it can be written simultaneously by other threads. Generally, you must not read a value that is being written to simultaneously unless both sides are protected by atomic semantics or locks and memory fences. A reason is that a variable access may very well consist of multiple memory accesses, depending on how the type is supported by hardware.

But in your case, you even have a classic race condition:

MaxU == 1
  Thread a                 |   Thread b
InterpU[i][j] = 3          | InterpU[i][j] = 2
if (3 > MaxU)              |  if (2 > MaxU)
SharedDataLock.lock();     | SharedDataLock.lock();
(gets the lock)            | (waiting for lock)
MaxU = 3                   | ...
SharedDataLock.unlock();   | ...
...                        | (gets the lock)
                           | MaxU = 2
                           | SharedDataLock.unlock();
MaxU == 2

Locks are hard.

You can also use an atomic and compute the maximum on that. However, I would guess1 that it still doesn't perform well inside the loop2, and outside the loop it doesn't matter whether you use atomics or locks.

1: When in doubt, don't guess - measure!

2: Just because something is atomic and supported by hardware, doesn't mean it is as efficient as accessing local data. First, atomic instructions are often much more costly than their non-atomic counterparts, second you have to deal with very bad cache effects, because cores/caches will fight for the ownership of the data. While atomics may be more elegant in many cases (not this one IMHO), reduction is faster most of the time.

Zulan
  • 21,896
  • 6
  • 49
  • 109
  • 1
    If I'm not mistaken, you should be accessing and assigning `CombinableMaxU.local()` in the loop. Also, you can probably finish with just `MaxU = CombinableMaxU.combine(std::max);`, assuming its prior value doesn't matter. – Hasturkun Jul 18 '17 at 08:46
  • @Hasturkun forgot about the `.local()`, thanks. TBH I am not very familiar with PPL, though it seems quite intuitive. I wanted my answer to be idiomatic, rather than the generic manual solution I used before the edit. From the question I don't know whether the prior value of `MaxU` matters, so I just use a semantically equivalent code. – Zulan Jul 18 '17 at 09:01
  • @Zulan, I don't understand how the current code is wrong, since no matter what MaxU contains the absolute maximum... I actually wouldn't care the order in which the maximum is compared with its previous value... Anyway, your answer is very much appreciated and looks exactly what I was searching for. Thanks for having taught me something new! – Noldor130884 Jul 18 '17 at 11:47
  • @Noldor130884 I added a more detailed explanation about the various reasons why your current code is wrong in my answer. – Zulan Jul 18 '17 at 11:59
  • @Zulan sorry to bother again, but neither `MaxU = CombinableMaxU.combine(std::max);` or `MaxU = std::max(MaxU, CombinableMaxU.combine(std::max));` are working. Am I missing something? How do I have to declare MaxU? – Noldor130884 Jul 18 '17 at 12:33
  • In what way do they not work? You haven't shown how `MaxU` is declared, but I assumed `double`. – Zulan Jul 18 '17 at 12:45
  • Yes it is: funny thing though, if I write "plus" instead of "max" it does work. It says that the arguments of the function are not those expected – Noldor130884 Jul 18 '17 at 12:47
  • Any chance you have an excess pair of `()` there? It's `combine(std::max)`, but `combine(std::min())`. While I can't test myself, this is [reportedly working](https://stackoverflow.com/a/10365865/620382), although [sometimes](https://stackoverflow.com/a/22943575/620382) `combine([](double left, double right){ return std::maxleft, right);})` is suggested. – Zulan Jul 18 '17 at 12:59
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/149486/discussion-between-noldor130884-and-zulan). – Noldor130884 Jul 18 '17 at 13:01
  • Solved it by using a template max function created by myself: – Noldor130884 Jul 18 '17 at 13:58