OpenMP - Weird Result in Combination of parallel and SIMD namespaces

Question

I have a C++ project which uses OpenMP, and at one place in the code I have #pragma omp simd nested inside #pragma omp parallel. There was a consistent crash which happened only in multi-threaded runs compiled in debug mode (and not in release). I wrote a short reproducible example which demonstrates the problem:

#include <iostream>
#include <atomic>
#include <omp.h>

struct A {
    int z;
};

int main() {
    size_t size = 100;
    auto A_arr = new A*[size];

#pragma omp parallel
{
#pragma omp for schedule(dynamic)
    for (size_t x = 0; x < size; ++x) {
        A_arr[x] = new A{0};
    }
}

#pragma omp parallel
{
    A** begin = A_arr;

#pragma omp simd
    for (size_t x = 0 ; x < size ; ++x) {
        A* a = *begin;
        auto z = a->z;
        begin++;
    }
}
    delete[] A_arr;
    return 0;
}

Compiling this with icpc in debug mode runs just fine. But, if I change the SIMD loop to

#pragma omp simd
    for (size_t x = 0 ; x < size ; ++x) {
        A* a = begin[x];
        auto z = a->z;
    }
}

(which should be logically equivalent) the code suddenly crashes in debug mode compilation, and works fine in release mode.

I did a lot of debugging to isolate the problematic part of the code, and I think the example I presented needs no further context.

I also tried using gdb (in the crash it sometimes claims that a is NULL, and sometimes it points to a memory location which cannot be read), and valgrind (which ran successfully).

From searching online, I understand that SIMD vectorization doesn't happen at -O0, but apparently the SIMD pragma still causes some assumptions to be made about the loop's span in the debug build, which may explain the different results in debug and release modes.

Of course what I described here solves the problem, but I wish to understand better what happens here, and whether there's a "missing bug" which I just hid deeper.

Thanks in advance!

Please share the compiler options used (e.g. `-O2` or not) and the version of the compiler. I am not able to reproduce this issue with the latest code. In fact, with the latest version of the code, the second loop is completely optimized out by the compiler (ICC 2021.7.1) with `-O2 -fopenmp`. By the way, `#pragma omp simd` is useless here on most architectures, since it requires gather instructions, which are generally not faster even when they are supported. The initial code had an atomic instruction, which cannot be SIMDified on any architecture I am aware of (it is definitely not supported on x86-64). — Jérôme Richard, Jul 02 '23 at 14:06
Last but not least, there is a memory leak in your program: `delete[] A_arr;` does not delete the allocated values (created with `A_arr[x] = new A{0};`). You need to delete them with a loop (unless you change the way they are allocated). — Jérôme Richard, Jul 02 '23 at 14:08
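A minimal sketch of the cleanup described in the comment above, assuming the same `A` struct as in the question. The helper names `make_array` and `destroy_array` are hypothetical and only serve the illustration; the return value of `destroy_array` is added purely so the cleanup can be checked.

```cpp
#include <cstddef>

struct A {
    int z;
};

// Hypothetical helper: allocate `size` separately heap-allocated A objects,
// mirroring the `A_arr[x] = new A{0};` loop in the question (serial here).
A** make_array(std::size_t size) {
    A** arr = new A*[size];
    for (std::size_t x = 0; x < size; ++x) {
        arr[x] = new A{0};
    }
    return arr;
}

// Hypothetical helper: free every element first, then the pointer array.
// `delete[] arr` alone releases only the array of pointers.
std::size_t destroy_array(A** arr, std::size_t size) {
    std::size_t freed = 0;
    for (std::size_t x = 0; x < size; ++x) {
        delete arr[x];
        ++freed;
    }
    delete[] arr;
    return freed;  // number of elements released, for checking only
}
```

Storing the objects by value, or holding them through `std::unique_ptr`, would be one way to "change the way they are allocated" so that no explicit loop is needed.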
Thanks for the remark about the leak, I wrote the program quickly just to reproduce the problem and missed that. Indeed there was an atomic at first, but after some more experimenting I saw that the error persists without it, so I removed it. The compile options I'm using are `-O0 -g -axAVX,CORE-AVX-I,CORE-AVX2 -qopenmp -ipp -lstdc++fs`, and the compiler is `icpc (ICC) 2021.1 Beta 20201112`. I agree that the `#pragma omp simd` seems unnecessary here, I can remove it and all my problems will vanish, but I just want to figure out and understand better what happens here. Thank you very much for the comment! — Amit, Jul 02 '23 at 15:50
I would assume that it uses a gather instruction which also tries to access elements after the `A_arr`. Did you try if the error vanishes if that array size is a multiple of 8 (or 16)? And does the error appear in optimized mode, if you actually use the `z` variable (e.g, accumulate these into a `sum` variable)? — chtz, Jul 03 '23 at 10:06
Thank you for your answer! The error wasn't reproduced with `-O1`, only with `-O0`. Changing size to `320` (=16*20) reproduced the error as well. Summing `z` into `sum` variable also still kept the error. — Amit, Jul 03 '23 at 10:42
Adding a reduction to print out some results, I changed the code to https://godbolt.org/z/KxfxPrK5a. The view compares the code generated for your two different simd loops. With higher optimization, the generated code becomes identical. With O0, you can see some differences. When I execute the code (on my system, dropping -ipp), I get varying results for your problematic loop, which looks like a data race. I can reproduce this issue with icpc 2021.2.0, but not with 2021.4.0 or newer. From my perspective, this looks like a compiler bug that is fixed in newer versions. — Joachim, Jul 06 '23 at 11:06
@Joachim Thanks a lot for your comment! I will try to reproduce what you said (installing newer version of `icpc`), hopefully it will fix all of my issues. — Amit, Jul 06 '23 at 15:22
@Joachim Unfortunately I couldn't find anywhere to download the 2021.4.0 version of Intel's oneAPI toolkit. They only provide the latest version, which is not compatible with my OS. If you can assist further, that would be great! I will try to manage it anyway. Thanks! — Amit, Jul 09 '23 at 13:48
For small tests you can use https://godbolt.org/ (test different compilers, different versions,...) — PierU, Jul 09 '23 at 14:30
I want to compile my entire project with icpc 2021.4. It's a big project with lots of components and files, so a web application is not good enough. — Amit, Jul 09 '23 at 14:31
On [this page](https://hackmd.io/@nadhifmr/H1HcwPeUj) they give some direct links and procedure to get the 2021.4 version. Worth giving a try. Indeed the older versions are officially available only for customers who have a commercial license. — PierU, Jul 10 '23 at 08:36
The download page offers versions for windows, macos and linux. I'm curious for which OS you could get a beta release but no current release. — Joachim, Jul 10 '23 at 19:30
I am not sure how exactly this compiler got onto my OS. I tried the link @PierU gave but unfortunately had some problems with the installation; my OS is relatively old and some things didn't work well. I tried working with the latest version on a different machine, and apparently my code doesn't even compile there because the compiler cannot vectorize the loops, so it's probably this problem :). I will try to write this up in order and maybe post a summarizing answer here. Thanks a lot to everyone who contributed to the discussion! — Amit, Jul 16 '23 at 05:22
