Xeon Phi: Impossible to achieve perfect memory coalescing and fully utilize SIMD units?

Question

I have a GPU/CUDA code that processes a cube (a 3D image, a spectral cube to be precise). Think of the cube as a series of images/slices or, alternatively, as a bunch of spectra at different spatial locations (on a square). Each pixel of an image has different x, y values and the same z. Each pixel of a spectrum has the same x, y but a varying z. The cube is laid out in memory so that two consecutive memory addresses correspond to x and x+1.

In my CUDA code I configured each CUDA thread to process a different spectrum. This way I can achieve global memory coalescing. Then I ported this code to the Intel Xeon Phi (#pragma offload + OpenMP). As in the GPU case, I have each Phi core process a different spectrum. As a result, memory coalescing is achieved here as well. However, the performance is bad.

I assume the problem is that, although I have coalescing with the global memory, the pixels along each spectrum are not at consecutive memory addresses, and as a result the Phi's vectorization does not provide any performance improvement. (Remember, each core does a kind of reduction along its associated spectrum; to be more precise, it calculates the 1st, 2nd, and 3rd moments.) Does this reasoning make sense?
If I am not mistaken, in order to gain performance from SIMD, the memory addresses have to be contiguous, right?
So it seems that on the Xeon Phi it is impossible to achieve perfectly coalesced global memory access and at the same time take full advantage of SIMD. Does this make sense, or am I totally wrong?

PS: I am using Intel Xeon Phi 7120

With this memory layout you want each thread to use a full cache line of data, so that no memory bandwidth is wasted. With a line size of 64 bytes, and assuming an element size of four bytes, you would want one thread to process 16 spectra at once, assuming you have enough registers to hold all the data. If you don't have enough registers, the Xeon Phi probably benefits more from a memory layout where each spectrum is contiguous in memory, so that threads can efficiently process a single spectrum. — tera, Dec 28 '16 at 18:22
To have memory coalescing, you need a group of threads in which each thread accesses one element of a cache line at the same moment. First, the Xeon Phi does not have a shared cache; second, Xeon Phi cores do not run in lockstep, so the access pattern gets out of sync quickly. But the Xeon Phi has 4 threads per core that run with fine-grained multithreading, and that is the best memory coalescing you can get out of it. Each physical core should run an independent data stream. — user3528438, Dec 28 '16 at 18:23
@PaulR Oh sorry, I forgot to mention that. I am using KNC, 7120. — AstrOne, Dec 28 '16 at 23:07
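
tera's suggestion of letting one thread handle 16 neighbouring spectra per cache line could be sketched roughly as below. This is only an illustration under stated assumptions: the element type is a 4-byte float, the index formula is (z * ny + y) * nx + x, nx is a multiple of 16, and the "moments" are reduced to plain power sums of the pixel values, since the question does not give the exact definition. The names moments_block and moments_cube are hypothetical and do not come from the original code.

```c
#include <stddef.h>

/* Assumed: 64-byte cache line / 4-byte float = 16 spectra per line. */
#define BLOCK 16

/* Hypothetical helper: accumulates power sums for BLOCK adjacent spectra.
 * Assumed layout (x fastest): index = (z * ny + y) * nx + x.
 * The inner loop walks x, so each z step reads one contiguous run of
 * BLOCK floats and the BLOCK accumulators map onto SIMD lanes. */
static void moments_block(const float *cube, size_t nx, size_t ny, size_t nz,
                          size_t x0, size_t y,
                          float *m1, float *m2, float *m3)
{
    float s1[BLOCK] = {0}, s2[BLOCK] = {0}, s3[BLOCK] = {0};

    for (size_t z = 0; z < nz; ++z) {
        const float *row = cube + (z * ny + y) * nx + x0;
        #pragma omp simd
        for (int i = 0; i < BLOCK; ++i) {
            float v = row[i];
            s1[i] += v;          /* stand-in for the 1st moment */
            s2[i] += v * v;      /* stand-in for the 2nd moment */
            s3[i] += v * v * v;  /* stand-in for the 3rd moment */
        }
    }
    for (int i = 0; i < BLOCK; ++i) {
        m1[i] = s1[i];
        m2[i] = s2[i];
        m3[i] = s3[i];
    }
}

/* Hypothetical driver: each OpenMP thread takes whole blocks of spectra.
 * Assumes nx is a multiple of BLOCK; m1, m2, m3 each hold nx * ny floats. */
void moments_cube(const float *cube, size_t nx, size_t ny, size_t nz,
                  float *m1, float *m2, float *m3)
{
    #pragma omp parallel for collapse(2)
    for (size_t y = 0; y < ny; ++y) {
        for (size_t xb = 0; xb < nx; xb += BLOCK) {
            size_t out = y * nx + xb;
            moments_block(cube, nx, ny, nz, xb, y,
                          m1 + out, m2 + out, m3 + out);
        }
    }
}
```

The point of the design is that the reduction still runs along z, but the vector lanes run along x, so contiguous loads and SIMD use do not conflict. Whether the three accumulator arrays actually stay in registers on KNC is up to the compiler and would need to be checked in the vectorization report.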
