
Are coprocessors like the Intel Xeon Phi supposed to be utilized much like GPUs, where one offloads a large number of blocks executing a single kernel so that the speedup comes only from the overall throughput the coprocessor delivers, OR will offloading independent threads (tasks) increase efficiency as well?

Marc Andreson
  • Generally, both require a substantial degree of parallelism. For anything more meaningful, your question is extremely broad. – void_ptr Feb 04 '15 at 22:49
  • It is incorrect to suggest that these usage models are mutually exclusive. Intel Xeon Phi supports them both. For a more detailed answer, ask a more precise question. – Jeff Hammond May 14 '15 at 16:21

1 Answer


The Xeon Phi requires a large degree of both functional parallelism (different threads) and vector parallelism (SIMD). Since the cores are essentially enhanced Pentium processors, serial code runs slowly. This will change somewhat with the next generation, which will use faster and more modern cores. The current Xeon Phi also suffers from the I/O bottleneck common to any coprocessor, since it has to communicate over a PCIe bus.
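
As a minimal sketch of those two levels of parallelism (the function and loop here are illustrative, not from the original answer):

    #include <omp.h>

    /* "parallel for" spreads iterations across the cores' hardware
       threads; "simd" asks the compiler to fill the 512-bit vector
       lanes within each thread (OpenMP 4.0 syntax). */
    void scale_add(float *restrict y, const float *restrict x,
                   float a, int n)
    {
        #pragma omp parallel for simd
        for (int i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];
    }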

So although you could offload a kernel to every core and exploit the 512-bit vectorization (similar to a GPGPU), you can also separate your code into different functional blocks (i.e. different codes/kernels) and run them on different sets of Xeon Phi cores. Either way, each block of code must also exploit the 512-bit SIMD vectors.
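
In the offload model, a kernel like the one above can be wrapped in Intel's Language Extensions for Offload (LEO) pragma; the pragma and clauses are real Intel compiler syntax, but the surrounding function is again just an illustration:

    /* Run the loop on coprocessor 0, copying x to the card and y
       both ways; scalars like a and n are copied in automatically. */
    void offload_scale_add(float *y, const float *x, float a, int n)
    {
        #pragma offload target(mic:0) in(x:length(n)) inout(y:length(n))
        {
            #pragma omp parallel for simd
            for (int i = 0; i < n; i++)
                y[i] = a * x[i] + y[i];
        }
    }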

The Xeon Phi also operates as a native processor, so you can access other resources by mounting NFS directory trees, communicate between cards and other processors in the cluster using TCP/IP, use MPI, etc. Note that this is not 'offload' but native execution. The PCIe bus, however, remains a significant bottleneck limiting I/O.
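
For example, a plain MPI program can treat each card as just another node; this sketch assumes Intel MPI's -mmic cross-compilation flag (e.g. mpiicc -mmic hello.c) and a host file that lists the card's mic0 interface:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, len;
        char host[MPI_MAX_PROCESSOR_NAME];

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Get_processor_name(host, &len);

        /* Each rank may live on a host CPU or on a Xeon Phi card;
           MPI hides the distinction from the program itself. */
        printf("rank %d running on %s\n", rank, host);

        MPI_Finalize();
        return 0;
    }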

To summarize,

  • You can use an offload model similar to that used by GPGPUs.
  • The Xeon Phi itself can also support functional parallelism (more than one kernel), but each kernel must also exploit the 512-bit SIMD.
  • You can also write native code and use MPI, treating the Xeon Phi as a conventional (non-offload) node (always remembering the PCIe I/O bottleneck).
Taylor Kidd
  • Thanks. I'm mostly interested in using OpenMP/OpenACC for offloading the tasks. Would you happen to know how the *threadblocks* are assigned to the cores? Are the blocks scheduled somehow, much like in the CUDA programming model (randomly)? Or is there always only one threadblock, which distributes threads across all the available cores? – Marc Andreson Feb 05 '15 at 19:46
  • With OpenMP, Intel provides environment variables that let you specify where and how threads are assigned. KMP_PLACE_THREADS lets you specify the number of cores you would like to use and how many threads you would like per core. KMP_AFFINITY lets you specify how the threads are assigned: a compact affinity assigns the threads in order, while a scatter affinity places the threads on the cores in round-robin fashion. By the way, these variables work on other Intel processors as well. – froth Feb 05 '15 at 20:26
  • @froth where can I find more info about that? (i.e. those environment variables as well as the programming model for Intel Xeon Phi) – Marc Andreson Feb 06 '15 at 10:18
  • @MarcAndreson You can go to https://software.intel.com/en-us/xeonphi. There are articles on programming under the Programming tab, blogs, and a forum where people have asked similar questions. There are also the compiler and tools reference manuals and user guides that you can find under Tools->Documentation at the very top of the page. Disclaimer: this site is, as the URL implies, owned by Intel, and the answers you find there will be focused on Intel products. There are also several books (dead tree and electronic) from the usual online booksellers. – froth Feb 06 '15 at 17:09
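
To illustrate the placement variables froth mentions, here is a small probe, assuming the Intel OpenMP runtime (the program name is hypothetical). Run it as, e.g., KMP_PLACE_THREADS=60c,4t KMP_AFFINITY=scatter,verbose ./probe; the verbose modifier makes the runtime print its placement decisions:

    #include <omp.h>
    #include <stdio.h>

    int main(void)
    {
        #pragma omp parallel
        {
            /* With compact affinity, consecutive thread ids share a
               core; with scatter, they are spread round-robin. */
            printf("thread %d of %d\n",
                   omp_get_thread_num(), omp_get_num_threads());
        }
        return 0;
    }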