
Suppose I have a single C/C++ app running on the host. There are a few threads running on the host CPU and 50 threads running on the Xeon Phi cores.

How can I make sure that each of these 50 threads runs on its own Xeon Phi core and is never evicted from that core's cache (given the code is small enough)?

Could someone please outline a very general idea of how to do this, and which tool/API would be most suitable (for C/C++ code)?

What is the fastest way to exchange data between the host thread-aggregator and the 50 Phi threads?

The actual parallelism will be very limited: this is going to be more like a plain 51-thread application with some basic multithreaded data synchronization.

Can I use a conventional C/C++ compiler to create an app like this?

Boppity Bop
  • First a warning: this is probably a bad idea. If you insist on doing it anyway, it's going to depend on the API you're using. IIRC, the Phi is normally programmed via OpenCL. OpenCL allows you to "partition" a device with `clCreateSubDevices`. To restrict a thread to one core, you'd execute a task on a sub-device with only one compute device. – Jerry Coffin Apr 13 '14 at 12:38
  • I am sorry, Jerry, but I disagree. 'Normally' means by someone who migrated from Nvidia card development. There are also other people in the world. – Boppity Bop Apr 13 '14 at 12:41
  • Okay, let me try to be a little more general: the method that's available is going to depend on the API you use to program it, and you seem dead set on depriving us of that information, so all we can do is guess. – Jerry Coffin Apr 13 '14 at 12:43
  • No, I am not. I edited the question to highlight that I am very open about what to use (I called it 'tools'; what I probably meant was 'API'). I want the shortest way: I don't have experience in OpenCL and I don't want my app to be portable to other platforms. So I want the shortest and the **immediate** way to implement what I want. – Boppity Bop Apr 13 '14 at 12:46

1 Answer


You have raised several questions:

  1. Yes, you can write a conventional C program and compile it with the regular Intel C/C++/Fortran compilers (known as Intel Composer XE) to generate a binary that runs on the Intel Xeon Phi coprocessor in either "native"/"symmetric" or "offload" mode. In the simplest case, you just recompile your C/C++ program with `-mmic` and run it natively on the Phi as is.

  2. Which API to use? Use the OpenMP 4.0 standard or the Intel Cilk Plus programming model (each is essentially a set of pragmas or keywords applicable to C/C++). OpenCL, Intel TBB and probably OpenACC are also possible, but OpenMP and Cilk Plus can express threading, vectorization and offload (the three things essential for Xeon Phi programming) without refactoring or rewriting a "conventional" C/C++/Fortran program.

  3. Thread pinning: this can be achieved via OpenMP affinity (see more details on MIC_KMP_AFFINITY below) or via the Intel TBB affinity facilities.

  4. The fastest way to exchange data between the host and the target Phi is to avoid any exchange, for example by using the MPI symmetric approach. However, you seem to be asking about the "offload" programming model specifically, and there asynchronous offload gives the best performance. Synchronous offload is theoretically simpler to program, but worse in terms of achievable performance.
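To make the synchronous offload case from point 4 a bit more concrete, here is a minimal sketch using the standard OpenMP 4.0 `target` construct. The function name `offload_sum_squares` is made up for illustration, and I am assuming a compiler with OpenMP 4.0 offload support; if OpenMP is disabled or no device is present, the region simply runs on the host.

```cpp
#include <cstddef>

// Hypothetical helper: sums the squares of `n` doubles inside an OpenMP
// target region. map(to: ...) copies the input to the device before the
// region starts; map(tofrom: sum) copies the scalar in and back out.
// The call is synchronous: the host waits until the region has finished.
double offload_sum_squares(const double* data, std::size_t n) {
    double sum = 0.0;
    #pragma omp target map(to: data[0:n]) map(tofrom: sum)
    {
        for (std::size_t i = 0; i < n; ++i) {
            sum += data[i] * data[i];
        }
    }
    return sum;
}
```

Every call pays for copying `data` to the device, so keeping the data resident on the coprocessor between calls (or overlapping transfers with host work, as in the asynchronous case) is what usually matters for performance.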

Overall, you are asking several general questions, so I would recommend starting from the very beginning, i.e. with the following ~10-page Dr. Dobb's manual or the given Intel intro document.


Thread pinning is a more advanced topic and at the same time seems to be the most interesting one for you, so I will explain it in more detail:
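As a rough illustration of what pinning means at the OS level, here is a minimal sketch that assumes a Linux environment and uses the GNU `pthread_setaffinity_np` call directly, rather than the OpenMP affinity settings mentioned above. The helper names (`allowed_cpus`, `pin_current_thread`, `run_pinned_workers`) are made up for this example. Note that it pins to logical CPUs, and a logical CPU is not necessarily a whole physical core; mapping logical CPU numbers to cores is left out here.

```cpp
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <pthread.h>
#include <sched.h>

#include <atomic>
#include <cstddef>
#include <thread>
#include <vector>

// Logical CPUs the current thread is allowed to run on.
std::vector<int> allowed_cpus() {
    cpu_set_t set;
    CPU_ZERO(&set);
    std::vector<int> cpus;
    if (sched_getaffinity(0, sizeof(set), &set) != 0) return cpus;
    for (int c = 0; c < CPU_SETSIZE; ++c) {
        if (CPU_ISSET(c, &set)) cpus.push_back(c);
    }
    return cpus;
}

// Pin the calling thread to a single logical CPU.
// Returns 0 on success, otherwise an error number.
int pin_current_thread(int cpu) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

// Start `workers` threads, pin each one to an allowed CPU (round-robin),
// and return how many of them were pinned successfully.
int run_pinned_workers(int workers) {
    const std::vector<int> cpus = allowed_cpus();
    if (cpus.empty()) return 0;
    std::atomic<int> pinned{0};
    std::vector<std::thread> pool;
    for (int i = 0; i < workers; ++i) {
        pool.emplace_back([i, &cpus, &pinned] {
            const int cpu = cpus[static_cast<std::size_t>(i) % cpus.size()];
            if (pin_current_thread(cpu) == 0) ++pinned;
            // ... the actual per-thread work would go here ...
        });
    }
    for (auto& t : pool) t.join();
    return pinned.load();
}
```

With an OpenMP runtime you would normally not write this by hand; the placement is requested through the runtime's affinity settings instead, and the code above only shows the underlying idea. Pinning also does not guarantee that a thread's code and data stay in cache; it only stops the scheduler from moving the thread to another CPU.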

zam