I'm using Intel VTune Amplifier XE 2013 to profile a parallel program running on a multicore CPU; specifically, the program is written in OpenCL and executed on a Xeon Phi. I would like to know exactly how to interpret the results that VTune reports, i.e.:
- Is a reported performance counter value collected for a single thread or for the whole core? (Assume the CPU has many cores and each core can execute several threads concurrently, as is the case on Xeon Phi.)
- How does VTune sample on a multicore CPU? Does it sample a single core and report that, or sample many cores and take the average?