Why does Xeon Phi always get poor performance?

Question

I ran a for loop 1,000,000,000 times on a Xeon E5 and on a Xeon Phi and measured the time to compare their performance. I was surprised by the results:

On E5 (1 Thread): 41.563 Sec
On E5 (24 Threads): 22.788 Sec
Offload on Xeon Phi (240 Threads): 45.649 Sec

Can anybody tell me why the performance is so bad? Is it due to the architecture or something else?

Why do I get such poor performance on the Xeon Phi? I do nothing inside the for loop. If there is nothing wrong with my Xeon Phi coprocessor, what kind of work is the Xeon Phi good at? Does it have to be vectorized? If the work is not vectorized, is there anything I can do with the Xeon Phi's threads that would help?

If you post some code it will be easier to see if there is anything that can help. It could be vectorisation but it could also be memory alignment, prefetching problems, compiler flags or a number of other possibilities. — amckinley, Nov 20 '14 at 10:47
The question is not definitively answerable without sample code. My guess is that your loop is either memory-bound, not vectorizable, or relies too much on OOE of desktop processors. — Mysticial, Nov 21 '14 at 23:38
Warning for future readers: this question and its answers appear to be based on the first-gen Xeon Phi, KNC (Knights Corner), not the later, better KNL (Knights Landing). — Peter Cordes, Nov 13 '19 at 00:31

score 2 · Accepted Answer · answered Nov 21 '14 at 23:34

The key is that you say, "I do nothing in the for loop." (Please correct me if I'm mistaken.)

Because of practical limits when the Xeon Phi was created, its cores are based upon a Pentium generation machine with various enhancements, such as dual issue, 4 threads per core, and the 512-bit vector engine. So if you are only running scalar code, it runs like a Pentium.

You need to run code that is both highly parallel and highly vectorizable. Even better if threads running on each core are able to share the core's pipeline without much contention, e.g. DGEMM, as well as take advantage of the cache structure.

By running a trivial benchmark, you are basically comparing the execution of code overhead on both your architectures (Xeon and Xeon Phi). And code overhead is typically scalar.

Here's an exaggerated illustration for those of us who are more visually inclined.

|<--Ovr-->|<--Work--------------->| repeat 10^6 times //Xeon Server

|<-----Ovr----->|<-Work->| repeat 10^6 times //Xeon Phi

Where "Ovr" is overhead, and "Work" is your highly threaded and vectorized workload.

If you have "Work" to do, then the Xeon Phi does better. If you remove the "Work", leaving only the overhead, the Xeon does better.
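To put rough numbers on the picture above, here is a minimal sketch of the overhead-plus-work model in Python. Every figure in it is made up for illustration (the rates are arbitrary units, not measurements of any real Xeon or Xeon Phi), and the names run_time, break_even_work, XEON and PHI are hypothetical. The only assumption is that overhead runs at a machine's scalar rate and work runs at its parallel/vector rate.

```python
# Toy model of one loop iteration: scalar overhead followed by parallel work.
# All rates are invented, in arbitrary "operations per second" units.

XEON = {"scalar_rate": 3.0, "parallel_rate": 10.0}  # assumed: fast scalar core
PHI = {"scalar_rate": 1.0, "parallel_rate": 30.0}   # assumed: slow scalar, wide parallel


def run_time(overhead_ops, work_ops, scalar_rate, parallel_rate):
    """Time for one iteration: overhead at scalar speed plus work at parallel speed."""
    return overhead_ops / scalar_rate + work_ops / parallel_rate


def break_even_work(overhead_ops, a, b):
    """Amount of work at which machines a and b take the same time.

    Assumes a has the higher scalar rate and b the higher parallel rate.
    """
    scalar_gap = 1.0 / b["scalar_rate"] - 1.0 / a["scalar_rate"]
    parallel_gap = 1.0 / a["parallel_rate"] - 1.0 / b["parallel_rate"]
    return overhead_ops * scalar_gap / parallel_gap


if __name__ == "__main__":
    # Fixed overhead of 100 ops, increasing amounts of parallel work.
    for work in (0, 500, 5000):
        t_xeon = run_time(100, work, **XEON)
        t_phi = run_time(100, work, **PHI)
        print(f"work={work:5d}  xeon={t_xeon:8.2f}  phi={t_phi:8.2f}")
    print("break-even work:", break_even_work(100, XEON, PHI))
```

With these invented rates the Xeon is faster when there is no work at all, and the Phi only pulls ahead once the work per iteration is roughly ten times the overhead. Real ratios depend entirely on the hardware and the code.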

Thank you very much; you made it clear what I was actually measuring in my example. I knew that getting good performance on the Xeon Phi requires both parallelism and vectorization, but what I want to do now is extract all possible substrings from a long string (an m x n problem). That is not a problem that can be vectorized. Is it absolutely impossible to do it well on the Xeon Phi? As I see it, any problem that can't be vectorized will perform badly on the Xeon Phi, right? — Marcus Wu, Nov 22 '14 at 13:27
Don't vectorize, parallelize. Xeon Phi isn't based on old Pentium cores just because that's what they had - the whole point is to have many weak cores, so you get to have better parallelism ("thread"-level or task-level instead of instruction-level) for simple tasks that are easy to break down and distribute. — Leeor, Nov 23 '14 at 18:35
Po Chang, memory bandwidth works as well. Since the cores are several generations old, very good thread-based parallelism is necessary but not enough. You also need a demand for either vectorization or memory bandwidth beyond what the Xeon can provide. — Taylor Kidd, Nov 24 '14 at 19:56

Computer architect · Answer 2 · 2014-12-18T17:49:17.200

2

Xeon Phi sucks. In moderately parallel applications, traditional Xeons trounce Xeon Phi; in massively parallel applications, GPGPUs rule. Xeon Phi is only marginally competitive when you can perfectly parallelize AND vectorize your application; if either one is not perfect, forget Xeon Phi.

EDIT: Some examples where Xeon Phi performs worse than either traditional Xeons or GPGPUs:

blog.xcelerit.com/intel-xeon-phi-vs-nvidia-tesla-gpu/

http://www.delaat.net/awards/2014-03-26-paper.pdf

https://verc.enes.org/ISENES2/documents/Talks/WS3HH/session-4-hpc-software-challenges-solutions-for-the-climate-community/markus-rampp-mic-experiences-at-mpg

edited Dec 18 '14 at 17:49

answered Dec 18 '14 at 17:08

All the comparisons I've seen that are independent of Intel got similar results: Xeon Phi (thanks for catching the typo) performs either worse than traditional Xeons or worse than GPGPUs. – Computer architect Dec 18 '14 at 17:43
Nah, the Phi doesn't suck. Your answer doesn't really address the question. – Daniel Paull Jun 02 '15 at 06:58
This answer might fail to address the original question, but it is true that Xeon Phi isn't as competitive as either normal Xeons or Nvidia GPUs. A few factors explain the small number of studies on the Phi's performance: 1) researchers are less inclined to publish results demonstrating that something is bad; 2) the second-gen Xeon Phi, the KNL, isn't widely available to the public yet, and many labs that got Xeon Phis from Intel's loaner program need Intel's permission to publish results on them. – Samuel Li Mar 11 '17 at 01:26
I agree with @Computerarchitect. I experimented with TensorFlow (MKL binary) and Keras, and in my experiments a Xeon Phi Knights Landing (68 cores) performed worse than an E5-2697 v4 (18 cores) CPU (Knights Landing was 30% slower than the CPU with my original code). After adapting the small sequential part of my Keras code, I obtained comparable speeds. But you have to adapt your code to the architecture, which is very inconvenient, just to reach the same speed as the CPU. For deep learning, the Intel Phi is useless compared to a GPGPU. – Fabiano Tarlao Oct 01 '18 at 12:48

Vahid Noormofidi · Answer 3 · 2015-09-22T22:43:48.983

First, you have to utilize the entire chip, i.e., utilize the SIMD units as well. Second, in order to utilize the Xeon Phi processor, the pipeline must not remain idle, i.e., there always have to be enough instructions in the pipeline. In your benchmark no instructions are issued, so you basically measured the launch of an empty loop (which is likely optimized out by your compiler), and because of the CPU's higher clock speed it runs faster on the CPU.

In addition, in my benchmarks I found that the Xeon Phi's performance is very sensitive to the length of the innermost loop (that runs on SIMD units).
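One possible contributor to that sensitivity (this is an assumption, not something the answer states) is lane waste: a 512-bit vector holds 8 doubles or 16 floats, so an inner loop whose trip count is not a multiple of the vector width leaves lanes idle in its last vector. Here is a minimal Python sketch of that arithmetic. The helpers lanes and lane_utilization are hypothetical, and the model ignores masking, alignment, unrolling and everything else a real compiler does.

```python
import math

VECTOR_BITS = 512  # vector width of the Xeon Phi, as given in the accepted answer


def lanes(element_bits):
    """Number of elements of the given size that fit in one vector."""
    return VECTOR_BITS // element_bits


def lane_utilization(trip_count, element_bits=64):
    """Fraction of vector lanes doing useful work for an inner loop.

    Assumes trip_count > 0 and that the loop is executed as whole vectors,
    with the final partial vector padded out (a simplification).
    """
    n_lanes = lanes(element_bits)
    vectors = math.ceil(trip_count / n_lanes)
    return trip_count / (vectors * n_lanes)


if __name__ == "__main__":
    for n in (3, 8, 9, 64, 65):
        print(f"trip count {n:3d}: "
              f"doubles {lane_utilization(n, 64):.3f}, "
              f"floats {lane_utilization(n, 32):.3f}")
```

In this simplified model, a trip count of 8 doubles fills one vector completely, while a trip count of 9 needs two vectors and uses just over half of their lanes. Short or awkwardly sized inner loops can therefore waste a large share of the SIMD hardware.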
