
I know that NumPy can use different backends like OpenBLAS or MKL. I have also read that MKL is heavily optimized for Intel, so people usually suggest using OpenBLAS on AMD, right?

I use the following test code:

import numpy as np

def testfunc(x):
    np.random.seed(x)                # reproducible input
    X = np.random.randn(2000, 4000)  # 2000x4000 Gaussian random matrix
    np.linalg.eigh(X @ X.T)          # matmul + symmetric eigendecomposition (BLAS/LAPACK heavy)

%timeit testfunc(0)  # IPython magic; run in IPython/Jupyter

I have tested this code using different CPUs:

  • On an Intel Xeon E5-1650 v3, this code runs in 0.7 s, using 6 out of 12 cores.
  • On an AMD Ryzen 5 2600, this code runs in 1.45 s, using all 12 cores.
  • On an AMD Ryzen Threadripper 3970X, this code runs in 1.55 s, using all 64 cores.

I am using the same Conda environment on all three systems. According to np.show_config(), the Intel system uses the MKL backend for NumPy (libraries = ['mkl_rt', 'pthread']), whereas the AMD systems use OpenBLAS (libraries = ['openblas', 'openblas']); a snippet to reproduce this check follows the list below. The CPU core usage was determined by observing top in a Linux shell:

  • For the Intel Xeon E5-1650 v3 CPU (6 physical cores), it shows 12 cores (6 idling).
  • For the AMD Ryzen 5 2600 CPU (6 physical cores), it shows 12 cores (none idling).
  • For the AMD Ryzen Threadripper 3970X CPU (32 physical cores), it shows 64 cores (none idling).
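
For reference, the backend check itself is a one-liner; np.show_config() prints the build-time BLAS/LAPACK configuration (the exact output format varies between NumPy versions):

import numpy as np

np.show_config()  # lists e.g. 'mkl_rt' for MKL builds, 'openblas' for OpenBLAS builds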

The above observations give rise to the following questions:

  1. Is it normal that linear algebra on up-to-date AMD CPUs using OpenBLAS is that much slower than on a six-year-old Intel Xeon? (also addressed in Update 3)
  2. Judging by the CPU load, it looks like NumPy utilizes the multi-core environment in all three cases. How can the Threadripper be even slower than the Ryzen 5, even though it has almost six times as many physical cores? (also see Update 3)
  3. Is there anything that can be done to speed up the computations on the Threadripper? (partially answered in Update 2)

Update 1: The OpenBLAS version is 0.3.6. I read somewhere that upgrading to a newer version might help; however, with OpenBLAS updated to 0.3.10, the performance of testfunc on the AMD Ryzen Threadripper 3970X is still 1.55 s.
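
For reference, here is one way to verify at run time which BLAS library (and version) NumPy has actually loaded. It relies on the third-party threadpoolctl package, so the dictionary keys below are that package's, not NumPy's:

import numpy as np  # import first, so that the BLAS library gets loaded
from threadpoolctl import threadpool_info

for pool in threadpool_info():
    # one entry per native thread pool, e.g. OpenBLAS or MKL
    print(pool.get("internal_api"), pool.get("version"), pool.get("num_threads"))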


Update 2: Using the MKL backend for NumPy in conjunction with the environment variable MKL_DEBUG_CPU_TYPE=5 (as described here) reduces the run time for testfunc on the AMD Ryzen Threadripper 3970X to only 0.52 s, which is actually more or less satisfying. FTR, setting this variable via ~/.profile did not work for me on Ubuntu 20.04, and neither did setting it from within Jupyter; putting it into ~/.bashrc works. Anyway: performing 35% faster than an old Intel Xeon, is this all we get, or can we get more out of it?
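
For anyone who cannot (or does not want to) touch shell startup files, here is a minimal sketch of setting the variable from Python itself. The assumption is that MKL reads MKL_DEBUG_CPU_TYPE when the library is loaded, so the assignment must happen before the first import of NumPy, which would also explain why setting it inside an already-running Jupyter session did not work:

import os
os.environ["MKL_DEBUG_CPU_TYPE"] = "5"  # must be set before NumPy/MKL is loaded

import numpy as np  # MKL now picks the optimized code path on AMD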


Update 3: I played around with the number of threads used by MKL/OpenBLAS:

[Table: run time by the number of threads and library]

The run times are reported in seconds. The best result of each column is underlined. I used OpenBLAS 0.3.6 for this test; a sketch for reproducing such a thread sweep follows the list below. The conclusions from this test:

  • The single-core performance of the Threadripper using OpenBLAS is a bit better than that of the Xeon (11% faster); its single-core performance is better still when using MKL (34% faster).
  • The multi-core performance of the Threadripper using OpenBLAS is ridiculously worse than that of the Xeon. What is going on here?
  • The Threadripper performs better than the Xeon overall when MKL is used (26% to 38% faster). The overall best performance is achieved by the Threadripper using 16 threads and MKL (36% faster than the Xeon).
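
The sweep itself can be scripted. Here is a minimal sketch using the third-party threadpoolctl package to cap the BLAS thread pool per run; the environment-variable route (OPENBLAS_NUM_THREADS / MKL_NUM_THREADS) works as well, but only if the variables are set before NumPy is imported:

import timeit
import numpy as np
from threadpoolctl import threadpool_limits

def testfunc(x):
    np.random.seed(x)
    X = np.random.randn(2000, 4000)
    np.linalg.eigh(X @ X.T)

for n in (1, 2, 4, 8, 16, 32, 64):
    # limit all BLAS thread pools (OpenBLAS or MKL) to n threads
    with threadpool_limits(limits=n, user_api="blas"):
        t = min(timeit.repeat(lambda: testfunc(0), number=1, repeat=3))
        print(f"{n:3d} threads: {t:.2f} s")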

Update 4: Just for clarification. No, I do not think that (a) this or (b) that answers this question. (a) suggests that "OpenBLAS does nearly as well as MKL", which strongly contradicts the numbers I observed; according to my numbers, OpenBLAS performs ridiculously worse than MKL. The question is why. (a) and (b) both suggest using MKL_DEBUG_CPU_TYPE=5 in conjunction with MKL to achieve maximum performance. This might be right, but it explains neither why OpenBLAS is that dead slow, nor why, even with MKL and MKL_DEBUG_CPU_TYPE=5, the 32-core Threadripper is only 36% faster than the six-year-old 6-core Xeon.

theV0ID
  • Maybe relevant: https://www.pugetsystems.com/labs/hpc/AMD-Ryzen-3900X-vs-Intel-Xeon-2175W-Python-numpy---MKL-vs-OpenBLAS-1560/; also Google "openblas vs MKL". – qwr Jul 07 '20 at 21:01
  • I'd suspect inter-core latency might be a bigger issue across CCX clusters of 4 cores on Threadripper? 3970X is a [Zen 2](https://en.wikichip.org/wiki/amd/microarchitectures/zen_2) part, so it should have 2x 256-bit SIMD FMA throughput (per core), same as Intel Haswell. Perhaps a library tuned for AMD is only using 128-bit SIMD because that was sometimes better for Zen1. (Your Ryzen 5 2600 *is* a Zen1, 1x 128-bit FMA uop per clock, so it's crazy that it's slower than a Zen2). Different BLAS libraries might be a big factor. – Peter Cordes Jul 07 '20 at 22:34
  • Perhaps using both logical cores of one physical core might be creating more cache misses; only using 6 cores on the Intel CPU leaves the full size of the private caches of each physical core for one thread. Also, what clock speeds are those chips running at? They should be similar. – Peter Cordes Jul 07 '20 at 22:35
  • I'd advise running comparisons with different numbers of threads (`OPENBLAS_NUM_THREADS`, `MKL_NUM_THREADS`). Server processors have slower per-core speed, and multicore speedups in BLAS libraries are usually very appalling. – amiasato Jul 08 '20 at 00:45
  • Generating random numbers takes a lot of time (1/4 of the total time on my system). It would be better to time only `np.linalg.eigh(X @ X.T)`; a sketch of such a refined benchmark follows this comment thread. Also set `MKL_NUM_THREADS` to the number of physical cores; these BLAS algorithms usually scale negatively with virtual cores. – max9111 Jul 08 '20 at 08:12
  • For a broader overview you could also run ibench. Setting the OpenMP thread granularity may also help, as may adapting this https://github.com/fo40225/Anaconda-Windows-AMD `MKL_DEBUG_CPU_TYPE=5` approach for your 32-core CPU (xxx_cpuinfo.txt). – max9111 Jul 08 '20 at 08:32
  • Now that you have more perf ratios, including MKL on both machines, it would be even more useful / relevant to include *clock speeds* (specifically, the actual turbo clock speed your machine used when running those tests.) – Peter Cordes Jul 08 '20 at 08:56
  • @PeterCordes I was wondering, how to determine them? – theV0ID Jul 08 '20 at 10:05
  • Intel documents the single-core max turbo, and you can just manually look at clock speeds while the benchmark is running (`grep MHz /proc/cpuinfo` or whatever). Ideally run your program under `perf` on Linux: `perf stat my_benchmark` to record HW performance counters, which include the `cycles` event, and will calculate the average clock speed the CPU actually ran at over the benchmark interval (by dividing `cycles` by the `task-clock` kernel event). – Peter Cordes Jul 08 '20 at 15:28
  • Thanks @PeterCordes. However, I had to change the benchmark a bit, since I now measure the execution time using `perf stat` rather than Python's own `%timeit` instruction. This means that the `import numpy` instruction is also measured, which leads to different results. This is why I decided to [summarize them in a Google Sheet](https://docs.google.com/spreadsheets/d/1CCdeSFh8wvzmYdIaQYOlrD0utLwCOP2Ih6m_BzTd0os/edit?usp=sharing) first: if you think that this experiment provides more important insights, I will replace the experiment and the results in my original question. Let me know! – theV0ID Jul 08 '20 at 16:36
  • You could run the whole Python timeit under perf just to find out the average clock speed, with the actual timed interval still being measured by Python. Or fork off a `perf stat -p $PID` *after* initializing, so it attaches right as you're starting the benchmark. – Peter Cordes Jul 08 '20 at 20:36
  • As far as I know: pandas, scikit, PyTorch, TensorFlow, Matplotlib, IPython, SymPy and NumExpr use MKL; NumPy has been switching to OpenBLAS since 1.18. I was planning a Threadripper workstation, but I haven't got the time and knowledge to compile every one of these on my own. How did you decide? – Pablo Jul 12 '20 at 18:02
  • @Pablo I'm running Numpy 1.18.5 with MKL and the `MKL_DEBUG_CPU_TYPE`-hack, the speed is ok. – theV0ID Jul 12 '20 at 18:10
  • Does this answer your question? [When you have an AMD CPU, can you speed up code that uses the Intel-MKL?](https://stackoverflow.com/questions/63174453/when-you-have-an-amd-cpu-can-you-speed-up-code-that-uses-the-intel-mkl) your [question has "the same answers. This includes not only word-for-word duplicates, but also the same idea expressed in different words"](https://meta.stackexchange.com/questions/10841/how-should-duplicate-questions-be-handled). The linked question is more general (not specific to `Ryzen`/`python`/`numpy`). Disclaimer: The question I linked to is my own question. – Trevor Boyd Smith Aug 03 '20 at 13:01
  • @TrevorBoydSmith See my Update 4 – theV0ID Aug 03 '20 at 13:13
  • @theV0ID re `OpenBLAS performs ridiculously worse than MKL. The question is ... why OpenBLAS is that dead slow` is the same in my opinion (or **very** similar) as asking `why is an open-source software implementation slower than a closed-source software implementation?` which can not be answered because the closed-source software is not available. – Trevor Boyd Smith Aug 03 '20 at 13:32
  • @theV0ID re `32-core Threadripper is only 36% faster than the six-year-old 6-core Xeon` how you profile and generate your measurements matters a lot when doing comparisons like this. If you post your benchmarking code, then someone could at least answer the question of 'why is my benchmarking code slower given x,y,z' (for example: Intel-Python benchmark code is [open-sourced to show how/why so much faster](https://software.intel.com/content/www/us/en/develop/tools/distribution-for-python/benchmarks.html)). – Trevor Boyd Smith Aug 03 '20 at 14:18
  • even if Intel's code was open-source it would still require knowledge of the hardware implementation... which again isn't available... and so an answer isn't possible because that hardware implementation isn't available. – Trevor Boyd Smith Aug 03 '20 at 14:19
  • @TrevorBoydSmith My benchmarking code is at the top of the original question – theV0ID Aug 03 '20 at 15:44
  • @TrevorBoydSmith All I'm saying is that my measurements hugely contradict the observations in the link. According to the link, OpenBLAS is supposed to perform comparably well to MKL. It does not, not even close. The question is why. Someone else observed that OpenBLAS performs comparably well, so I do not believe that this boils down to open source vs. closed source. – theV0ID Aug 03 '20 at 15:45
  • Have you tried asking the maintainers at https://github.com/numpy/numpy. This is something I want to understand too – Akshay Nov 29 '20 at 08:19
  • @Akshay Good idea, but I had no time yet. I will try to get that done the next days. – theV0ID Dec 14 '20 at 15:26
  • Your benchmark is absolutely wrong, and it looks like you are trying to start a holy war rather than a real comparison. This part, `X = np.random.randn(2000, 4000)`, most probably doesn't parallelise well. You have to use the internal NumPy benchmarks for a fair comparison: https://numpy.org/doc/stable/benchmarking.html. You also have to limit the number of cores, because your matrices or test conditions might be too weak (small) for the available resources (CPU core count etc.), which actually slows things down instead of speeding them up. – Alex Nov 19 '21 at 12:45
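
Following max9111's suggestion above, a variant of the benchmark that keeps random-number generation out of the timed region might look like this (a sketch; its numbers are not directly comparable to the ones reported above):

import timeit
import numpy as np

np.random.seed(0)
X = np.random.randn(2000, 4000)  # generated once, outside the timed region

# time only the matmul + symmetric eigendecomposition
t = min(timeit.repeat(lambda: np.linalg.eigh(X @ X.T), number=1, repeat=5))
print(f"eigh(X @ X.T): {t:.2f} s")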

3 Answers


As of 2021, Intel has unfortunately removed MKL_DEBUG_CPU_TYPE to prevent people on AMD CPUs from using the workaround presented in the accepted answer. This means that the workaround no longer works with MKL 2021 and later, and AMD users have to either switch to OpenBLAS or keep using an older MKL.

To still use the workaround, follow this method:

  1. Create a conda environment in which NumPy is built against MKL 2019.
  2. Activate the environment.
  3. Set MKL_DEBUG_CPU_TYPE=5.

The commands for the above steps:

  1. conda create -n my_env -c anaconda python numpy mkl=2019.* blas=*=*mkl
  2. conda activate my_env
  3. conda env config vars set MKL_DEBUG_CPU_TYPE=5

And that's it!
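
A quick sanity check inside the freshly created environment (the mkl Python module ships with conda's MKL packages; mkl.get_version_string() is also mentioned in the comments below):

import os
import mkl  # Python bindings installed alongside conda's MKL

print(mkl.get_version_string())              # should report a 2019.x build
print(os.environ.get("MKL_DEBUG_CPU_TYPE"))  # should print '5'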

AstroTeen
  • You do currently have enough rep to comment, thanks to your useful contributions getting upvotes :). This is actually a relevant answer for future readers facing the problem of slow MKL Numpy on AMD CPUs, though, so it's fine. In some cases it might be better to suggest an edit to an existing answer, pointing out that it doesn't work with the latest MKL, but here a separate answer makes as much sense as editing 3 different answers. Especially if you make this into an answer that does directly address the question here. – Peter Cordes Aug 26 '21 at 17:26
  • I think you can still use an older MKL version, right? At least, 2020.0 still works for me. – theV0ID Aug 27 '21 at 14:52
  • I use `mkl=2020.0` along with `blas=*=mkl` in my environment .yml files; however, I am not 100% sure that it works, since I have noticed some strange slowdowns in a recently created environment. – theV0ID Aug 27 '21 at 18:27
  • There is no "accepted answer" on this question. It's usually not a good idea to copy/paste the identical answers onto different questions, since future editors will need to find them both / all. This should probably still be a link to [your answer on another question](https://stackoverflow.com/questions/63174453/when-you-have-an-amd-cpu-can-you-speed-up-code-that-uses-the-intel-mkl/68942702#68942702) for the full step-by-step guide, maybe just say here to use 2019 MKL with the `MKL_DEBUG_CPU_TYPE=5` environment setting, see that for full details. – Peter Cordes Sep 16 '21 at 07:37
  • And you can make the rest of this answer be specific to this question by describing what Intel's "cripple-AMD" function actually does. – Peter Cordes Sep 16 '21 at 07:39
  • I am confused: we are in October 2021 and typing `!export MKL_DEBUG_CPU_TYPE=5` before running my Python script still improved the overall processing time. – Sheldon Oct 16 '21 at 23:35
  • @Sheldon What is your MKL version? Use `mkl.get_version_string()` to find out. – AstroTeen Oct 17 '21 at 07:37
  • Thanks for your reply @Astro: `mkl.get_version_string()` yields 'Intel(R) oneAPI Math Kernel Library Version 2021.2-Product Build 20210312 for Intel(R) 64 architecture applications' – Sheldon Oct 17 '21 at 19:34
  • @Sheldon And how is that working? Because Intel removed that flag starting MKL 2021, and `!export MKL_DEBUG_CPU_TYPE=5` shouldn't give you any boost. It is bizarre how you are able to get increased performance. Are you sure that you are getting an improvement? If yes, let us know! – AstroTeen Oct 18 '21 at 17:48
  • People, it seems that the workaround has been removed since it's *no longer necessary*: as of 2021, recent versions of MKL should perform well even on Ryzens. I'd like Ryzen users to confirm this. – MadHatter Jan 30 '22 at 19:40

Wouldn't it make sense to try using an optimized BLIS library from AMD?

Maybe I am missing (misunderstanding) something, but I would assume you could use BLIS instead of OpenBLAS. The only potential problem could be that AMD BLIS is optimized for AMD EPYC (but you're using Ryzen). I'm VERY curious about the results, since I'm in the process of buying a server for work, and am considering AMD EPYC and Intel Xeon.

Here are the respective AMD BLIS libraries: https://developer.amd.com/amd-aocl/

tryptofame
  • Even though installation of BLIS via conda looks easy, it seems non-straightforward to me how to make NumPy actually use BLIS as the backend. However, according to [this](https://news.ycombinator.com/item?id=21738644), MKL outperforms BLIS on Ryzen (*"with some quick/dirty results on my Ryzen 3700X [...] You can see performance basically double on MKL when `MKL_DEBUG_CPU_TYPE=5` is used"*). – theV0ID Aug 13 '20 at 15:13
  • How to compile and install NumPy with BLIS linked to AMD's AOCL BLIS:
    # download files from https://developer.amd.com/amd-aocl/
    # unpack to e.g. /home/AOCL/2.2
    # create ~/.numpy-site.cfg containing:
    [blis]
    libraries = blis
    library_dirs = /home/AOCL/2.2/lib
    include_dirs = /home/AOCL/2.2/include
    runtime_library_dirs = /home/AOCL/2.2/lib
    # then build NumPy against it:
    git clone https://github.com/numpy/numpy.git
    cd numpy
    pip install .
    – tryptofame Sep 10 '20 at 12:30

I think this should help:

"The best result in the chart is for the TR 3960x using MKL with the environment var MKL_DEBUG_CPU_TYPE=5. AND it is significantly better than the low optimization code path from MKL alone. AND,OpenBLAS does nearly as well as MKL with MKL_DEBUG_CPU_TYPE=5 set." https://www.pugetsystems.com/labs/hpc/How-To-Use-MKL-with-AMD-Ryzen-and-Threadripper-CPU-s-Effectively-for-Python-Numpy-And-Other-Applications-1637/

How to set it up: 'Make the setting permanent by entering MKL_DEBUG_CPU_TYPE=5 into the System Environment Variables. This has several advantages, one of them being that it applies to all instances of Matlab and not just the one opened using the .bat file' https://www.reddit.com/r/matlab/comments/dxn38s/howto_force_matlab_to_use_a_fast_codepath_on_amd/?sort=new

poloniki
  • If that fully explains the perf diff, this question is a duplicate of [When you have an AMD CPU, can you speed up code that uses the Intel-MKL?](https://stackoverflow.com/q/63174453) . (Those links with more details and test results might be good as a comment there.) – Peter Cordes Jul 31 '20 at 14:31
  • Yeah, I've been on that link before, but doesn't the *"OpenBLAS does nearly as well as MKL with MKL_DEBUG_CPU_TYPE=5"* actually **contradict** the performance measures I reported? OpenBLAS does *significantly worse* than MKL. – theV0ID Jul 31 '20 at 16:05
  • By strange coincidence I wrote the same solution a day earlier over here https://stackoverflow.com/a/63174454/52074 for a **more general question** about Intel-MKL that is not specific to `AMD-Ryzen` and not specific to `numpy`. One of [the comments on my solution pointed me over here](https://stackoverflow.com/questions/63174453/when-you-have-an-amd-cpu-can-you-speed-up-code-that-uses-the-intel-mkl/63174454?noredirect=1#comment111804813_63174454). – Trevor Boyd Smith Aug 03 '20 at 12:52