What version of macOS are you running? What year is the machine? I suspect that for older machines or macOS < 10.14 you don't see a default, because PlaidML has heeded Apple's deprecation of OpenGL/CL in 10.14 in favor of Metal.
FWIW, on my machine I see similar options, except the metal devices are listed under "Default Config Devices."
As for each of these options, here they are briefly (okay, maybe I got carried away) explained:
You can train/run ML models on CPUs or GPUs. CPUs aren't as well suited for the pipelines of matrix math that are common in ML applications. Modern CPUs have Streaming SIMD Extensions (SIMD means Single Instruction, Multiple Data), or SSE. These allow you to perform a more limited set of matrix-like operations. For example, when adding two vectors, instead of considering each pair of elements and adding them one by one, SIMD allows you to add many numbers at once. As a concrete example, consider compiling the following code with clang -O3 -march=native:
#include <array>
auto add(std::array<float, 64> a, std::array<float, 64> b) {
std::array<float, 64> output;
for (size_t i = 0; i < 64; i++) {
output[i] = a[i] + b[i];
}
return output;
}
We can see two different compilations depending on whether we pass -mno-sse (which, as you might guess, produces a binary that works on CPUs without SSE). With SSE:
add(std::array<float, 64ul>, std::array<float, 64ul>):
mov rax, rdi
vmovups zmm0, zmmword ptr [rsp + 8]
vaddps zmm0, zmm0, zmmword ptr [rsp + 264]
vmovups zmmword ptr [rdi], zmm0
vmovups zmm0, zmmword ptr [rsp + 72]
vaddps zmm0, zmm0, zmmword ptr [rsp + 328]
vmovups zmmword ptr [rdi + 64], zmm0
vmovups zmm0, zmmword ptr [rsp + 136]
vaddps zmm0, zmm0, zmmword ptr [rsp + 392]
vmovups zmmword ptr [rdi + 128], zmm0
vmovups zmm0, zmmword ptr [rsp + 200]
vaddps zmm0, zmm0, zmmword ptr [rsp + 456]
vmovups zmmword ptr [rdi + 192], zmm0
vzeroupper
ret
Without SSE:
add(std::array<float, 64ul>, std::array<float, 64ul>):
mov rax, rdi
lea rcx, [rsp + 264]
lea rdx, [rsp + 8]
xor esi, esi
.LBB0_1:
fld dword ptr [rdx + 4*rsi]
fadd dword ptr [rcx + 4*rsi]
fstp dword ptr [rax + 4*rsi]
fld dword ptr [rdx + 4*rsi + 4]
fadd dword ptr [rcx + 4*rsi + 4]
fstp dword ptr [rax + 4*rsi + 4]
fld dword ptr [rdx + 4*rsi + 8]
fadd dword ptr [rcx + 4*rsi + 8]
fstp dword ptr [rax + 4*rsi + 8]
fld dword ptr [rdx + 4*rsi + 12]
fadd dword ptr [rcx + 4*rsi + 12]
fstp dword ptr [rax + 4*rsi + 12]
fld dword ptr [rdx + 4*rsi + 16]
fadd dword ptr [rcx + 4*rsi + 16]
fstp dword ptr [rax + 4*rsi + 16]
fld dword ptr [rdx + 4*rsi + 20]
fadd dword ptr [rcx + 4*rsi + 20]
fstp dword ptr [rax + 4*rsi + 20]
fld dword ptr [rdx + 4*rsi + 24]
fadd dword ptr [rcx + 4*rsi + 24]
fstp dword ptr [rax + 4*rsi + 24]
fld dword ptr [rdx + 4*rsi + 28]
fadd dword ptr [rcx + 4*rsi + 28]
fstp dword ptr [rax + 4*rsi + 28]
add rsi, 8
cmp rsi, 64
jne .LBB0_1
ret
You don't need to deeply understand what's going on here, but notice the instructions that begin with v in the SSE binary. Those are AVX instructions, and zmm0 is an AVX-512 register that can hold 16 floats (AVX-512 provides 512-bit registers; floats are 32 bits). LLVM takes advantage of this: instead of adding the numbers element by element (like we wrote in our original code), it does them 16 at a time. You see 4 variations of the following assembly one after the other (pay attention to the math inside the parentheses):
vmovups zmm0, zmmword ptr [rsp + (8 + 64*N)]
vaddps zmm0, zmm0, zmmword ptr [rsp + (8 + 4*64 + 64*N)]
vmovups zmmword ptr [rdi + (64*N)], zmm0
The math here requires a bit of knowledge about the System V calling ABI. Simply put, ignore the 8 +. [rsp + 64*N] gets you a[16*N] to a[16*(N+1)], exclusive. [rsp + (4*64 + 64*N)] skips all of a (a is 64 floats, each of size 4 bytes) and gets you b[16*N] to b[16*(N+1)], exclusive. And [rdi + (64*N)] is output[16*N] to output[16*(N+1)], exclusive. So this effectively translates to the following pseudocode:
std::array<float, 16> temp = {a[16*N], a[16*N+1], ..., a[16*N+15]};
temp += {b[16*N], b[16*N+1], ..., b[16*N+15]};
{output[16*N], output[16*N+1], ..., output[16*N+15]} = temp;
So indeed, we see that AVX-512 (a SIMD instruction-set extension) allows us to do the addition in chunks of 16 numbers at a time. Compare this quickly to the -mno-sse version; it should be clear that it's doing a lot more work. Again we have a pattern of instructions (although this time it's in a loop):
fld dword ptr [rdx + 4*rsi + 4*N]
fadd dword ptr [rcx + 4*rsi + 4*N]
fstp dword ptr [rax + 4*rsi + 4*N]
There are eight of these (with N ranging from 0 to 8, exclusive). This is wrapped in a loop which repeats 8 times (8 * 8 = 64, the array length). You should be able to guess what's going on here: it's very similar to the above, except we work on one number at a time instead of 16. fld is similar to vmovups, and fadd is similar to vaddps. The pseudocode for this would look more like the code we actually wrote:
float temp = a[loop_num*8 + N];
temp += b[loop_num*8 + N];
output[loop_num*8 + N] = temp;
Hopefully, it is intuitive that it will be much more efficient to do things 16 at a time than 1 at a time.
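Just to spell the 16-at-a-time idea out by hand, here is a minimal sketch written with AVX-512 intrinsics (this assumes an AVX-512-capable CPU; it's simply a hand-written equivalent of the loop LLVM vectorized for us above, not something you need to write yourself):
// Compile with: clang++ -O3 -mavx512f -std=c++17 simd_add.cpp
#include <immintrin.h>
#include <array>
std::array<float, 64> add(const std::array<float, 64>& a,
                          const std::array<float, 64>& b) {
    std::array<float, 64> output;
    for (size_t i = 0; i < 64; i += 16) {
        __m512 va = _mm512_loadu_ps(a.data() + i);   // load 16 floats from a
        __m512 vb = _mm512_loadu_ps(b.data() + i);   // load 16 floats from b
        __m512 sum = _mm512_add_ps(va, vb);          // 16 additions in a single instruction
        _mm512_storeu_ps(output.data() + i, sum);    // store 16 results into output
    }
    return output;
}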
There are also fancy linear algebra frameworks like BLAS, which can squeeze just about all the performance you can get out of a CPU when it comes to math.
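As a sketch of what using one looks like, the same 64-element addition can be handed to the BLAS saxpy routine (saxpy computes y = alpha*x + y). On macOS the CBLAS interface ships with Apple's Accelerate framework; on Linux you'd include cblas.h and link a BLAS such as OpenBLAS instead:
// Compile on macOS with: clang++ -O3 blas_add.cpp -framework Accelerate
#include <Accelerate/Accelerate.h>  // provides cblas_saxpy on macOS
#include <array>
std::array<float, 64> add(const std::array<float, 64>& a, std::array<float, 64> b) {
    // b = 1.0f * a + b, i.e. the elementwise sum, in a single library call
    cblas_saxpy(64, 1.0f, a.data(), 1, b.data(), 1);
    return b;
}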
GPUs work a bit differently. A gross simplification would be to think of a GPU as a device with huge SIMD instructions (particularly suited for floating-point operations). So instead of working 16 numbers at a time, imagine handing it an entire image and having it apply a pixel filter (like changing the brightness or saturation) in one operation.
So what does that tangent have to do with anything?
AVX instructions make it somewhat reasonable to run some code on the CPU. All the options you see with _cpu in them will only run on the CPU. llvm_cpu will likely use techniques similar to those above (clang uses LLVM behind the scenes) to compile all of the math necessary to run/train your ML models. Given that modern CPUs are multicore, this can be as much as a 16 * number_of_cores speedup.
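To make that 16 * number_of_cores figure concrete, here's a rough sketch (my own illustration, not PlaidML's actual implementation) of splitting an elementwise addition across cores, where the compiler can vectorize each thread's slice just like before:
// Compile with: clang++ -O3 -march=native -std=c++17 threaded_add.cpp
#include <cstddef>
#include <thread>
#include <vector>
// Each thread runs a plain loop over its own slice; with -O3 -march=native,
// clang can vectorize this inner loop with AVX, so every core does many floats per instruction.
void add_slice(const float* a, const float* b, float* out, std::size_t begin, std::size_t end) {
    for (std::size_t i = begin; i < end; i++) {
        out[i] = a[i] + b[i];
    }
}
void parallel_add(const std::vector<float>& a, const std::vector<float>& b,
                  std::vector<float>& out) {
    std::size_t num_threads = std::thread::hardware_concurrency();
    if (num_threads == 0) num_threads = 1;  // hardware_concurrency() may report 0
    std::vector<std::thread> threads;
    std::size_t chunk = a.size() / num_threads;
    for (std::size_t t = 0; t < num_threads; t++) {
        std::size_t begin = t * chunk;
        std::size_t end = (t + 1 == num_threads) ? a.size() : begin + chunk;
        threads.emplace_back(add_slice, a.data(), b.data(), out.data(), begin, end);
    }
    for (auto& th : threads) th.join();  // wait for all cores to finish their chunks
}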
OpenCL is an open standard for writing math computations and easily running them on various hardware (including GPUs). OpenCL can also be emulated by CPUs (admittedly at a much slower rate; remember, CPUs can only do 16 numbers at a time, while GPUs can do many more).
Metal is Apple's replacement for OpenGL/CL. It accomplishes similar things, but is macOS specific (and closed source).
The only difference left to comment on is "Intel(R) HD Graphics 630" vs. "AMD Radeon 460." Your computer has two GPUs. The first is an integrated graphics card; "integrated" here means that your Intel CPU has a little GPU embedded inside of it. It isn't as performant as a discrete GPU (one that's separate from the CPU, often found in card form factors for desktops), but it gets the job done for certain less intensive graphics tasks (and is typically more power efficient). Your AMD Radeon 460 is a discrete GPU. It will likely be the most powerful piece of hardware you have for this task.
So with that in mind, I predict the devices will be, fastest to slowest:
1. metal_amd_radeon_pro_460.0 - Discrete GPUs are fast, and Apple has optimized Metal to work very well on new Macs
2. opencl_amd_amd_radeon_pro_555_compute_engine.0 - This still uses the discrete GPU, but OpenCL has been neglected a bit and is now deprecated on macOS, so it likely won't be as fast
3. metal_intel(r)_hd_graphics_unknown.0 - Integrated GPUs are better than CPUs, and Apple has optimized Metal
4. opencl_intel_intel(r)_hd_graphics_630.0 - Ditto regarding the other OpenCL option (except this is an integrated, not a discrete, GPU)
5. llvm_cpu.0 - This uses the CPU, but LLVM is pretty good at writing efficient SIMD code.
6. opencl_cpu.0 - This emulates (2) and (4), except using your CPU, which will be much slower. Additionally, it likely doesn't have all the fancy algorithms LLVM uses to output efficient SIMD code.
But all this is speculation; you can test it with pip install plaidbench plaidml-keras keras. For each device, run plaidml-setup (selecting that device) and then run plaidbench keras mobilenet (or any of the other benchmarks). Here are the results I see on my machine:
| device | execution time (s) | fps | correctness |
|------------------------------|---------------|--------|-------------|
| Metal AMD Radeon Pro 560 | 9.009 | 112.53 | PASS |
| OpenCL AMD Radeon Pro 560 | 18.339 | 93.29 | PASS |
| OpenCL Intel HD Graphics 630 | 23.204 | 60.18 | FAIL |
| Metal Intel HD Graphics 630 | 24.809 | 41.27 | PASS |
| LLVM CPU | 66.072 | 16.82 | PASS |
| OpenCL CPU Emulation | 155.639 | 6.71 | FAIL |
I've renamed the devices to have prettier names, but their mapping to the identifiers should be obvious.
Execution time is the time it took to run the model (lower is better), and FPS is the frames per second the execution achieved (higher is better).
We note that the order is generally what we expected: the discrete GPU is faster than the integrated GPU, which is faster than the CPU. An important thing to call out is that OpenCL on the integrated GPU and the OpenCL CPU emulation failed the correctness check. The CPU emulation was only off by about 7%, but the integrated GPU was off by about 77%. You probably only want to choose a device that passes the correctness check on your machine (it's possible, but not guaranteed, that the backend or device itself is buggy if it fails that check).
tl;dr Use metal + discrete GPU (AMD Radeon). It is the fastest device you have available. Using anything CPU-based will only spin up your fans and consume a ton of power (and take forever to finish/train).