What version of macOS are you running? What year is the machine? I suspect that for older machines or macOS < 10.14 you don't see a default, because PlaidML has heeded Apple's deprecation of OpenGL/CL in 10.14 in favor of Metal.
FWIW, on my machine I see similar options, except the metal devices are listed under "Default Config Devices."
As for each of these options, here they are briefly (okay, maybe I got carried away) explained:
You can train/run ML models on CPUs or GPUs. CPUs aren't as well suited for the pipelines of matrix math that are common in ML applications. Modern CPUs have Streaming SIMD Extensions (SIMD means Single Instruction, Multiple Data), or SSE. These allow you to perform a more limited set of matrix-like operations. For example, when adding two vectors, instead of considering each pair of elements and adding them one by one, SIMD allows you to add many numbers at once. As a concrete example, consider compiling the following code with clang -O3 -march=native:
#include <array>
auto add(std::array<float, 64> a, std::array<float, 64> b) {
std::array<float, 64> output;
for (size_t i = 0; i < 64; i++) {
output[i] = a[i] + b[i];
}
return output;
}
We can see two different compilations depending on whether we pass -mno-sse (which, as you might guess, produces a binary that works on CPUs without SSE). With SSE:
add(std::array<float, 64ul>, std::array<float, 64ul>):
mov rax, rdi
vmovups zmm0, zmmword ptr [rsp + 8]
vaddps zmm0, zmm0, zmmword ptr [rsp + 264]
vmovups zmmword ptr [rdi], zmm0
vmovups zmm0, zmmword ptr [rsp + 72]
vaddps zmm0, zmm0, zmmword ptr [rsp + 328]
vmovups zmmword ptr [rdi + 64], zmm0
vmovups zmm0, zmmword ptr [rsp + 136]
vaddps zmm0, zmm0, zmmword ptr [rsp + 392]
vmovups zmmword ptr [rdi + 128], zmm0
vmovups zmm0, zmmword ptr [rsp + 200]
vaddps zmm0, zmm0, zmmword ptr [rsp + 456]
vmovups zmmword ptr [rdi + 192], zmm0
vzeroupper
ret
Without SSE:
add(std::array<float, 64ul>, std::array<float, 64ul>):
mov rax, rdi
lea rcx, [rsp + 264]
lea rdx, [rsp + 8]
xor esi, esi
.LBB0_1:
fld dword ptr [rdx + 4*rsi]
fadd dword ptr [rcx + 4*rsi]
fstp dword ptr [rax + 4*rsi]
fld dword ptr [rdx + 4*rsi + 4]
fadd dword ptr [rcx + 4*rsi + 4]
fstp dword ptr [rax + 4*rsi + 4]
fld dword ptr [rdx + 4*rsi + 8]
fadd dword ptr [rcx + 4*rsi + 8]
fstp dword ptr [rax + 4*rsi + 8]
fld dword ptr [rdx + 4*rsi + 12]
fadd dword ptr [rcx + 4*rsi + 12]
fstp dword ptr [rax + 4*rsi + 12]
fld dword ptr [rdx + 4*rsi + 16]
fadd dword ptr [rcx + 4*rsi + 16]
fstp dword ptr [rax + 4*rsi + 16]
fld dword ptr [rdx + 4*rsi + 20]
fadd dword ptr [rcx + 4*rsi + 20]
fstp dword ptr [rax + 4*rsi + 20]
fld dword ptr [rdx + 4*rsi + 24]
fadd dword ptr [rcx + 4*rsi + 24]
fstp dword ptr [rax + 4*rsi + 24]
fld dword ptr [rdx + 4*rsi + 28]
fadd dword ptr [rcx + 4*rsi + 28]
fstp dword ptr [rax + 4*rsi + 28]
add rsi, 8
cmp rsi, 64
jne .LBB0_1
ret
You don't need to deeply understand what's going on here, but notice the instructions that begin with v in the SSE binary. Those are AVX instructions, and zmm0 is an AVX-512 register that can hold 16 floats (AVX-512 provides 512-bit registers; floats are 32 bits). LLVM takes advantage of this: instead of adding the numbers element by element (like we wrote in our original code), it does them 16 at a time. You see 4 variations of the following assembly one after the other (pay attention to the math inside the parentheses):
vmovups zmm0, zmmword ptr [rsp + (8 + 64*N)]
vaddps zmm0, zmm0, zmmword ptr [rsp + (8 + 4*64 + 64*N)]
vmovups zmmword ptr [rdi + (64*N)], zmm0
The math here requires a bit of knowledge about the System V calling ABI. Simply put, ignore the 8 +. [rsp + 64*N] gets you a[16*N] to a[16*(N+1)], exclusive. [rsp + (4*64 + 64*N)] skips all of a (a is 64 floats, each of size 4 bytes) and gets you b[16*N] to b[16*(N+1)], exclusive. And [rdi + (64*N)] is output[16*N] to output[16*(N+1)], exclusive. So this effectively translates to the following pseudocode:
std::array<float, 16> temp = {a[16*N], a[16*N+1], ..., a[16*N+15]};
temp += {b[16*N], b[16*N+1], ..., b[16*N+15]};
{output[16*N], output[16*N+1], ..., output[16*N+15]} = temp;
So indeed, we see that AVX-512 (a SIMD instruction-set extension) allows us to do the addition in chunks of 16 numbers at a time. Compare this quickly to the -mno-sse version; it should be clear that it's doing a lot more work. Again we have a pattern of instructions (although this time it's in a loop):
fld dword ptr [rdx + 4*rsi + 4*N]
fadd dword ptr [rcx + 4*rsi + 4*N]
fstp dword ptr [rax + 4*rsi + 4*N]
There are eight of these (with N ranging from 0 to 8, exclusive). This is wrapped in a loop which repeats 8 times (8 * 8 = 64, the array length). You should be able to guess what's going on here: it's very similar to the above, except we work on one number at a time instead of 16. fld is similar to vmovups, and fadd is similar to vaddps. The pseudocode for this would look more like the code we actually wrote:
float temp = a[loop_num*8 + N];
temp += b[loop_num*8 + N];
output[loop_num*8 + N] = temp;
Hopefully, it is intuitive that it will be much more efficient to do things 16 at a time than 1 at a time.
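Just to spell the 16-at-a-time idea out by hand, here is a minimal sketch written with AVX-512 intrinsics (this assumes an AVX-512-capable CPU; it's simply a hand-written equivalent of the loop LLVM vectorized for us above, not something you need to write yourself):
// Compile with: clang++ -O3 -mavx512f -std=c++17 simd_add.cpp
#include <immintrin.h>
#include <array>
std::array<float, 64> add(const std::array<float, 64>& a,
                          const std::array<float, 64>& b) {
    std::array<float, 64> output;
    for (size_t i = 0; i < 64; i += 16) {
        __m512 va = _mm512_loadu_ps(a.data() + i);   // load 16 floats from a
        __m512 vb = _mm512_loadu_ps(b.data() + i);   // load 16 floats from b
        __m512 sum = _mm512_add_ps(va, vb);          // 16 additions in a single instruction
        _mm512_storeu_ps(output.data() + i, sum);    // store 16 results into output
    }
    return output;
}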
There are also fancy linear algebra frameworks like BLAS, which can squeeze just about all the performance you can get out of a CPU when it comes to math.
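As a sketch of what using one looks like, the same 64-element addition can be handed to the BLAS saxpy routine (saxpy computes y = alpha*x + y). On macOS the CBLAS interface ships with Apple's Accelerate framework; on Linux you'd include cblas.h and link a BLAS such as OpenBLAS instead:
// Compile on macOS with: clang++ -O3 blas_add.cpp -framework Accelerate
#include <Accelerate/Accelerate.h>  // provides cblas_saxpy on macOS
#include <array>
std::array<float, 64> add(const std::array<float, 64>& a, std::array<float, 64> b) {
    // b = 1.0f * a + b, i.e. the elementwise sum, in a single library call
    cblas_saxpy(64, 1.0f, a.data(), 1, b.data(), 1);
    return b;
}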
GPUs work a bit differently. A gross simplification would be to think of a GPU as a device with huge SIMD instructions (particularly suited for floating-point operations). So instead of working 16 numbers at a time, imagine handing it an entire image and having it apply a pixel filter (like changing the brightness or saturation) in one operation.
So what does that tangent have to do with anything?
AVX instructions make it somewhat reasonable to run some code on the CPU. All the options you see with _cpu in them will only run on the CPU. llvm_cpu will likely use techniques similar to those above (clang uses LLVM behind the scenes) to compile all of the math necessary to run/train your ML models. Given that modern CPUs are multicore, this can be as much as a 16 * number_of_cores speedup.
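To make that 16 * number_of_cores figure concrete, here's a rough sketch (my own illustration, not PlaidML's actual implementation) of splitting an elementwise addition across cores, where the compiler can vectorize each thread's slice just like before:
// Compile with: clang++ -O3 -march=native -std=c++17 threaded_add.cpp
#include <cstddef>
#include <thread>
#include <vector>
// Each thread runs a plain loop over its own slice; with -O3 -march=native,
// clang can vectorize this inner loop with AVX, so every core does many floats per instruction.
void add_slice(const float* a, const float* b, float* out, std::size_t begin, std::size_t end) {
    for (std::size_t i = begin; i < end; i++) {
        out[i] = a[i] + b[i];
    }
}
void parallel_add(const std::vector<float>& a, const std::vector<float>& b,
                  std::vector<float>& out) {
    std::size_t num_threads = std::thread::hardware_concurrency();
    if (num_threads == 0) num_threads = 1;  // hardware_concurrency() may report 0
    std::vector<std::thread> threads;
    std::size_t chunk = a.size() / num_threads;
    for (std::size_t t = 0; t < num_threads; t++) {
        std::size_t begin = t * chunk;
        std::size_t end = (t + 1 == num_threads) ? a.size() : begin + chunk;
        threads.emplace_back(add_slice, a.data(), b.data(), out.data(), begin, end);
    }
    for (auto& th : threads) th.join();  // wait for all cores to finish their chunks
}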
OpenCL is an open standard for writing math computations and easily running them on various hardware (including GPUs). OpenCL can also be emulated by CPUs (admittedly at a much slower rate; remember, CPUs can only do 16 numbers at a time, while GPUs can do many more).
Metal is Apple's replacement for OpenGL/CL. It accomplishes similar things, but is macOS specific (and closed source).
The only difference left to comment on is "Intel(R) HD Graphics 630" vs. "AMD Radeon 460." Your computer has two GPUs. The first is an integrated graphics card; "integrated" here means that your Intel CPU has a little GPU embedded inside of it. It isn't as performant as a discrete GPU (one that's separate from the CPU, often found in card form factors for desktops), but it gets the job done for certain less intensive graphics tasks (and is typically more power efficient). Your AMD Radeon 460 is a discrete GPU. It will likely be the most powerful piece of hardware you have for this task.
So with that in mind, I predict the devices will be, fastest to slowest:
1. metal_amd_radeon_pro_460.0 - Discrete GPUs are fast, and Apple has optimized Metal to work very well on new Macs
2. opencl_amd_amd_radeon_pro_555_compute_engine.0 - This still uses the discrete GPU, but OpenCL has been neglected a bit and is now deprecated on macOS, so it likely won't be as fast
3. metal_intel(r)_hd_graphics_unknown.0 - Integrated GPUs are better than CPUs, and Apple has optimized Metal
4. opencl_intel_intel(r)_hd_graphics_630.0 - Ditto regarding the other OpenCL option (except this is an integrated, not a discrete, GPU)
5. llvm_cpu.0 - This uses the CPU, but LLVM is pretty good at writing efficient SIMD code.
6. opencl_cpu.0 - This emulates (2) and (4), except using your CPU, which will be much slower. Additionally, it likely doesn't have all the fancy algorithms LLVM uses to output efficient SIMD code.
But all this is speculation; you can test it with pip install plaidbench plaidml-keras keras. For each device, run plaidml-setup (selecting that device) and then run plaidbench keras mobilenet (or any of the other benchmarks). Here are the results I see on my machine:
| device | execution time (s) | fps | correctness |
|------------------------------|---------------|--------|-------------|
| Metal AMD Radeon Pro 560 | 9.009 | 112.53 | PASS |
| OpenCL AMD Radeon Pro 560 | 18.339 | 93.29 | PASS |
| OpenCL Intel HD Graphics 630 | 23.204 | 60.18 | FAIL |
| Metal Intel HD Graphics 630 | 24.809 | 41.27 | PASS |
| LLVM CPU | 66.072 | 16.82 | PASS |
| OpenCL CPU Emulation | 155.639 | 6.71 | FAIL |
I've renamed the devices to have prettier names, but their mapping to the identifiers should be obvious.
Execution time is the time it took to run the model (lower is better), and FPS is the frames per second the execution achieved (higher is better).
We note that the order is generally what we expected: the discrete GPU is faster than the integrated GPU, which is faster than the CPU. An important thing to call out is that OpenCL on the integrated GPU and the OpenCL CPU emulation failed the correctness check. The CPU emulation was only off by about 7%, but the integrated GPU was off by about 77%. You probably only want to choose a device that passes the correctness check on your machine (it's possible, but not guaranteed, that the backend or device itself is buggy if it fails that check).
tl;dr Use metal + discrete GPU (AMD Radeon). It is the fastest device you have available. Using anything CPU-based will only spin up your fans and consume a ton of power (and take forever to finish/train).