
Why does Anaconda Accelerate compute dot products slower than plain NumPy on Python 3? I'm using Accelerate 2.3.1 with accelerate_cudalib 2.0 installed, on Python 3.5.2, Windows 10 64-bit.

import numpy as np
from accelerate.cuda.blas import dot as gpu_dot
import time

def numpydot():
    # Time 100 dot products on the CPU; the arrays are deliberately
    # created inside the loop (see the comment discussion below).
    start = time.time()
    for i in range(100):
        np.dot(np.arange(1000000, dtype=np.float64),
               np.arange(1000000, dtype=np.float64))
    elapsedtime = time.time() - start
    return elapsedtime

def acceleratedot():
    # Same workload routed through Accelerate's cuBLAS dot on the GPU.
    start = time.time()
    for i in range(100):
        gpu_dot(np.arange(1000000, dtype=np.float64),
                np.arange(1000000, dtype=np.float64))
    elapsedtime = time.time() - start
    return elapsedtime


>>> numpydot()
0.6446375846862793
>>> acceleratedot()
1.33168363571167
  • Has Anaconda Accelerate ever claimed to be faster than the BLAS dot product (the one NumPy uses)? – Divakar May 27 '17 at 19:52
  • Anaconda Accelerate supposedly uses cuBLAS, which runs on the GPU; NumPy only uses the CPU (and the developers of NumPy maintain Accelerate) – Default picture May 27 '17 at 19:55
  • 1) Use `timeit`, `time` isn't accurate enough to get sensible timings. 2) Create the arrays outside of the loop, otherwise you'll include the array-creation time in the timings. 3) Try with 2D or ND arrays; it's likely that the vector dot product isn't optimal for GPU processing. 4) Are you sure that "the developers of NumPy maintain Accelerate"? 5) GPUs generally perform best with `float32`; that might not be true for all GPUs, but you could give it a try. – MSeifert May 27 '17 at 20:34
  • A scalar (`dot`) product is a memory-bound problem, and you are transferring data from host to device multiple times, so those results are perfectly reasonable. Also, unless you have a recent-ish Tesla GPU lying around, `float64` will offer 1/3rd to 1/8th of the `float32` performance on most GPUs. – romeric May 27 '17 at 21:00
  • @MSeifert 1) time() is accurate to the millisecond; the two time deltas were 650 ms apart. 2) I created the arrays inside the loop to give Accelerate a fighting chance; when creating them outside the loop, Accelerate is 100x slower than NumPy. 3) The vector dot product is the only dot product supported by Accelerate; they don't support matrix dot products (which is what I wanted anyway). 4) Who cares? 5) Accelerate is still ~1.5x slower with float32. – Default picture May 27 '17 at 21:20
  • @romeric Anaconda Accelerate doesn't seem to have an option to create shared arrays in GPU memory, so it's unclear why they'd support a dot product at all. NumbaPro had shared arrays, but that code no longer works in Accelerate. – Default picture May 27 '17 at 21:25
  • @JosephValles 1) `timeit` performs several repeats and disables garbage collection. That way you get **accurate** timings (timings that measure the function execution time, not garbage collection or background processes), especially if the code needs warmup. `time` is just the wrong tool for measuring execution times. 2) Exactly, it would be complicated to analyze the performance if you include something **that slow**. 3) Okay, didn't know that. 5) Without array creation and with `timeit`? Otherwise it's just too convoluted to make comments on the `dot` performance. – MSeifert May 27 '17 at 21:28
  • @MSeifert 1) I ran the code myself in an interactive interpreter and got the same results every time. With a big enough array, Accelerate is noticeably slower at computing single dot products. If it were garbage collection, then it wouldn't be suitable for production use either. 2) time() is more accurate for slower processes; that doesn't even make sense. 3) Without array creation, Accelerate is 100x slower, as I already said. I'll post the timeit results after reinstalling Accelerate; it's giving me errors now. – Default picture May 27 '17 at 21:55
  • @MSeifert Using timeit with array creation only in the setup, `gpu_dot` from Accelerate takes `0.3505513576283502` versus NumPy's `0.0023297911862220633`. NumPy is over 100x faster, just as I found with time(). – Default picture May 27 '17 at 22:17
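
For reference, here is a minimal sketch of the kind of timeit benchmark discussed in the comments above, with the arrays created once in the setup string so only the dot product itself is timed. It assumes the same `gpu_dot` import as in the question:

import timeit

setup = """
import numpy as np
from accelerate.cuda.blas import dot as gpu_dot
x = np.arange(1000000, dtype=np.float64)
y = np.arange(1000000, dtype=np.float64)
"""

# timeit.repeat disables garbage collection and runs several repeats;
# taking the minimum gives the least-noisy estimate for 100 calls.
cpu_time = min(timeit.repeat("np.dot(x, y)", setup=setup, number=100, repeat=3))
gpu_time = min(timeit.repeat("gpu_dot(x, y)", setup=setup, number=100, repeat=3))
print("NumPy:     ", cpu_time)
print("Accelerate:", gpu_time)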

1 Answer


I figured out that shared (GPU-resident) arrays are created with Numba, a separate library. The documentation is on their site.
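
As a rough sketch of that idea (assuming the cuBLAS wrapper accepts Numba device arrays, as the NumbaPro-era API did; `gpu_dot` is the same import as in the question), the arrays can be copied to the GPU once and reused, avoiding the per-call host-to-device transfers that romeric pointed out:

import numpy as np
from numba import cuda
from accelerate.cuda.blas import dot as gpu_dot

x = np.arange(1000000, dtype=np.float64)
y = np.arange(1000000, dtype=np.float64)

# Copy the data to GPU memory once; d_x and d_y stay resident on the device.
d_x = cuda.to_device(x)
d_y = cuda.to_device(y)

# Repeated calls now reuse the device arrays instead of re-transferring
# the 8 MB inputs from the host on every iteration.
for i in range(100):
    result = gpu_dot(d_x, d_y)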
