
CPU: i7-9750H @ 2.6 GHz (with 16 GB DDR4 RAM); GPU: Nvidia GeForce GTX 1660 Ti (6 GB); OS: Windows 10, 64-bit

I wanted to see how fast the GPU is at basic matrix operations compared with the CPU, and I basically followed this tutorial: https://towardsdatascience.com/heres-how-to-use-cupy-to-make-numpy-700x-faster-4b920dda1f56. The following is my very simple code:

import numpy as np
import cupy as cp
import time

### Numpy and CPU
s = time.time()
A = np.random.random([10000,10000]); B = np.random.random([10000,10000])
CPU = np.matmul(A,B); CPU *= 5
e = time.time()
print(f'CPU time: {e - s: .2f}')

### CuPy and GPU
s = time.time()
C = cp.random.random([10000,10000]); D = cp.random.random([10000,10000])
GPU = cp.matmul(C,D); GPU *= 5
cp.cuda.Stream.null.synchronize()  # let the code finish executing on the GPU before measuring the time
e = time.time()
print(f'GPU time: {e - s: .2f}')

To my surprise, it shows: CPU time: 11.74, GPU time: 12.56

This really confuses me. How can the GPU be even slower than the CPU on large matrix operations? Note that I have not even applied parallel computing explicitly (I am a beginner and I am not sure whether the system enables it for me automatically). I did check similar questions such as "Why is my CPU doing matrix operations faster than GPU instead?", but there the asker used MXNet, whereas here I am using CuPy, which is newer and designed specifically for GPU computing.

Can someone help? I would really appreciate it!

  • I suspect the random operations; they can be a bottleneck. – Klaus D. Oct 18 '20 at 04:44
  • @Klaus D. Well, you can copy and paste the code, run it on your computer, and see the result; I guess it would be similar. I don't know whether it's because my GPU memory is too small (only 6 GB; by contrast, the RAM is 16 GB DDR4). I am just very confused about the results in such an extremely simple example. – QuestionStudent Oct 18 '20 at 04:55
  • Also be aware that when doing the first GPU computation in a process (`C = cp.random.random([10000,10000])` in your example), CUDA context initialization happens, which may take several seconds (see the warm-up sketch right after these comments). – kmaehashi Oct 18 '20 at 14:26
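Following up on that comment, here is a minimal sketch of how the one-time CUDA context initialization could be kept out of the measurement; the small warm-up matmul and its size are illustrative choices, not something from the original post:

import time
import cupy as cp

# Warm-up: the first CuPy call in a process creates the CUDA context,
# which can take several seconds and should not be counted in the benchmark.
_ = cp.matmul(cp.random.random([16, 16]), cp.random.random([16, 16]))
cp.cuda.Stream.null.synchronize()

# Time only the actual work.
s = time.time()
C = cp.random.random([10000, 10000])
D = cp.random.random([10000, 10000])
GPU = cp.matmul(C, D)
GPU *= 5
cp.cuda.Stream.null.synchronize()  # wait for the GPU to finish before reading the clock
e = time.time()
print(f'GPU time (after warm-up): {e - s: .2f}')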

1 Answer


Both `np.random.random` and `cp.random.random` generate 64-bit floats (double precision) by default, and consumer GeForce cards such as the GTX 1660 Ti have very limited double-precision throughput compared to single precision, which is why the FP64 matmul ends up about as slow on the GPU as on the CPU. To see the GPU's advantage, switch the GPU random number generation to single precision like this:

C = cp.random.random([10000,10000], dtype=cp.float32)
D = cp.random.random([10000,10000], dtype=cp.float32)

I have different hardware (both CPU and GPU) than you, but once this change is made the GPU version is about 12x faster than the CPU version. Generating both arrays of random numbers, the matrix multiplication, and the scalar multiplication take less than one second in total with CuPy.
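For reference, a minimal end-to-end sketch of the benchmark with this change applied; the warm-up step and the float32 cast on the NumPy side are added here so that both sides compute in single precision and the CUDA context setup is excluded — they are assumptions, not part of the original code:

import time
import numpy as np
import cupy as cp

N = 10000

# Warm up the GPU so CUDA context creation is not counted in the timing.
_ = cp.matmul(cp.random.random([8, 8], dtype=cp.float32),
              cp.random.random([8, 8], dtype=cp.float32))
cp.cuda.Stream.null.synchronize()

### NumPy and CPU (single precision)
s = time.time()
A = np.random.random([N, N]).astype(np.float32)
B = np.random.random([N, N]).astype(np.float32)
CPU = np.matmul(A, B); CPU *= 5
print(f'CPU time: {time.time() - s: .2f}')

### CuPy and GPU (single precision)
s = time.time()
C = cp.random.random([N, N], dtype=cp.float32)
D = cp.random.random([N, N], dtype=cp.float32)
GPU = cp.matmul(C, D); GPU *= 5
cp.cuda.Stream.null.synchronize()  # wait for the GPU to finish
print(f'GPU time: {time.time() - s: .2f}')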

– Stripedbass
  • That makes sense. I modified it and executed it on the same computer. The new results are: **CPU time: 12.27, GPU time: 1.09**. Thank you so much! – QuestionStudent Oct 19 '20 at 03:48