
CPU: i7-9750H @ 2.6 GHz (with 16 GB DDR4 RAM); GPU: Nvidia GeForce GTX 1660 Ti (6 GB); OS: Windows 10, 64-bit

I wanted to see how fast the GPU is at basic matrix operations compared with the CPU, and I basically followed this tutorial: https://towardsdatascience.com/heres-how-to-use-cupy-to-make-numpy-700x-faster-4b920dda1f56. The following is my very simple code:

import numpy as np
import cupy as cp
import time

### Numpy and CPU
s = time.time()
A = np.random.random([10000,10000]); B = np.random.random([10000,10000])
CPU = np.matmul(A,B); CPU *= 5
e = time.time()
print(f'CPU time: {e - s: .2f}')

### CuPy and GPU
s = time.time()
C = cp.random.random([10000,10000]); D = cp.random.random([10000,10000])
GPU = cp.matmul(C,D); GPU *= 5
cp.cuda.Stream.null.synchronize()  # let the code finish executing on the GPU before measuring the time
e = time.time()
print(f'GPU time: {e - s: .2f}')

To my surprise, it shows: CPU time: 11.74, GPU time: 12.56

This really confuses me. How can the GPU be even slower than the CPU on large matrix operations? Note that I have not even applied parallel computing explicitly (I am a beginner and I am not sure whether the system enables it for me automatically). I did check similar questions such as "Why is my CPU doing matrix operations faster than GPU instead?", but there the asker used MXNet, whereas here I am using CuPy, which is newer and designed specifically for GPU computing.

Can someone help? I would really appreciate it!

  • I suspect the random operations; they can be a bottleneck. – Klaus D. Oct 18 '20 at 04:44
  • @Klaus D. Well, you can copy and paste the code, run it on your computer, and see the result; I guess it would be similar. I don't know whether it's because my GPU memory is too small (only 6 GB; by contrast, the RAM is 16 GB DDR4). I am just very confused about the results in such an extremely simple example. – QuestionStudent Oct 18 '20 at 04:55
  • Also be aware that when doing the first GPU computation in a process (`C = cp.random.random([10000,10000])` in your example), CUDA context initialization happens, which may take several seconds (see the warm-up sketch right after these comments). – kmaehashi Oct 18 '20 at 14:26
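Following up on that comment, here is a minimal sketch of how the one-time CUDA context initialization could be kept out of the measurement; the small warm-up matmul and its size are illustrative choices, not something from the original post:

import time
import cupy as cp

# Warm-up: the first CuPy call in a process creates the CUDA context,
# which can take several seconds and should not be counted in the benchmark.
_ = cp.matmul(cp.random.random([16, 16]), cp.random.random([16, 16]))
cp.cuda.Stream.null.synchronize()

# Time only the actual work.
s = time.time()
C = cp.random.random([10000, 10000])
D = cp.random.random([10000, 10000])
GPU = cp.matmul(C, D)
GPU *= 5
cp.cuda.Stream.null.synchronize()  # wait for the GPU to finish before reading the clock
e = time.time()
print(f'GPU time (after warm-up): {e - s: .2f}')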

1 Answer


Both `np.random.random` and `cp.random.random` generate 64-bit floats (double precision) by default, and consumer GeForce cards such as the GTX 1660 Ti have very limited double-precision throughput compared to single precision, which is why the FP64 matmul ends up about as slow on the GPU as on the CPU. To see the GPU's advantage, switch the GPU random number generation to single precision like this:

C = cp.random.random([10000,10000], dtype=cp.float32)
D = cp.random.random([10000,10000], dtype=cp.float32)

I have different hardware (both CPU and GPU) than you, but once this change is made the GPU version is about 12x faster than the CPU version. Generating both arrays of random numbers, the matrix multiplication, and the scalar multiplication take less than one second in total with CuPy.
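For reference, a minimal end-to-end sketch of the benchmark with this change applied; the warm-up step and the float32 cast on the NumPy side are added here so that both sides compute in single precision and the CUDA context setup is excluded — they are assumptions, not part of the original code:

import time
import numpy as np
import cupy as cp

N = 10000

# Warm up the GPU so CUDA context creation is not counted in the timing.
_ = cp.matmul(cp.random.random([8, 8], dtype=cp.float32),
              cp.random.random([8, 8], dtype=cp.float32))
cp.cuda.Stream.null.synchronize()

### NumPy and CPU (single precision)
s = time.time()
A = np.random.random([N, N]).astype(np.float32)
B = np.random.random([N, N]).astype(np.float32)
CPU = np.matmul(A, B); CPU *= 5
print(f'CPU time: {time.time() - s: .2f}')

### CuPy and GPU (single precision)
s = time.time()
C = cp.random.random([N, N], dtype=cp.float32)
D = cp.random.random([N, N], dtype=cp.float32)
GPU = cp.matmul(C, D); GPU *= 5
cp.cuda.Stream.null.synchronize()  # wait for the GPU to finish
print(f'GPU time: {time.time() - s: .2f}')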

– Stripedbass
  • That makes sense. I modified it and executed it on the same computer. The new results are: **CPU time: 12.27, GPU time: 1.09**. Thank you so much! – QuestionStudent Oct 19 '20 at 03:48