I compute a simple matrix multiplication with the following script:

import numpy as np
from timeit import default_timer as timer
from numba import autojit

@autojit
def mult2(a,b):
    return a*b
@autojit
def mult_cpu(a,b,c):
    Ni=c.shape[0]
    Nj=c.shape[1]
    Nk=c.shape[2]
    for i in range(Ni):
        for j in range(Nj):
            for k in range(Nk):
                c[i,j,k]=mult2(a[i,k],b[j,k])

dimx=20
dimy=3072
dimz=50000

print "\ntest1"
A=np.ones((dimx,dimz),dtype=np.float32)
B=np.ones((dimy,dimz),dtype=np.float32)
C=np.ones((dimx,dimy,dimz),dtype=np.float32)
print A.shape,A.dtype
print B.shape,B.dtype
print C.shape,C.dtype
start=timer()
mult_cpu(A,B,C)
dt=timer()-start    
print "Computation autojit done in %f s"%(dt)
print 'C[:3,1,1] = ',C[:3,1,1]
print 'C[-3:,1,1] = ',C[-3:,1,1]
del A
del B
del C
del start
del dt


print "\ntest2"
A=np.zeros((dimx,dimz),dtype=np.float32)
B=np.zeros((dimy,dimz),dtype=np.float32)
C=np.zeros((dimx,dimy,dimz),dtype=np.float32)
print A.shape,A.dtype
print B.shape,B.dtype
print C.shape,C.dtype
start=timer()
mult_cpu(A,B,C)
dt=timer()-start    
print "Computation autojit done in %f s"%(dt)
print 'C[:3,1,1] = ',C[:3,1,1]
print 'C[-3:,1,1] = ',C[-3:,1,1]
del A
del B
del C
del start
del dt


print "\ntest3"
A=0.0001*np.random.randn(dimx,dimz).astype(np.float32)
B=0.0001*np.random.randn(dimy,dimz).astype(np.float32)
C=0.0001*np.random.randn(dimx,dimy,dimz).astype(np.float32)
print A.shape,A.dtype
print B.shape,B.dtype
print C.shape,C.dtype
start=timer()
mult_cpu(A,B,C)
dt=timer()-start    
print "Computation autojit done in %f s"%(dt)
print 'C[:3,1,1] = ',C[:3,1,1]
print 'C[-3:,1,1] = ',C[-3:,1,1]

The three tests are identical except for the initialization of A, B and C. The output is:

test1
(20, 50000) float32
(3072, 50000) float32
(20, 3072, 50000) float32
Computation autojit done in 4.485923 s
C[:3,1,1] =  [ 1.  1.  1.]
C[-3:,1,1] =  [ 1.  1.  1.]

test2
(20, 50000) float32
(3072, 50000) float32
(20, 3072, 50000) float32
Computation autojit done in 7.031277 s
C[:3,1,1] =  [ 0.  0.  0.]
C[-3:,1,1] =  [ 0.  0.  0.]

test3
(20, 50000) float32
(3072, 50000) float32
(20, 3072, 50000) float32
Computation autojit done in 45.372899 s
C[:3,1,1] =  [ -3.09475023e-09   4.71271910e-09   2.36787634e-09]
C[-3:,1,1] =  [ -7.29189642e-09  -3.03451442e-09   1.95249439e-09]

So the matrix multiplication is faster with an np.ones initialization than with np.zeros, and much slower still with a random initialization. How can this behaviour be explained?

Without the @autojit optimization the computational times are nearly equal.
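For comparison, the same triple loop can be sketched in pure Python (no NumPy, no numba; toy sizes and Python 3 print syntax, so the absolute timings are not comparable to the ones above). Since the interpreter performs no value-dependent optimization, the three fills take essentially the same time, which matches the near-equal timings observed without @autojit:

```python
from timeit import default_timer as timer
import random

def mult_cpu(a, b, c):
    # Same access pattern as above: c[i][j][k] = a[i][k] * b[j][k]
    Ni, Nj, Nk = len(c), len(c[0]), len(c[0][0])
    for i in range(Ni):
        for j in range(Nj):
            for k in range(Nk):
                c[i][j][k] = a[i][k] * b[j][k]

def run(fill):
    dimx, dimy, dimz = 5, 8, 200      # toy sizes, not the dimensions above
    a = [[fill() for _ in range(dimz)] for _ in range(dimx)]
    b = [[fill() for _ in range(dimz)] for _ in range(dimy)]
    c = [[[0.0] * dimz for _ in range(dimy)] for _ in range(dimx)]
    start = timer()
    mult_cpu(a, b, c)
    return timer() - start, c

t_ones, c_ones = run(lambda: 1.0)
t_zeros, c_zeros = run(lambda: 0.0)
t_rand, c_rand = run(lambda: 0.0001 * random.gauss(0.0, 1.0))
# In CPython the three timings are typically of similar magnitude
print("ones  %.6f s" % t_ones)
print("zeros %.6f s" % t_zeros)
print("rand  %.6f s" % t_rand)
```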

Andy

1 Answer

The autojit compiler realizes you're multiplying by all 0s, removes the multiplication completely, and simply returns a matrix of all 0s. In the all-1s case it can skip the actual multiplication and do only the summation part of a matrix multiplication, which is slightly slower than just returning all 0s. The random case, finally, forces the compiler to actually perform the multiplication, since it cannot assume anything about the answer.

This is a case where the compiler is being smarter than you expect it to be.

Simba
  • Well, actually the 0s are slower than the 1s: returning a matrix of all 0s is slower than returning the summation. By your argument, that's a contradiction. – Andy May 13 '17 at 23:06
  • Ah, given you're doing 1s * 1s there is more room to optimize. Clearly the multiplication could be removed completely in both cases, so it's not totally clear what the compiler decided to do just by looking at the source; you'd have to look at the code it emitted to really get to the bottom of it. There may also be noise in the runs, especially if the timings change from run to run. In any case, the dramatic difference between the final run and the first two is due to the multiplication being factored out. – Simba May 13 '17 at 23:24