I'm using gnumpy to speed up some computations in training a neural network by running them on the GPU. I'm getting the desired speed-up, but I'm a little worried about the differences between the results of numpy (CPU) and gnumpy (GPU).
I have the following test script to illustrate the problem:
import gnumpy as gpu
import numpy as np
n = 400
a = np.random.uniform(low=0., high=1., size=(n, n)).astype(np.float32)
b = np.random.uniform(low=0., high=1., size=(n, n)).astype(np.float32)
ga = gpu.garray(a)
gb = gpu.garray(b)
ga = ga.dot(gb)
a = a.dot(b)
print ga.as_numpy_array(dtype=np.float32) - a
which provides the output:
[[ 1.52587891e-05 -2.28881836e-05 2.28881836e-05 ..., -1.52587891e-05
3.81469727e-05 1.52587891e-05]
[ -5.34057617e-05 -1.52587891e-05 0.00000000e+00 ..., 1.52587891e-05
0.00000000e+00 1.52587891e-05]
[ -1.52587891e-05 -2.28881836e-05 5.34057617e-05 ..., 2.28881836e-05
0.00000000e+00 -7.62939453e-06]
...,
[ 0.00000000e+00 1.52587891e-05 3.81469727e-05 ..., 3.05175781e-05
0.00000000e+00 -2.28881836e-05]
[ 7.62939453e-06 -7.62939453e-06 -2.28881836e-05 ..., 1.52587891e-05
7.62939453e-06 1.52587891e-05]
[ 1.52587891e-05 7.62939453e-06 2.28881836e-05 ..., -1.52587891e-05
7.62939453e-06 3.05175781e-05]]
As you can see, the differences are on the order of 10^-5.
So the question is: should I be worried about these differences, or is this expected behaviour?
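Just to put the 10^-5 figure in perspective (a back-of-the-envelope check rather than a formal error bound): each entry of the product is a sum of 400 products of uniform(0, 1) numbers, so the entries sit around 100, where the spacing between adjacent float32 values is already close to 10^-5.
import numpy as np

# spacing of float32 values around 100 (the magnitude of the result entries)
print np.spacing(np.float32(100))    # 7.62939453e-06 == 2**-17
# every difference printed above is a small multiple of this value,
# i.e. the CPU and GPU results disagree by only a few ULPs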
Additional information:
- GPU: GeForce GTX 770
- numpy version: 1.6.1
I noticed the problem when I used gradient checking (with finite-difference approximation) to verify that the small modifications I made to switch from numpy to gnumpy didn't break anything. As one might expect, gradient checking did not work with 32-bit precision (gnumpy does not support float64), but to my surprise the errors differed between CPU and GPU at the same precision.
The errors on CPU and GPU on a small test neural network are given below:
Since the error magnitudes are similar, I guess that these differences are OK?
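For context, the check itself is the standard central-difference comparison of numerical and analytic gradients, run entirely in float32. A minimal sketch (gradient_check, f, grad_f and w are illustrative placeholders for the network's loss, its analytic gradient and the flattened weights, not my actual code):
import numpy as np

def gradient_check(f, grad_f, w, eps=1e-2):
    # central differences in float32; eps has to be fairly large because
    # float32 only carries about 7 significant digits
    num_grad = np.zeros_like(w)
    for i in range(w.size):
        w_plus, w_minus = w.copy(), w.copy()
        w_plus[i] += eps
        w_minus[i] -= eps
        num_grad[i] = (f(w_plus) - f(w_minus)) / (2 * eps)
    ana_grad = grad_f(w)
    denom = np.maximum(np.abs(num_grad) + np.abs(ana_grad), 1e-8)
    return np.max(np.abs(num_grad - ana_grad) / denom)

# toy usage: f(w) = 0.5 * ||w||^2 has gradient w
w = np.random.uniform(size=10).astype(np.float32)
print gradient_check(lambda v: 0.5 * np.dot(v, v), lambda v: v, w)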
After reading the article referenced in BenC's comment, I'm quite sure that the differences can mostly be explained by one of the devices using the fused multiply-add (FMA) instruction and the other not.
I implemented the example from the paper:
import gnumpy as gpu
import numpy as np
a = np.array([1.907607, -.7862027, 1.147311, .9604002], dtype=np.float32)
b = np.array([-.9355000, -.6915108, 1.724470, -.7097529], dtype=np.float32)
ga = gpu.garray(a)
gb = gpu.garray(b)
ga = ga.dot(gb)
a = a.dot(b)
print "CPU", a
print "GPU", ga
print "DIFF", ga - a
>>>CPU 0.0559577
>>>GPU 0.0559577569366
>>>DIFF 8.19563865662e-08
...and the difference is similar in magnitude to the difference between the FMA and the serial algorithm in the paper (though for some reason both results differ from the exact result more than they do in the paper).
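To convince myself that this mechanism produces differences of this size, here is a rough emulation of the two accumulation schemes in plain NumPy (it only imitates an FMA by doing the multiply-add in float64 and rounding once per step; it is not literally what either device executes):
import numpy as np

a = np.array([1.907607, -.7862027, 1.147311, .9604002], dtype=np.float32)
b = np.array([-.9355000, -.6915108, 1.724470, -.7097529], dtype=np.float32)

# serial float32: round after every multiply and after every add
serial = np.float32(0)
for x, y in zip(a, b):
    serial = np.float32(serial + x * y)

# emulated FMA: multiply-add done in float64, rounded to float32 once per step
fma = np.float32(0)
for x, y in zip(a, b):
    fma = np.float32(np.float64(x) * np.float64(y) + np.float64(fma))

print "serial", serial
print "fma   ", fma
print "float64 reference", np.dot(a.astype(np.float64), b.astype(np.float64))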
The GPU I'm using (GeForce GTX 770) supports the FMA instruction while my CPU does not (it's an Ivy Bridge Intel Xeon E3-1225 V2; Intel only introduced the FMA3 instruction with Haswell).
Other possible explanations include the different math libraries used under the hood, or differences in the order of operations caused by, for example, the different levels of parallelization on the CPU vs the GPU.
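Operation order alone is enough to change float32 results. A toy illustration of that effect (this mimics reordered accumulation, not what any particular BLAS/CUBLAS routine actually does):
import numpy as np

np.random.seed(0)
x = np.random.uniform(size=10000).astype(np.float32)

# strictly sequential float32 accumulation
s_serial = np.float32(0)
for v in x:
    s_serial = np.float32(s_serial + v)

# blocked accumulation: partial sums per block, then a sum of the partials
s_blocked = x.reshape(100, 100).sum(axis=1).sum()

print s_serial, s_blocked, s_serial - s_blocked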