I'm using gnumpy to speed up some computations in training a neural network by running them on the GPU. I'm getting the desired speed-up, but I'm a little worried about the differences between the results of numpy (CPU) and gnumpy (GPU).
I have the following test script to illustrate the problem:
import gnumpy as gpu
import numpy as np
n = 400
a = np.random.uniform(low=0., high=1., size=(n, n)).astype(np.float32)
b = np.random.uniform(low=0., high=1., size=(n, n)).astype(np.float32)
ga = gpu.garray(a)
gb = gpu.garray(b)
ga = ga.dot(gb)
a = a.dot(b)
print ga.as_numpy_array(dtype=np.float32) - a
which provides the output:
[[ 1.52587891e-05 -2.28881836e-05 2.28881836e-05 ..., -1.52587891e-05
3.81469727e-05 1.52587891e-05]
[ -5.34057617e-05 -1.52587891e-05 0.00000000e+00 ..., 1.52587891e-05
0.00000000e+00 1.52587891e-05]
[ -1.52587891e-05 -2.28881836e-05 5.34057617e-05 ..., 2.28881836e-05
0.00000000e+00 -7.62939453e-06]
...,
[ 0.00000000e+00 1.52587891e-05 3.81469727e-05 ..., 3.05175781e-05
0.00000000e+00 -2.28881836e-05]
[ 7.62939453e-06 -7.62939453e-06 -2.28881836e-05 ..., 1.52587891e-05
7.62939453e-06 1.52587891e-05]
[ 1.52587891e-05 7.62939453e-06 2.28881836e-05 ..., -1.52587891e-05
7.62939453e-06 3.05175781e-05]]
As you can see, the differences are on the order of 10^-5.
So the question is: should I be worried about these differences, or is this expected behaviour?
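Just to put the 10^-5 figure in perspective (a back-of-the-envelope check rather than a formal error bound): each entry of the product is a sum of 400 products of uniform(0, 1) numbers, so the entries sit around 100, where the spacing between adjacent float32 values is already close to 10^-5.
import numpy as np

# spacing of float32 values around 100 (the magnitude of the result entries)
print np.spacing(np.float32(100))    # 7.62939453e-06 == 2**-17
# every difference printed above is a small multiple of this value,
# i.e. the CPU and GPU results disagree by only a few ULPs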
Additional information:
- GPU: GeForce GTX 770
- numpy version: 1.6.1
I noticed the problem when I used gradient checking (with finite-difference approximation) to verify that the small modifications I made to switch from numpy to gnumpy didn't break anything. As one might expect, gradient checking did not work with 32-bit precision (gnumpy does not support float64), but to my surprise the errors differed between CPU and GPU at the same precision.
The errors on CPU and GPU on a small test neural network are given below:
Since the error magnitudes are similar, I guess that these differences are OK?
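For context, the check itself is the standard central-difference comparison of numerical and analytic gradients, run entirely in float32. A minimal sketch (gradient_check, f, grad_f and w are illustrative placeholders for the network's loss, its analytic gradient and the flattened weights, not my actual code):
import numpy as np

def gradient_check(f, grad_f, w, eps=1e-2):
    # central differences in float32; eps has to be fairly large because
    # float32 only carries about 7 significant digits
    num_grad = np.zeros_like(w)
    for i in range(w.size):
        w_plus, w_minus = w.copy(), w.copy()
        w_plus[i] += eps
        w_minus[i] -= eps
        num_grad[i] = (f(w_plus) - f(w_minus)) / (2 * eps)
    ana_grad = grad_f(w)
    denom = np.maximum(np.abs(num_grad) + np.abs(ana_grad), 1e-8)
    return np.max(np.abs(num_grad - ana_grad) / denom)

# toy usage: f(w) = 0.5 * ||w||^2 has gradient w
w = np.random.uniform(size=10).astype(np.float32)
print gradient_check(lambda v: 0.5 * np.dot(v, v), lambda v: v, w)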
After reading the article referenced in BenC's comment, I'm quite sure that the differences can mostly be explained by one of the devices using the fused multiply-add (FMA) instruction and the other not.
I implemented the example from the paper:
import gnumpy as gpu
import numpy as np
a = np.array([1.907607, -.7862027, 1.147311, .9604002], dtype=np.float32)
b = np.array([-.9355000, -.6915108, 1.724470, -.7097529], dtype=np.float32)
ga = gpu.garray(a)
gb = gpu.garray(b)
ga = ga.dot(gb)
a = a.dot(b)
print "CPU", a
print "GPU", ga
print "DIFF", ga - a
>>>CPU 0.0559577
>>>GPU 0.0559577569366
>>>DIFF 8.19563865662e-08
...and the difference is similar in magnitude to the difference between the FMA and the serial algorithm in the paper (though for some reason both results differ from the exact result more than they do in the paper).
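To convince myself that this mechanism produces differences of this size, here is a rough emulation of the two accumulation schemes in plain NumPy (it only imitates an FMA by doing the multiply-add in float64 and rounding once per step; it is not literally what either device executes):
import numpy as np

a = np.array([1.907607, -.7862027, 1.147311, .9604002], dtype=np.float32)
b = np.array([-.9355000, -.6915108, 1.724470, -.7097529], dtype=np.float32)

# serial float32: round after every multiply and after every add
serial = np.float32(0)
for x, y in zip(a, b):
    serial = np.float32(serial + x * y)

# emulated FMA: multiply-add done in float64, rounded to float32 once per step
fma = np.float32(0)
for x, y in zip(a, b):
    fma = np.float32(np.float64(x) * np.float64(y) + np.float64(fma))

print "serial", serial
print "fma   ", fma
print "float64 reference", np.dot(a.astype(np.float64), b.astype(np.float64))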
The GPU I'm using (GeForce GTX 770) supports the FMA instruction while my CPU does not (it's an Ivy Bridge Intel Xeon E3-1225 V2; Intel only introduced the FMA3 instruction with Haswell).
Other possible explanations include the different math libraries used under the hood, or differences in the order of operations caused by, for example, the different levels of parallelization on the CPU vs the GPU.
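Operation order alone is enough to change float32 results. A toy illustration of that effect (this mimics reordered accumulation, not what any particular BLAS/CUBLAS routine actually does):
import numpy as np

np.random.seed(0)
x = np.random.uniform(size=10000).astype(np.float32)

# strictly sequential float32 accumulation
s_serial = np.float32(0)
for v in x:
    s_serial = np.float32(s_serial + v)

# blocked accumulation: partial sums per block, then a sum of the partials
s_blocked = x.reshape(100, 100).sum(axis=1).sum()

print s_serial, s_blocked, s_serial - s_blocked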