
I am doing data analysis in Python (Numpy) and R. My data is an array of shape 795067 × 3, and computing the mean, median, standard deviation, and IQR on it yields different results depending on whether I use Numpy or R. I cross-checked the values, and it looks like R gives the "correct" value.

Median: 
Numpy:14.948499999999999
R: 14.9632

Mean: 
Numpy: 13.097945407088607
R: 13.10936

Standard Deviation: 
Numpy: 7.3927612774052083
R: 7.390328

IQR: 
Numpy:12.358700000000002
R: 12.3468

Max and min of the data are the same on both platforms. I ran a quick test to better understand what is going on here.

  • Multiplying 1.2*1.2 in Numpy gives 1.4 (same with R).
  • Multiplying 1.22*1.22 gives 1.4884 in Numpy and the same with R.
  • However, multiplying 1.222*1.222 in Numpy gives 1.4932839999999998 which is clearly wrong! Doing the multiplication in R gives the correct answer of 1.493284.
  • Multiplying 1.2222*1.2222 in Numpy gives 1.4937728399999999 and 1.493773 in R. Once more, R is correct.

In Numpy, the numbers are float64 datatype and they are double in R. What is going on here? Why are Numpy and R giving different results? I know R uses IEEE754 double-precision but I don't know what precision Numpy uses. How can I change Numpy to give me the "correct" answer?
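One thing worth ruling out for the IQR specifically: NumPy and R agree on the quantile definition by default (NumPy's `linear` interpolation in `np.percentile` corresponds to R's type 7, which is R's default), so identical data should give identical IQRs. A small check, using a made-up example vector:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 10.0])

# NumPy's default 'linear' interpolation matches R's default quantile type 7
q1, q3 = np.percentile(x, [25, 75])
print(q3 - q1)  # 2.0 -- same as R's IQR(c(1, 2, 3, 4, 10))
```

If the IQRs still differ on your real data, the two platforms are probably not seeing the same values (e.g. different NA/missing-value handling on import).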

  • It would help to show your code so we could address your actual problem. It is also important to distinguish between how floats are being *printed* versus the actual floating point *value*. For instance, in R, `sprintf("%.20f", 1.222*1.222)` prints `"1.49328399999999983372"` which identically matches what you get in Python with `'{:.20f}'.format(1.222*1.222)`. The floating point value is the same, but when you enter `1.222*1.222` at the R prompt, R prints `1.493284` while Python prints `1.4932839999999998` – unutbu Apr 15 '16 at 01:34
  • You might also try changing the `dtype` of your NumPy data to `float128`: `data = data.astype(np.float128)`. This might help, though it's just a shot in the dark without seeing both your Python and R code. – unutbu Apr 15 '16 at 01:38
  • @unutbu: R uses 64-bit floats, so sticking with 64-bit floats in Python is reasonable here. – John Zwinck Apr 15 '16 at 01:40
  • Try reducing your data set to a smaller set that still shows a discrepancy. Post your code and if possible, the reduced data set (you can't paste it here if it's large, so share it elsewhere). – John Zwinck Apr 15 '16 at 01:42
  • I believe unutbu is correct here. Some programming languages will make their output numbers nice, while underneath the true number is a bit different. Take for example `0.1+0.1` The answer should be `0.2` and that's what most languages will tell you, but if you twist their arm and force them to print the number in its full glory, you'll usually get something like `0.2000000000000000111022302`. This is not because the language is wrong, but rather the inherent limits of 64 bit calculations. – zephyr Apr 15 '16 at 02:20
  • _"Multiplying 1.2*1.2 in Numpy gives 1.4"_ - That's not how multiplication works! – Eric Apr 15 '16 at 03:58
  • 1.2222*1.2222 = 1.49377284, so numpy is within 10^-16, which is pretty good, given that there are no natural constants or measurable physical quantities known to that relative accuracy out there. The value given for R is simply rounded. Both R and numpy are fine, you are just using a rounded representation for R. As @unutbu noted. The statistical quantities are more serious, is there a possible 1/n vs 1/(n-1) definition difference (at least for standard deviation vs sample stdev)? – roadrunner66 Apr 15 '16 at 04:27
  • Actually the example suggests that R ostensibly uses _single-precision_ floats, so I'm a bit confused why you are complaining about python. R is not printing double-precision values, even though it uses double-precision in calculations. – Jan Christoph Terasa Apr 15 '16 at 05:20
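The 1/n vs 1/(n-1) point raised in the comments is easy to check directly: NumPy's `np.std` defaults to the population formula (`ddof=0`), while R's `sd()` divides by n - 1. A small sketch with a toy vector:

```python
import numpy as np

x = np.array([13.0, 14.0, 15.0, 16.0, 17.0])

print(np.std(x))          # population sd (divide by n):     1.4142135623730951
print(np.std(x, ddof=1))  # sample sd (divide by n - 1),
                          # matching R's sd():               1.5811388300841898
```

With 795067 rows the two differ only in the 6th decimal or so, which does not explain a difference of 0.002 in the standard deviation by itself, but it is one of the definition mismatches to eliminate.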

1 Answer


Python

Python's print statement (this is Python 2 syntax) rounds floats when converting them to strings, so fewer digits are displayed than are actually stored; the calculations themselves are done in the precision specified. Python/numpy uses double-precision floats by default (at least on my 64-bit machine):

import numpy

single = numpy.float32(1.222) * numpy.float32(1.222)
double = numpy.float64(1.222) * numpy.float64(1.222)
pyfloat = 1.222 * 1.222

print single, double, pyfloat
# 1.49328 1.493284 1.493284

print "%.16f, %.16f, %.16f"%(single, double, pyfloat)
# 1.4932839870452881, 1.4932839999999998, 1.4932839999999998

In an interactive Python/iPython shell, the shell displays the repr of a result, which prints enough digits to uniquely round-trip the double-precision value:

>>> 1.222 * 1.222
1.4932839999999998

In [1]: 1.222 * 1.222
Out[1]: 1.4932839999999998

R

It looks like R is doing the same as Python when using print and sprintf:

print(1.222 * 1.222)
# 1.493284

sprintf("%.16f", 1.222 * 1.222)
# "1.4932839999999998"

In contrast to interactive Python shells, the interactive R shell rounds the results of statements to 7 significant digits by default (options(digits = 7)), even though the underlying values are double-precision:

> 1.222 * 1.222
[1] 1.493284
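You can reproduce R's default display from Python: the `%.7g` format rounds to 7 significant digits, which is what R's `options(digits = 7)` does at the prompt:

```python
x = 1.222 * 1.222

print('%.7g' % x)  # 1.493284 -- what the R prompt shows
print(repr(x))     # 1.4932839999999998 -- the full double-precision repr
```

Same bits, two display conventions.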

Differences between Python and R

The differences in your results could come from single-precision values somewhere in your numpy pipeline. Calculations involving a lot of additions/subtractions accumulate the rounding error and ultimately make the problem surface:

In [1]: import numpy

In [2]: a = numpy.float32(1.222)

In [3]: a*6
Out[3]: 7.3320000171661377

In [4]: a+a+a+a+a+a
Out[4]: 7.3320003

As suggested in the comments on your question, make sure to use double-precision floats in your numpy calculations, and check that you are comparing the same definitions: numpy's np.std defaults to the population standard deviation (ddof=0), while R's sd() computes the sample standard deviation (dividing by n - 1).
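As a minimal sketch (with simulated data standing in for your array, since your code isn't shown): cast to `float64` before computing the statistics, and pass `ddof=1` if you want to match R's `sd()`:

```python
import numpy as np

rng = np.random.default_rng(0)
# simulate data that was accidentally read in single precision
data = rng.normal(13.1, 7.4, size=(1000, 3)).astype(np.float32)

data64 = data.astype(np.float64)   # cast back to double precision
print(data64.dtype)                # float64
print(np.mean(data64), np.median(data64))
print(np.std(data64, ddof=1))      # sample sd, as computed by R's sd()
```

If the numbers still disagree after this, compare how the two platforms imported the file (missing values, headers, locale decimal separators).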

Jan Christoph Terasa