
I'm running R version 3.0.2 in RStudio and Excel 2011 for Mac OS X. I'm performing a quantile normalization across 4 sets of 45,015 values each. Yes, I do know about the Bioconductor package, but my question is a lot more general; it could be any other computation. The thing is, when I perform the computation (1) "by hand" in Excel and (2) with a program I wrote from scratch in R, I get highly similar, yet not identical, results. Typically, the values obtained with (1) and (2) differ by less than 1.0%, although sometimes more.

Where is this variation likely to come from, and what should I be aware of concerning numerical approximation in R and/or Excel? Does it come from a lack of floating-point accuracy in either of these programs? How can I avoid it?
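For reference, both R and Excel store numbers as 64-bit floating-point doubles, so tiny representation errors are normal; here is a minimal illustration in R (not my actual data, just the principle). Note that errors at this scale are many orders of magnitude smaller than the 1% I am seeing:

0.1 + 0.2 == 0.3                    # FALSE: neither side is exactly representable
0.1 + 0.2 - 0.3                     # about 5.6e-17, a typical representation error
isTRUE(all.equal(0.1 + 0.2, 0.3))   # TRUE: comparison with a tolerance
.Machine$double.eps                 # about 2.2e-16, machine epsilon for doubles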

[EDIT] As was suggested to me in the comments, this may be case-specific. To provide some context, I describe methods (1) and (2) below in detail, using test data with 9 rows. The four data sets are called A, B, C, and D.

[POST-EDIT COMMENT] When I perform this on a very small data set (test sample: 9 rows), the results in R and Excel do not differ. But when I apply the same code to the real data (45,015 rows), I get slight variation between R and Excel. I have no clue why that may be.

(2) R code:

dataframe A

Aindex          A 
     1 2.1675e+05 
     2 9.2225e+03  
     3 2.7925e+01  
     4 7.5775e+02  
     5 8.0375e+00 
     6 1.3000e+03 
     7 8.0575e+00
     8 1.5700e+02
     9 8.1275e+01

dataframe B

Bindex          B
     1 215250.000
     2  10090.000
     3     17.125
     4    750.500
     5      8.605 
     6   1260.000 
     7      7.520 
     8    190.250
     9     67.350 

dataframe C

Cindex          C 
     1 2.0650e+05 
     2 9.5625e+03 
     3 2.1850e+01 
     4 1.2083e+02  
     5 9.7400e+00   
     6 1.3675e+03
     7 9.9325e+00
     8 1.9675e+02
     9 7.4175e+01

dataframe D

Dindex           D 
     1 207500.0000
     2   9927.5000
     3     16.1250
     4    820.2500
     5     10.3025
     6   1400.0000
     7    120.0100
     8    175.2500
     9     76.8250
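For reproducibility, the four test data frames above can be reconstructed in R as follows (values copied from the listings):

A <- data.frame(Aindex = 1:9,
                A = c(2.1675e+05, 9.2225e+03, 2.7925e+01, 7.5775e+02,
                      8.0375e+00, 1.3000e+03, 8.0575e+00, 1.5700e+02, 8.1275e+01))
B <- data.frame(Bindex = 1:9,
                B = c(215250.000, 10090.000, 17.125, 750.500, 8.605,
                      1260.000, 7.520, 190.250, 67.350))
C <- data.frame(Cindex = 1:9,
                C = c(2.0650e+05, 9.5625e+03, 2.1850e+01, 1.2083e+02,
                      9.7400e+00, 1.3675e+03, 9.9325e+00, 1.9675e+02, 7.4175e+01))
D <- data.frame(Dindex = 1:9,
                D = c(207500.0000, 9927.5000, 16.1250, 820.2500, 10.3025,
                      1400.0000, 120.0100, 175.2500, 76.8250))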

Code:

#re-order by ascending values
A <- A[order(A$A),, drop=FALSE]
B <- B[order(B$B),, drop=FALSE]
C <- C[order(C$C),, drop=FALSE]
D <- D[order(D$D),, drop=FALSE]
row.names(A) <- NULL
row.names(B) <- NULL
row.names(C) <- NULL
row.names(D) <- NULL

#compute average
qnorm <- data.frame(cbind(A$A,B$B,C$C,D$D))
colnames(qnorm) <- c("A","B","C","D")
qnorm$qnorm <- (qnorm$A+qnorm$B+qnorm$C+qnorm$D)/4

#replace original values by average values
A$A <- qnorm$qnorm
B$B <- qnorm$qnorm
C$C <- qnorm$qnorm
D$D <- qnorm$qnorm

#re-order by index number
A <- A[order(A$Aindex),,drop=FALSE]
B <- B[order(B$Bindex),,drop=FALSE]
C <- C[order(C$Cindex),,drop=FALSE]
D <- D[order(D$Dindex),,drop=FALSE]
row.names(A) <- NULL
row.names(B) <- NULL
row.names(C) <- NULL
row.names(D) <- NULL
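For comparison, the same computation can be written more compactly with rank(); this is a sketch of the standard approach, and it assumes it runs on the original data frames, before the code above overwrites them (with ties, the choice of ties.method can change the result):

m      <- cbind(A = A$A, B = B$B, C = C$C, D = D$D)
ranks  <- apply(m, 2, rank, ties.method = "first")   # rank of each value in its column
means  <- rowMeans(apply(m, 2, sort))                # mean of each sorted rank position
qnormd <- apply(ranks, 2, function(r) means[r])      # map every value to its rank's mean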

(1) Excel

  1. Assign index numbers to each set.

[screenshot: Excel step 1]

  2. Re-order each set in ascending order: select the columns two by two and use Custom Sort... by A, B, C, or D.

[screenshot: Excel step 2]

  3. Calculate =AVERAGE() over columns A, B, C, and D.

[screenshot: Excel step 3]

  4. Replace the values in columns A, B, C, and D by those in the average column, using Paste Special... > Values.

[screenshot: Excel step 4]

  5. Re-order everything according to the original index numbers.

[screenshot: Excel step 5]

biohazard
  • Some examples are required here, as well as the code you used. – JeremyS Mar 19 '14 at 07:05
  • OK, you are right, maybe it is case-specific; give me a minute or two and I will include the code – biohazard Mar 19 '14 at 07:05
  • See `?quantile` in R- there are quite a few different ways to calculate quantiles, so even without floating point accuracy, you may see differences depending on the method used. – Marius Mar 19 '14 at 07:07
  • Am I getting negative points because I didn't include any code? Is it not allowed to ask general questions on Stack Overflow? Or is the question plain stupid? – biohazard Mar 19 '14 at 07:07
  • Don't worry about the negative vote (it wasn't me btw :-)); I think the question makes sense if you give a specific example, but I don't think there is a general case here – lebatsnok Mar 19 '14 at 07:38
  • Lack of a reproducible example almost always gets downvotes. Pictures instead of text usually don't help either: you can't copy and paste the values from a picture – JeremyS Mar 19 '14 at 07:38
  • I agree with JeremyS. Of course, you need to change your question to make both the Excel and R examples reproducible – lebatsnok Mar 19 '14 at 07:43
  • I don't expect people to use Excel to reproduce this & there is no code to manipulate in Excel. On the other hand, I am about to include the data & R code in text format, which may prompt an explanation as to why the results would differ; just give me a little more time. (>_<) – biohazard Mar 19 '14 at 07:44
  • Sorry for the initially sloppy question. I added details about both R and Excel methods, they are intended to be the same method. I apologize for the screenshots but I don't know how to show Excel manipulations more clearly than with screenshots. And to top it all I didn't encounter the cross-platform reproducibility problem with the smaller test data set. T_T – biohazard Mar 19 '14 at 08:15
  • Your example does not seem to reproduce the problem. R results seem to be identical to the Excel output. However, do you have ties in your real data? How does Excel deal with ties when sorting? – Roland Mar 19 '14 at 08:43
  • My guess would be that the problem comes from somewhere in the "by hand" part of Excel. – JeremyS Mar 19 '14 at 09:17
  • There's no real reason to trust Excel in steps like "paste...values" to carry full precision, nor that Excel's machine precision is the same as R's. – Carl Witthoft Mar 19 '14 at 11:24
  • I figured out why this was happening. There was a discrepancy between the original TAB-delimited text file (6 decimals) and subsequent data which I stored in CSV format, which for some reason only had 2 decimals. I was using the 6-decimal data for one analysis and the 2-decimal data for the other, hence the difference in results. I posted an answer below and am wondering what I should do with this thread. Sorry for taking your time. – biohazard Mar 19 '14 at 13:03
  • Why oh why would you make a variable called qnorm? – Dason Mar 19 '14 at 15:00
  • Why not...? It just seemed easier to remember. If there is a serious reason against it, I would be happy to know. [edit] Ah, there is a `qnorm()` function. Sorry for being such a noob. – biohazard Mar 19 '14 at 15:02

2 Answers


If you use exactly the same algorithm, you will get exactly the same results: not within 1%, but identical to the 10th decimal place. So you're not using the same algorithms. The details probably won't change this general answer.

(Or it could be a bug in Excel or R, but this is less likely.)
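For example, with hypothetical result vectors x_r and x_excel holding the two sets of results, you can check how closely they agree:

identical(x_r, x_excel)                     # bit-for-bit equality
all.equal(x_r, x_excel, tolerance = 1e-10)  # TRUE, or a report of the mean relative difference
max(abs(x_r - x_excel))                     # size of the largest discrepancy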

lebatsnok
  • I added all the details for both R and Excel in the OP. Unfortunately, I didn't encounter the cross-platform reproducibility problem with the smaller test data set. How can that be? T_T – biohazard Mar 19 '14 at 08:16
  • Maybe you forgot to select all 45,015 rows in Excel in every case? Or something happened when sorting the data in Excel? Or it's about ties, as Roland suggested above: if there are ties, sorting algorithms can behave differently – lebatsnok Mar 19 '14 at 10:46

Answering my own question!

It ended up being Excel's fault (well, kind of): at some point, either in the conversion from the original TAB-delimited file to CSV, or later on when I started copying and pasting stuff, the values got rounded up.

The original TAB-delimited files had 6 decimals, whereas the CSV files only had 2. I had been doing the analysis so far with the quantile normalization done in Excel from the 6-decimal data, whereas my quantile normalization function in R read the data from the CSV files, hence the discrepancy.

For the above examples for R and Excel respectively, I used data coming from the same source, which is why I got the same results.
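In hindsight, a quick way to catch this kind of silent rounding is to compare the two files directly in R; the file names here are hypothetical stand-ins for my actual files:

tab <- read.delim("data_original.txt")   # the 6-decimal TAB-delimited original
csv <- read.csv("data_rounded.csv")      # the CSV copy that lost decimals
max(abs(tab$A - csv$A))                  # > 0 reveals the lost precision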

What would you suggest now that I have figured this out: (1) change the title to let other clueless people know that this kind of thing can happen, or (2) consider this post useless and delete it?

biohazard
  • BTW, Excel rounds data incorrectly (it always rounds "5" upwards), so aside from loss of precision, you'll get a bias in your dataset. – Carl Witthoft Mar 19 '14 at 14:10
  • That's good to know, I thought I was being careful with the "Paste Special > Values" option... – biohazard Mar 19 '14 at 14:57
  • I would say that this is a classic "learning experience" that we've all had or, if not, will have soon. So leave it up, and mark your answer as accepted. – Andy Clifton Mar 19 '14 at 15:28