How to calculate correlation of two variables in a huge data set in R?

Question

I've got a huge data set with six columns (call them A, B, C, D, E, F) and about 450,000 rows. I tried to find the correlation between columns A and B:

cor(A, B)

and I got

[1] NA

as a result. What can I do to fix this problem?

score 13 · Answer 1 · answered Sep 26 '11 at 06:05

Try cor(A, B, use = "pairwise.complete.obs"). That drops every observation in which A or B is NA before computing the correlation, instead of returning NA.
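As a minimal sketch of why this helps, the snippet below uses small made-up vectors standing in for columns A and B (the values are illustrative, not from the question's data). It assumes the NA result comes from missing values, which is the default behaviour of cor when either input contains an NA.

```r
# Toy data standing in for columns A and B (values are made up).
A <- c(1.0, 2.0, NA, 4.0, 5.0, 6.0)
B <- c(2.1, 3.9, 6.2, NA, 9.8, 12.1)

# Default is use = "everything": an NA in either vector makes the result NA.
r_default <- cor(A, B)

# Use only the observations where both A and B are present.
r_pairwise <- cor(A, B, use = "pairwise.complete.obs")

# For two vectors this is the same as dropping incomplete pairs by hand.
ok <- complete.cases(A, B)
r_manual <- cor(A[ok], B[ok])

print(r_default)
print(r_pairwise)
```

With only two vectors, use = "complete.obs" gives the same number; the two options differ only when a whole matrix or data frame is passed to cor.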

To be statistically rigorous, you should also check how many entries are missing in your data and whether the missing-at-random assumption holds.
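One way to start on that check is to count the missing entries, sketched here on a hypothetical data frame df with columns A and B (the name df and the values are assumptions for illustration). Counts alone cannot confirm that data are missing at random; they only show how much is being dropped and in which pattern.

```r
# Hypothetical data frame standing in for the real data set.
df <- data.frame(
  A = c(1.0, 2.0, NA, 4.0, 5.0, 6.0),
  B = c(2.1, 3.9, 6.2, NA, 9.8, 12.1)
)

# Number of NAs in each column.
n_missing <- colSums(is.na(df[, c("A", "B")]))

# Rows that survive pairwise deletion, and the share that is dropped.
n_complete <- sum(complete.cases(df[, c("A", "B")]))
frac_dropped <- 1 - n_complete / nrow(df)

# Cross-tabulate the missingness pattern of the two columns.
pattern <- table(A_missing = is.na(df$A), B_missing = is.na(df$B))

print(n_missing)
print(frac_dropped)
print(pattern)
```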

Edit 1: Take a look at ?cor to see other options for the use parameter.

score 4 · Answer 2 · edited May 23 '17 at 12:14

You might consider using the rcorr function in the Hmisc package.

It is very fast, and uses only pairwise complete observations. The returned object contains three matrices:

- the correlation coefficients
- the number of observations used for each correlation
- a p-value for each correlation
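A short sketch of how rcorr might be called, assuming the Hmisc package is installed and using a hypothetical data frame df with simulated columns A and B. The three matrices are returned as the list elements r, n, and P.

```r
# Assumes Hmisc is installed, e.g. via install.packages("Hmisc").
if (requireNamespace("Hmisc", quietly = TRUE)) {
  # Simulated data standing in for the real columns; two NAs added by hand.
  set.seed(1)
  df <- data.frame(A = rnorm(20), B = rnorm(20))
  df$A[3] <- NA
  df$B[7] <- NA

  # rcorr expects a numeric matrix, so convert the selected columns.
  res <- Hmisc::rcorr(as.matrix(df[, c("A", "B")]), type = "pearson")

  print(res$r["A", "B"])  # correlation coefficient
  print(res$n["A", "B"])  # number of observations used for this pair
  print(res$P["A", "B"])  # p-value for this pair
}
```

Passing all six columns as one matrix would return the full 6 x 6 matrices in a single call.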

Some example code is available here:

answered Sep 26 '11 at 09:59

Iain
