6

I've got a huge data set with six columns (call them A, B, C, D, E, F), about 450,000 rows. I simply tried to find the correlation between columns A and B:

cor(A, B)

and I got

[1] NA

as a result. What can I do to fix this problem?

Cleb
vieplivee

2 Answers

13

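Try cor(A, B, use = "pairwise.complete.obs"). By default, cor returns NA as soon as either vector contains a missing value; with this setting, it computes the correlation from only those rows where both A and B are present.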
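As a minimal sketch of the difference, assuming A and B are plain numeric vectors (the toy values below are made up for illustration and are not the asker's data):

```r
# Hypothetical stand-ins for columns A and B, each with one missing value
A <- c(1, 2, 3, 4, NA, 6)
B <- c(2, 4, NA, 8, 10, 12)

# Default is use = "everything": any NA in the inputs makes the result NA
cor(A, B)

# Drop every pair in which either value is missing, then correlate the rest
cor(A, B, use = "pairwise.complete.obs")

# For just two vectors, "complete.obs" keeps the same rows as the call above
cor(A, B, use = "complete.obs")
```

With only two vectors the last two calls agree; they can differ when cor is given a whole matrix or data frame, because "complete.obs" then drops a row if any column in it is missing.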

To be statistically rigorous, you should also check how many entries are missing in your data and whether the missing-at-random assumption holds.
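One way to count the missing entries, sketched in base R with the same kind of made-up toy vectors (whether the values are missing at random cannot be read off these counts and needs separate judgment):

```r
# Hypothetical stand-ins for columns A and B
A <- c(1, 2, 3, 4, NA, 6)
B <- c(2, 4, NA, 8, 10, 12)

n_missing_A <- sum(is.na(A))          # missing values in A
n_missing_B <- sum(is.na(B))          # missing values in B

usable <- complete.cases(A, B)        # TRUE where both A and B are present
n_usable <- sum(usable)               # rows the correlation would be based on
frac_dropped <- mean(!usable)         # share of rows discarded

n_missing_A; n_missing_B; n_usable; frac_dropped
```

If a large share of rows is dropped, the correlation describes only the complete rows and may not be representative of the full data set.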

Edit 1: Take a look at ?cor to see other options for the use parameter.

Iterator
4

You might consider using the rcorr function in the Hmisc package.

It is very fast and uses only pairwise complete observations. The returned object contains three matrices:

  1. the correlation coefficients
  2. the number of observations used for each correlation
  3. the p-value for each correlation
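A short sketch of how those three matrices can be inspected, assuming the Hmisc package is installed; the matrix m and its random values are hypothetical and stand in for your own columns:

```r
library(Hmisc)  # assumes install.packages("Hmisc") has been run

# Hypothetical data: 20 rows, 3 columns, with two values in A set to missing
set.seed(1)
m <- matrix(rnorm(60), ncol = 3, dimnames = list(NULL, c("A", "B", "C")))
m[c(2, 7), "A"] <- NA

res <- rcorr(m)   # Pearson by default; type = "spearman" is also available

res$r   # correlation coefficients
res$n   # number of observations used for each pair
res$P   # p-value for each pair (diagonal is NA)
```

rcorr expects a numeric matrix, so a data frame would first need something like as.matrix() on its numeric columns.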

Some example code is available here:

Community
Iain