
I have a huge matrix with a lot of missing values, and I want to compute the correlations between its variables.

1. Is the solution

cor(na.omit(matrix))

better than below?

cor(matrix, use = "pairwise.complete.obs")

I have already selected only variables having more than 20% of missing values.

2. Which method makes the most sense?

zx8754
Delphine

4 Answers


I would vote for the second option. It sounds like you have a fair amount of missing data, so you should also be looking for a sensible multiple imputation strategy to fill in the gaps. See Harrell's text "Regression Modeling Strategies" for a wealth of guidance on how to do this properly.
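
To make the difference between the two options in the question concrete, here is a minimal base R sketch. The toy matrix `m` and its values are made up for illustration. It shows that `cor(na.omit(m))` is the same as `use = "complete.obs"` (rows with any NA are dropped entirely), while the pairwise option keeps more observations per pair of variables.

```r
# Hypothetical toy data: 6 rows, 3 variables, scattered NAs
m <- cbind(
  a = c(1, 2, 3, 4, 5, 6),
  b = c(2, 1, 4, 3, NA, 5),
  c = c(NA, 3, 1, NA, 2, 6)
)

# Option 1: drop every row that contains at least one NA (listwise deletion)
r_listwise <- cor(na.omit(m))
r_complete <- cor(m, use = "complete.obs")  # equivalent way to ask for the same thing

# Option 2: for each pair of columns, use every row where both are observed
r_pairwise <- cor(m, use = "pairwise.complete.obs")

# Number of rows each approach actually uses
n_listwise <- nrow(na.omit(m))                   # one number for the whole matrix
n_pairwise <- crossprod(ifelse(is.na(m), 0, 1))  # one number per pair of columns

n_listwise
n_pairwise
```

With many variables and scattered missing values, the listwise count can shrink to almost nothing, which is the main practical argument for the pairwise option.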

IRTFM

I think the second option makes more sense.

You might consider using the rcorr function in the Hmisc package.

It is very fast and uses only pairwise complete observations. The returned object contains three matrices:

  1. the correlation coefficients (r)
  2. the number of observations used for each correlation (n)
  3. the p-value for each correlation (P)

This means that you can ignore correlation values based on a small number of observations (whatever that threshold is for you) or based on the p-value.

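library(Hmisc)

# 10 x 10 matrix of uniform random values; entries above 0.5 become NA
x <- matrix(runif(100), nrow = 10, ncol = 10)
x[x > 0.5] <- NA

# rcorr() returns a list with the matrices r, n and P
result <- rcorr(x)
result$r[result$n < 5] <- 0  # ignore correlations based on fewer than five observations
result$r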
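
If Hmisc is not available, the same "ignore pairs with too few observations" idea can be sketched in base R. This is only an approximation of the approach above, not a replacement for `rcorr`: it gives no p-values, the threshold of five is taken from the example above, and masked entries are set to NA here rather than 0 (an assumption on my part, so that they cannot be mistaken for a real zero correlation).

```r
set.seed(1)  # arbitrary seed, only so the sketch is reproducible
x <- matrix(runif(100), nrow = 10, ncol = 10)
x[x > 0.5] <- NA

# n_pairs[i, j] = number of rows where columns i and j are both observed
observed <- ifelse(is.na(x), 0, 1)
n_pairs <- crossprod(observed)

# Pairwise-complete correlations; pairs with too little data may warn or be NA
r <- suppressWarnings(cor(x, use = "pairwise.complete.obs"))

# Mask correlations that rest on fewer than five shared observations
r[n_pairs < 5] <- NA
r
```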
Iain

For future readers, the essay "Pairwise-complete correlation considered dangerous" may be valuable. It argues that cor(matrix, use = "pairwise.complete.obs") is dangerous and suggests alternatives such as use = "complete.obs".
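
One concrete way pairwise deletion can misbehave is sketched below. This is a constructed illustration with made-up data, not an example taken from the linked essay. Because each entry of the matrix is computed from a different subset of rows, the entries can be mutually inconsistent, and the resulting matrix need not be a valid (positive semidefinite) correlation matrix.

```r
# Hypothetical data: each pair of variables is jointly observed on different rows
m <- cbind(
  x = c(1, 2, 3, NA, NA, NA, 1, 2, 3),
  y = c(1, 2, 3, 1, 2, 3, NA, NA, NA),
  z = c(NA, NA, NA, 1, 2, 3, 3, 2, 1)
)

r_pair <- cor(m, use = "pairwise.complete.obs")
r_pair

# x and y move together, y and z move together, yet x and z move in
# opposite directions: no single complete data set could produce this.
eig <- eigen(r_pair, symmetric = TRUE)$values
eig
min(eig) < 0  # a negative eigenvalue means the matrix is not positive semidefinite
```

A matrix like this can break downstream methods that assume a proper correlation matrix, such as PCA or factor analysis, so it is worth checking the eigenvalues before using pairwise results that way.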

Triamus
    I can't recommend that essay at all. The author proposes to show a counter-example of where pairwise correlations obviously wouldn't make intuitive sense, but nowhere is the mathematical definition of the correlation coefficient even mentioned. Consider this example, merely an extension of the author's demo: if A and B agree on all observations, but A has 99 observations and B only has 97, is it really absurd that pairwise-cor gives a correlation of 1, and would you conclude with the author that a correlation of NA is more reasonable? – David Klotz Mar 23 '18 at 02:00

Try the WGCNA package. With default settings, base R's cor returns NA for any pair of variables that involves missing values, and some other packages, such as ppcor, fail on NA input. You need to get rid of the NAs or set the appropriate options. WGCNA handles the missing values for you and also provides statistics such as p-values for the calculated correlations.
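
As a quick check of what base R does out of the box, here is a minimal sketch with made-up values. With the default `use = "everything"`, `cor` does not skip missing values, so the result is NA until a `use` option is set.

```r
# Hypothetical vectors; varZ has three missing values
varX <- 1:10
varZ <- c(NA, 0.3, -1.2, 0.8, NA, 1.5, NA, -0.4, 0.9, 0.1)

r_default  <- cor(varX, varZ)                                # default: use = "everything"
r_pairwise <- cor(varX, varZ, use = "pairwise.complete.obs") # drops the incomplete pairs

is.na(r_default)   # TRUE: the NAs propagate into the result
is.na(r_pairwise)  # FALSE: computed from the 7 complete pairs
```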

library(WGCNA)
varX <- seq(from=1, to=10, length=10)
varY <- seq(from=20, to=50, length=10)
varZ <- rnorm(10)

varZ[c(1,5,7)] <- NA

mat <- cbind(varX, varY, varZ)

corAndPvalue(mat, method='spearman')
$cor
     varX varY varZ
varX  1.0  1.0  0.5
varY  1.0  1.0  0.5
varZ  0.5  0.5  1.0

$p
             varX         varY         varZ
varX 1.063504e-62 1.063504e-62 2.531700e-01
varY 1.063504e-62 1.063504e-62 2.531700e-01
varZ 2.531700e-01 2.531700e-01 1.411089e-39

$Z
          varX      varY      varZ
varX 51.953682 51.953682  1.228286
varY 51.953682 51.953682  1.228286
varZ  1.228286  1.228286 41.072992

$t
             varX         varY         varZ
varX 1.342177e+08 1.342177e+08 1.290994e+00
varY 1.342177e+08 1.342177e+08 1.290994e+00
varZ 1.290994e+00 1.290994e+00 1.061084e+08

$nObs
     varX varY varZ
varX   10   10    7
varY   10   10    7
varZ    7    7    7
xilliam
mehrdadorm