
I'm running a correlation on a large dataset (3500 obs x 1000 var). The problem that I'm facing is a large amount of missing data and I only want to include pairwise observations that meet a certain condition.

Where a pair of vectors has one NA value and one numeric value in the same row (row 1, columns 1 and 3 below), I want to convert the NA to 0 and include that row in the correlation. Where both items in a pair are NA (row 2, columns 1 and 3 below), I want that row removed from the calculation.

      [,1] [,2] [,3]  
[1,]    2  1.5   NA
[2,]   NA  2.0   NA
[3,]    0  0.0    0
[4,]    1  1.0    1
[5,]    2  2.0    2

I've looked into the available options, such as cor(x, use="pairwise.complete.obs") and cor(x, use="complete.obs").

Unfortunately, neither of them solves my problem: both drop a row as soon as either value in the pair is NA.
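To make the mismatch concrete, here is a small sketch that rebuilds the example matrix above and runs both built-in options. The variable names (`x`, `cor_pairwise`, `cor_complete`) are only illustrative. For columns 1 and 3, both options end up using rows 3 to 5 only, so row 1 (one NA, one number) is dropped instead of being kept with the NA set to 0.

```r
# Example matrix from the question (5 observations x 3 variables)
x <- matrix(c( 2, 1.5, NA,
              NA, 2.0, NA,
               0, 0.0,  0,
               1, 1.0,  1,
               2, 2.0,  2),
            ncol = 3, byrow = TRUE)

# For each pair of columns, uses only the rows where both values are non-NA
cor_pairwise <- cor(x, use = "pairwise.complete.obs")

# Uses only the rows that have no NA in any column
cor_complete <- cor(x, use = "complete.obs")

# Columns 1 and 3: row 1 is excluded by both options
cor_pairwise[1, 3]
cor_complete[1, 3]
```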

I was able to solve this by putting each pair in a new data.frame, applying a set of conditions to filter out the undesirable observations, and then running a correlation on that pair. However, it's a really clunky process, even if I put it in a loop. I'm hoping to find a better and simpler way of solving this problem. Any help is greatly appreciated.
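For reference, the pair-by-pair procedure described above might look roughly like the following. This is a minimal sketch, not the exact code I'm running: the function names `cor_pair_zero_fill` and `cor_matrix_zero_fill` are made up for illustration, and it works on plain vectors rather than a temporary data.frame.

```r
# Correlation for one pair of vectors:
# rows where both values are NA are dropped, a lone NA becomes 0.
# (hypothetical helper name)
cor_pair_zero_fill <- function(a, b) {
  keep <- !(is.na(a) & is.na(b))  # drop rows where both are NA
  a <- a[keep]
  b <- b[keep]
  a[is.na(a)] <- 0                # remaining NAs are paired with a number
  b[is.na(b)] <- 0
  cor(a, b)
}

# Loop over every pair of columns and fill a symmetric matrix
# (hypothetical helper name)
cor_matrix_zero_fill <- function(x) {
  x <- as.matrix(x)
  p <- ncol(x)
  out <- matrix(NA_real_, p, p, dimnames = list(colnames(x), colnames(x)))
  for (i in seq_len(p)) {
    for (j in i:p) {
      out[i, j] <- out[j, i] <- cor_pair_zero_fill(x[, i], x[, j])
    }
  }
  out
}

# Example matrix from the question
x <- matrix(c( 2, 1.5, NA,
              NA, 2.0, NA,
               0, 0.0,  0,
               1, 1.0,  1,
               2, 2.0,  2),
            ncol = 3, byrow = TRUE)

res <- cor_matrix_zero_fill(x)
res[1, 3]  # uses rows 1, 3, 4, 5 with the NA in row 1 set to 0
```

With 1000 variables this runs about 500,000 pairwise correlations, which is where the clunkiness comes from.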

  • Just out of curiosity, how do you justify this procedure? – January Jul 12 '19 at 20:13
  • I know it's weird, but it's due to the dataset that I'm using and the subject matter. The matrix that I used in the example above isn't actually from my data set. 1) The NAs are actually 0 in value but come over as NAs when I pivot the dataset. In this dataset a pairwise match of 0 and 0 has a very high chance of being coincidental and not actually reflecting a relationship among the variables. 2) Due to the subject matter of the data set, there are no complete observations. 3) It is impossible to obtain more observations. *ran out of space, see next comment – vincenzo345 Jul 12 '19 at 22:25
  • 4) In this case, using complete pairwise observations will understate the actual relationship between variables, while converting the NAs back to 0 will overstate it. This is the only way I can think of to get the most accurate correlation :/ – vincenzo345 Jul 12 '19 at 22:26
  • Then I think your solution is the most workable one, short of writing your own C code. – January Jul 13 '19 at 09:08
