I'm running a correlation on a large dataset (3500 obs x 1000 var). The problem that I'm facing is a large amount of missing data and I only want to include pairwise observations that meet a certain condition.
In the case where a pair of vectors has 1 NA value and 1 Numeric value, illustrated by row 1, columns 1 and 3 below, I want to convert the NA to a 0 and include it in the correlation. where both items in a pair are NA, illustrated by row 2, columns 1 and 3 below, I want those to be removed from the calculation.
[,1] [,2] [,3]
[1,] 2 1.5 NA
[2,] NA 2.0 NA
[3,] 0 0.0 0
[4,] 1 1.0 1
[5,] 2 2.0 2
I've looked into the methods available such as cor(x, use="pairwise.complete.obs")
and cor(x, use="complete.obs")
Unfortunately the methods above dont solve my problem.
I was able to solve this problem by putting each pair in a new data.frame variable, creating a set of conditions to filter out the undesirable observations and then running a correlation on that pair. However, Its a really clunky process, even if I put it in a loop. I'm hoping to find a much better and simpler way of solving this problem. Any help is greatly appreciated.