Calculating a correlation coefficient that includes missing values

Question

I'm looking to calculate some form of correlation coefficient in R (or any common stats package actually) in which the value of the correlation is influenced by missing values. I am not sure if this is possible and am looking for a method. I do not want to impute data, but actually want the correlation to be reduced based on the number of incomplete cases included in some systematic fashion. The data are a series of time points generated by different individuals and the correlation coefficient is being used to compute reliability. In many cases, one individual's data will include several more time points than the other individual...

Again, not sure if there is any standard procedure for dealing with such a situation.

If you are reducing correlation, you are, in effect, imputing data to affect that change. — Alex Reynolds, Dec 02 '11 at 08:08

G. Grothendieck · Answer 1 · 2011-12-02T13:20:55.097

One thing to look at is fitting a logistic regression to whether or not a point is missing. If there is no relationship then that provides support for assuming that the missing values won't provide any information. If that is your case then you won't have to impute anything and can just perform your computation without the missing values. glm in R can be used for logistic regression.

Also on a different note, see the use="pairwise.complete.obs" argument to cor which may or may not apply to you.

EDIT: I have revised this answer based on rereading the question.

score 2 · Answer 2 · answered Dec 02 '11 at 08:30

My feeling is that when there is a datapair that has one of the timeseries showing NA, that pair cannot be used for calculating a correlation as there is no information at that point. As there is no information on that point, there is no way to know how it would influence the correlation. Specifying that an NA reduces the correlation seems tricky, if an observation would be present at a point this could just as easily have improved the correlation.

Default behavior in R is to return NA for the correlation if there is an NA present. This behavior can be tweaked using the 'use' argument. See the documentation of that function for more details.

I agree with you, and nicely described. In a way, the correlation is already penalised by missing data, as this will reduce the sample size used to calculate the p-value. — Michelle, Dec 02 '11 at 19:07

score 1 · Answer 3 · answered Jul 13 '17 at 06:29

As pointed out in the answer by Paul Hiemstra, there is no way of knowing whether the correlation would have been higher or lower without missing values. However, for some applications it may be appropriate to penalize the observed correlation for non-matching missing values. For example, if we compare two individual coders, we may want coder B to say "NA" if and only if coder A says "NA" as well, plus we want their non-NA values to correlate.

Under these assumptions, a simple way to penalize non-matching missing values is to compute correlation for complete cases and multiply by the proportion of observations that are matched in terms of their NA-status. The penalty term can then be defined as: 1 - mean((is.na(coderA) & !is.na(coderB)) | (!is.na(coderA) & is.na(coderB))). A simple illustration follows.

fun = function(x1, x2, idx_rm) {
  temp = x2
  # remove 'idx_rm' points from x2
  temp[idx_rm] = NA

  # calculate correlations
  r_full = round(cor(x1, x2, use = 'pairwise.complete.obs'), 2)
  r_NA = round(cor(x1, temp, use = 'pairwise.complete.obs'), 2)
  penalty = 1 - mean((is.na(temp) & !is.na(x1)) |
                       (!is.na(temp) & is.na(x1)))
  r_pen = round(r_NA * penalty, 2)

  # plot
  plot(x1, temp, main = paste('r_full =', r_full, 
                              '; r_NA =', r_NA,
                              '; r_pen =', r_pen),
       xlim = c(-4, 4), ylim = c(-4, 4), ylab = 'x2')
  points(x1[idx_rm], x2[idx_rm], col = 'red', pch = 16)

  regr_full = as.numeric(summary(lm(x2 ~ x1))$coef[, 1])
  regr_NA = as.numeric(summary(lm(temp ~ x1))$coef[, 1])
  abline(regr_full[1], regr_full[2])
  abline(regr_NA[1], regr_NA[2], lty = 2)
}

Run a simple simulation to illustrate the possible effects of missing values and penalization:

set.seed(928)
x1 = rnorm(20)
x2 = x1 * rnorm(20, mean = 1, sd = .8)
# A case when NA's artifically inflate the correlation, 
# so penalization makes sense:
myfun(x1, x2, idx_rm = c(13, 19))

# A case when NA's DEflate the correlation, 
# so penalization may be misleading:
myfun(x1, x2, idx_rm = c(6, 14))

# When there are a lot of NA's, penalization is much stronger
myfun(x1, x2, idx_rm = 7:20)

# Some NA's in x1:
x1[1:5] = NA
myfun(x1, x2, idx_rm = c(6, 14))

Calculating a correlation coefficient that includes missing values

3 Answers3