As pointed out in the answer by Paul Hiemstra, there is no way of knowing whether the correlation would have been higher or lower without missing values. However, for some applications it may be appropriate to penalize the observed correlation for non-matching missing values. For example, if we compare two individual coders, we may want coder B to say "NA" if and only if coder A says "NA" as well, plus we want their non-NA values to correlate.
Under these assumptions, a simple way to penalize non-matching missing values is to compute correlation for complete cases and multiply by the proportion of observations that are matched in terms of their NA-status. The penalty term can then be defined as: 1 - mean((is.na(coderA) & !is.na(coderB)) | (!is.na(coderA) & is.na(coderB)))
. A simple illustration follows.
fun = function(x1, x2, idx_rm) {
temp = x2
# remove 'idx_rm' points from x2
temp[idx_rm] = NA
# calculate correlations
r_full = round(cor(x1, x2, use = 'pairwise.complete.obs'), 2)
r_NA = round(cor(x1, temp, use = 'pairwise.complete.obs'), 2)
penalty = 1 - mean((is.na(temp) & !is.na(x1)) |
(!is.na(temp) & is.na(x1)))
r_pen = round(r_NA * penalty, 2)
# plot
plot(x1, temp, main = paste('r_full =', r_full,
'; r_NA =', r_NA,
'; r_pen =', r_pen),
xlim = c(-4, 4), ylim = c(-4, 4), ylab = 'x2')
points(x1[idx_rm], x2[idx_rm], col = 'red', pch = 16)
regr_full = as.numeric(summary(lm(x2 ~ x1))$coef[, 1])
regr_NA = as.numeric(summary(lm(temp ~ x1))$coef[, 1])
abline(regr_full[1], regr_full[2])
abline(regr_NA[1], regr_NA[2], lty = 2)
}
Run a simple simulation to illustrate the possible effects of missing values and penalization:
set.seed(928)
x1 = rnorm(20)
x2 = x1 * rnorm(20, mean = 1, sd = .8)
# A case when NA's artifically inflate the correlation,
# so penalization makes sense:
myfun(x1, x2, idx_rm = c(13, 19))
# A case when NA's DEflate the correlation,
# so penalization may be misleading:
myfun(x1, x2, idx_rm = c(6, 14))
# When there are a lot of NA's, penalization is much stronger
myfun(x1, x2, idx_rm = 7:20)
# Some NA's in x1:
x1[1:5] = NA
myfun(x1, x2, idx_rm = c(6, 14))