I am trying to estimate a weighted LM with repeated observations. Naturally, one can exploit this redundancy information to save space and memory. I am working on an application with 400k observations, many of which are duplicated, and said LMs (and more complicated methods) are invoked many times, which is why weighted OLS is equivalent to OLS with duplicated observations. Without this memory-saving trick, the analysis cannot proceed because this robust VCOV is invoked hundreds of times. The idea of using observation counts to reduce computational burden is not new; a good example where it shines is Owen, 2017 (‘A weighted self-concordant optimization for empirical likelihood’).
The point estimates in the two variants of the model, unweighted (m
) and weighted (m.w
), are identical. However, sandwich::vcovHC
does not handle the weights as if they were observations weights.
MWE:
rm(list = ls())
x <- c(1, 1, 1, 1, 2:10) # Observation 1 has a count of 4
y <- c(15, 15, 15, 15, 13, 36, 31, 37, 63, 66, 80, 91, 94)
x.uniq <- x[-(1:3)]
y.uniq <- y[-(1:3)]
w <- c(4, rep(1, 9))
m <- lm(y ~ x)
m.w <- lm(y.uniq ~ x.uniq, weights = w)
coef(m) - coef(m.w) # Indentical
round(sandwich::vcovHC(m, type = "HC0"), 2)
# (Intercept) x
#(Intercept) 5.06 -0.53
#x -0.53 0.11
round(sandwich::vcovHC(m.w, type = "HC0"), 2) # Very different
This is how the correct VCOV should be computed in theory:
# White's asymptotic sandwich formula manually:
br <- solve(crossprod(model.matrix(m)))
me <- crossprod(model.matrix(m) * resid(m))
V <- br %*% me %*% br
round(V, 2)
# (Intercept) x.uniq
#(Intercept) 5.06 -0.53
#x.uniq -0.53 0.11
# Same formula, with observation counts
br.w <- solve(crossprod(model.matrix(m.w) * sqrt(weights(m.w))))
me.w <- crossprod(model.matrix(m.w) * resid(m.w) * sqrt(weights(m.w)))
V.w <- br.w %*% me.w %*% br.w
round(V.w, 2)
# (Intercept) x.uniq
#(Intercept) 5.06 -0.53
#x.uniq -0.53 0.11
The problem can be seen in the sandwich::meatHC
function. The following code computes the meat manually (simplest case, corresponding to type = "HC0"
)
X <- model.matrix(m)
X.w <- model.matrix(m.w)
res <- rowMeans(estfun(m)/X)
res.w <- rowMeans(estfun(m.w)/X.w)
omega <- res^2
omega.w <- res.w^2
meat <- crossprod(sqrt(omega) * X) / nrow(X)
meat.w <- crossprod(sqrt(omega.w) * X.w) / nrow(X.w)
all.equal(meat, meatHC(m, type = "HC0")) # TRUE
all.equal(meat.w, meatHC(m.w, type = "HC0")) # TRUE
Am I being silly and not seeing some obvious arguments that should be passed as ...
to estfun
, as well as the need for a custom omega
vector?
Yours sincerely, Andreï.