Essentially my question is simple: I want to know where inside the lm function the data is subsetted (where the NAs are removed), based on all the variables used in the formula. The reason is that I want to sum a variable using only the subsetted data (with the NAs removed), not my full dataset.
Background:
I tried to adapt the lm function in R to account for the correct degrees of freedom when using weights in lm. I thought the easiest solution would be to compute the sum of weights for the subsetted data, contingent on the selected variables, from within the function. So first I want to see where lm subsets the dataset based on the selected variables, and compute sum(ind$weight_freq) there, which I want to return as one of the outputs of the function so that I can refer to it.
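A minimal sketch of the quantity described above, computed outside lm for comparison. It assumes a small hypothetical data frame d (not the ind data) and relies on the default na.action of na.omit: model.frame() is the step that drops incomplete rows, and the fitted object keeps the weights of the rows that were actually used.

```r
# Hypothetical toy data: one row has a missing predictor
d <- data.frame(
  x = c(NA, 0, 1, 2, 3),
  y = c(1, 0, 2, 6, 11),
  weight_freq = c(1, 10, 20, 30, 39)
)

# model.frame() applies na.action (na.omit by default), so rows with an NA
# in any variable of the formula, or in the weights, are dropped here
mf <- model.frame(y ~ x, data = d, weights = weight_freq)
nrow(d)   # 5 rows in the full data
nrow(mf)  # 4 rows after NA removal

# lm() builds the same model frame internally; the fitted object records
# which rows were dropped and the weights of the rows that were kept
fit <- lm(y ~ x, data = d, weights = weight_freq)
fit$na.action            # index of the omitted row
w_sum <- sum(weights(fit))
w_sum                    # 99, the total weight of the rows used in the fit
```

If this matches what is needed, the sum may not have to be computed inside a modified copy of lm at all, since it can be read off the fitted object.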
Example data:
library(dplyr)
set.seed(1024)
# individual (true) dataset
x <- round(rnorm(1e5))
y <- round(x + x^2 + rnorm(1e5))
ind <- data.frame(x, y)
# Create an NA value
ind[1,1] <- NA
ind <- ind %>%
  group_by(x, y) %>%
  summarize(weight_freq = n())
I started out simply by copying the lm code into a new function and replacing every <- with <<-, to see where the data is subsetted, calling lm_plus(y ~ x, data = ind, weights = weight_freq). Although I get the error "Error in lm_plus(y ~ x, data = ind, weights = weight_freq) : argument "offset" is missing, with no default", the code gets far enough for the subsetting to happen (because mf should be 99999 due to the one NA, and it is):
lm_plus <- function (formula, data, subset, weights, na.action, method = "qr",
                     model = TRUE, x = FALSE, y = FALSE, qr = TRUE, singular.ok = TRUE,
                     contrasts = NULL, offset, ...)
{
  ret.x <<- x
  ret.y <<- y
  cl <<- match.call()
  mf <<- match.call(expand.dots = FALSE)
  m <<- match(c("formula", "data", "subset",
                "weights", "na.action", "offset"),
              names(mf), 0L)
  mf <<- mf[c(1L, m)]
  mf$drop.unused.levels <<- TRUE
  mf[[1L]] <<- quote(stats::model.frame)
  mf <<- eval(mf, parent.frame())
  if (method == "model.frame")
    return(mf)
  else if (method != "qr")
    warning(gettextf("method = '%s' is not supported. Using 'qr'",
                     method), domain = NA)
  mt <<- attr(mf, "terms")
  y <<- model.response(mf, "numeric")
  w <<- as.vector(model.weights(mf))
  if (!is.null(w) && !is.numeric(w))
    stop("'weights' must be a numeric vector")
  offset <<- model.offset(mf)
  mlm <<- is.matrix(y)
  ny <<- if (mlm)
    nrow(y)
  else length(y)
  if (!is.null(offset)) {
    if (!mlm)
      offset <<- as.vector(offset)
    if (NROW(offset) != ny)
      stop(gettextf("number of offsets is %d, should equal %d (number of observations)",
                    NROW(offset), ny), domain = NA)
  }
  if (is.empty.model(mt)) {
    x <<- NULL
    z <<- list(coefficients = if (mlm) matrix(NA_real_, 0,
                                              ncol(y)) else numeric(), residuals = y, fitted.values = 0 *
                 y, weights = w, rank = 0L, df.residual = if (!is.null(w)) sum(w !=
                                                                                  0) else ny)
    if (!is.null(offset)) {
      z$fitted.values <<- offset
      z$residuals <<- y - offset
    }
  }
  else {
    x <<- model.matrix(mt, mf, contrasts)
    z <<- if (is.null(w))
      lm.fit(x, y, offset = offset, singular.ok = singular.ok,
             ...)
    else lm.wfit(x, y, w, offset = offset, singular.ok = singular.ok,
                 ...)
  }
  class(z) <<- c(if (mlm) "mlm", "lm")
  z$na.action <<- attr(mf, "na.action")
  z$offset <<- offset
  z$contrasts <<- attr(x, "contrasts")
  z$xlevels <<- .getXlevels(mt, mf)
  z$call <<- cl
  z$terms <<- mt
  if (model)
    z$model <<- mf
  if (ret.x)
    z$x <<- x
  if (ret.y)
    z$y <<- y
  if (!qr)
    z$qr <<- NULL
  z
}
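The "offset is missing" error appears to be a side effect of the <<- replacement itself rather than of the subsetting. The sketch below isolates that behaviour with two hypothetical helper functions (broken and fixed are illustrative names, not part of lm): <<- assigns in an enclosing environment, so the function's own formal argument stays missing and fails as soon as it is read.

```r
# `<<-` skips the function's own frame and assigns in an enclosing
# environment (here the global one), so the formal argument `offset`
# is still missing inside the function when it is read afterwards
broken <- function(offset) {
  offset <<- NULL
  is.null(offset)   # reads the missing formal argument -> error
}

# Ordinary assignment creates a local variable that shadows the argument
fixed <- function(offset) {
  offset <- NULL
  is.null(offset)
}

# Capture the error message instead of stopping the script
msg <- tryCatch(broken(), error = function(e) conditionMessage(e))
msg
fixed()   # TRUE, no error
```

For inspecting intermediate values it may be safer to keep the local <- assignments and add separate lines such as mf_debug <<- mf (a hypothetical name) after the steps of interest.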
I then tried renaming each of the mf instances to mf1, mf2, mf3, to see where mf is actually subsetted, but I got stuck because I keep getting errors (even though I thought I had kept the references between the mf objects consistent).
I also tried putting test <<- sum(mf$weights, na.rm=TRUE) here and there, but without success.
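One possible reason sum(mf$weights, na.rm=TRUE) gives nothing useful: in a model frame the weights are stored in a column named "(weights)", not "weights". A small sketch with hypothetical toy data (d and w are illustrative names):

```r
# Hypothetical toy data with one missing predictor value
d <- data.frame(x = c(NA, 1, 2, 3), y = c(0, 1, 3, 2), w = c(5, 1, 2, 3))
mf <- model.frame(y ~ x, data = d, weights = w)

names(mf)                        # "y" "x" "(weights)"
is.null(mf$weights)              # TRUE: there is no column called "weights"
sum(mf$weights, na.rm = TRUE)    # 0, because sum(NULL) is 0

# Either address the column by its actual name or use the accessor
sum(mf[["(weights)"]])           # 6
sum(model.weights(mf))           # 6
```

Inside a copy of lm, the vector w created by as.vector(model.weights(mf)) already holds these weights, so sum(w) after that line would be the corresponding total.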
Is there anyone who could help me out with summing the weights in the right place?