
I was trying to figure out how weighting in lm actually works and came across this 7.5-year-old question, which gives some insight into how weights work. The data from that question is partly copied and expanded on below.

I posted this related question on Cross Validated.

library(plyr)
set.seed(100)
df <- data.frame(uid=1:200,
                      bp=sample(x=c(100:200),size=200,replace=TRUE),
                      age=sample(x=c(30:65),size=200,replace=TRUE),
                      weight=sample(c(1:10),size=200,replace=TRUE),
                      stringsAsFactors=FALSE)

set.seed(100)
df.double_weights <- data.frame(uid=1:200,
                      bp=sample(x=c(100:200),size=200,replace=TRUE),
                      age=sample(x=c(30:65),size=200,replace=TRUE),
                      weight=2*df$weight,
                      stringsAsFactors=FALSE)

df.expand <- ddply(df,
                        c("uid"),
                        function(df) {
                          data.frame(bp=rep(df[,"bp"],df[,"weight"]),
                                     age=rep(df[,"age"],df[,"weight"]),
                                     stringsAsFactors=FALSE)})

df.lm <- lm(bp~age,data=df,weights=weight)
df.double_weights.lm <- lm(bp~age,data=df.double_weights,weights=weight)
df.expand.lm <- lm(bp~age,data=df.expand)

summary(df.lm)
summary(df.double_weights.lm)
summary(df.expand.lm)

These three data.frames consist of exactly the same data. However:

In df there are 200 observations, weighted to add up to 1178: sum(df$weight) == 1178.

In df.double_weights, the weights are simply doubled: sum(df.double_weights$weight) == 2356.

In df.expand there are, instead of 200 weighted observations, 1178 unweighted observations.
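
A quick size check confirms this (a small sketch, assuming the data frames defined above):

sum(df$weight)                 # 1178
sum(df.double_weights$weight)  # 2356
nrow(df.expand)                # 1178, one row per implied observation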

The coefficients for summary(df.lm) and summary(df.double_weights.lm) are the same, and so is the significance (which means that, IF THE WEIGHTING WORKS PROPERLY, the absolute size of the weights is irrelevant). EDIT: It seems, however, that the absolute size does matter; see the result at the bottom.

However, for summary(df.lm) and summary(df.expand.lm), the coefficients are the same, but the significance differs.
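
This can be checked directly from the fitted objects (a quick sketch, using the fits above):

all.equal(coef(df.lm), coef(df.double_weights.lm))  # TRUE: rescaling the weights leaves the coefficients unchanged
all.equal(coef(df.lm), coef(df.expand.lm))          # TRUE: the weighted and expanded fits give the same coefficients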

summary(df.lm)

            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 165.6545    10.3850  15.951   <2e-16 ***
age          -0.2852     0.2132  -1.338    0.183    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 98.84 on 198 degrees of freedom
Multiple R-squared:  0.008956,  Adjusted R-squared:  0.003951 
F-statistic: 1.789 on 1 and 198 DF,  p-value: 0.1825

summary(df.expand.lm)

             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 165.65446    4.26123   38.88  < 2e-16 ***
age          -0.28524    0.08749   -3.26  0.00115 ** 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 28.68 on 1176 degrees of freedom
Multiple R-squared:  0.008956,  Adjusted R-squared:  0.008114 
F-statistic: 10.63 on 1 and 1176 DF,  p-value: 0.001146

According to @IRTFM, the degrees of freedom are not being properly added up; IRTFM provided this code to fix it:

df.lm.aov <- anova(df.lm)

# Replace the residual Df (the last entry of the Df column) with the
# correct value: the sum of the weights (the total number of implied
# observations) minus the Df already used by the model terms, minus 1
# for the intercept.
df.lm.aov$Df[length(df.lm.aov$Df)] <-
        sum(df.lm$weights) -
        sum(df.lm.aov$Df[-length(df.lm.aov$Df)]) - 1

# Recompute the mean squares, the F statistic and its p-value using the
# corrected degrees of freedom.
df.lm.aov$`Mean Sq` <- df.lm.aov$`Sum Sq`/df.lm.aov$Df
df.lm.aov$`F value`[1] <- df.lm.aov$`Mean Sq`[1]/df.lm.aov$`Mean Sq`[2]
df.lm.aov$`Pr(>F)`[1] <- pf(df.lm.aov$`F value`[1], 1,
                            df.lm.aov$Df, lower.tail=FALSE)[2]
df.lm.aov

Analysis of Variance Table

Response: bp
            Df Sum Sq Mean Sq F value   Pr(>F)   
age          1   8741  8740.5  10.628 0.001146 **
Residuals 1176 967146   822.4                    

Now, almost 8 years later, this problem apparently still persists. (Does this not mean that almost all research that used weighted variables in combination with lm from R has significance values that are too low?) More practically, my problem is that I hardly understand what IRTFM is doing, or how it relates to multiple regression analysis (or to other functions that use lm under the hood).

QUESTION: Is there a more general way to solve this issue, that can be applied to multiple regression?

EDIT:

If we run IRTFM's solution on df.double_weights.lm, we get a different result, so apparently the absolute size of the weights DOES matter.
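
For reference, this is presumably the same correction applied to df.double_weights.lm (a sketch; dw.aov is just a name used here):

dw.aov <- anova(df.double_weights.lm)
dw.aov$Df[length(dw.aov$Df)] <-
        sum(df.double_weights.lm$weights) -
        sum(dw.aov$Df[-length(dw.aov$Df)]) - 1
dw.aov$`Mean Sq` <- dw.aov$`Sum Sq`/dw.aov$Df
dw.aov$`F value`[1] <- dw.aov$`Mean Sq`[1]/dw.aov$`Mean Sq`[2]
dw.aov$`Pr(>F)`[1] <- pf(dw.aov$`F value`[1], 1, dw.aov$Df, lower.tail=FALSE)[2]
dw.aov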

Analysis of Variance Table

Response: bp
            Df  Sum Sq Mean Sq F value    Pr(>F)    
age          1   17481 17481.0  21.274 4.194e-06 ***
Residuals 2354 1934293   821.7                      
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
  • This question is more about the decisions that drive mathematical operations under the hood of `lm`, and less about coding per se. I think you're likely to get a better answer posting on [Cross Validated](https://stats.stackexchange.com/). (Interesting question, btw!) – andrew_reece Dec 26 '20 at 16:28
  • The `Details` section of the `?lm` documentation addresses this quite explicitly (paragraph starts with "Non-'NULL'"). There are many reasons why one may use regression weights, including observations with different variances. In your 3rd case, you use "replication weights", and `lm` says that the degrees of freedom can be wrong, which obviously affects the p values. – Vincent Dec 26 '20 at 16:36
  • Thank you for the comments. My question is also essentially, how do I fix the degrees of freedom (which is also the actual question), but I will try to make that more clear. I think @IRTFM already provided a particular answer, but I do not understand the code. – Tom Dec 26 '20 at 16:38

1 Answer


If I understand your question correctly, what you have in your weights column is often called "frequency weights". They are used to save space in your dataset by indicating how many observations you have for each combination of covariates.

To estimate a model with an "aggregated" dataset and obtain correct standard errors, all you need to do is correct the number of degrees of freedom in your lm model.

The correct number of degrees of freedom is the total number of observations, minus the number of parameters in your model. This can be calculated by taking the sum of your weights variable or by looking at the total number of observations in the "full" data, and subtracting the number of parameters estimated (i.e., coefficients).
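
For instance, applied directly to the df.lm fit from the question, the correction is a single line (a sketch using the objects defined there; it works the same way with multiple predictors, because length(coef(...)) counts every estimated coefficient):

df.lm$df.residual <- sum(df.lm$weights) - length(coef(df.lm))
summary(df.lm)   # standard errors, t values and p-values now match summary(df.expand.lm)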

Here's a simpler example, which I think makes the point clearer:

library(dplyr)
library(modelsummary)

set.seed(1024)

# individual (true) dataset
x <- round(rnorm(1e5))
y <- round(x + x^2 + rnorm(1e5))
ind <- data.frame(x, y)

# aggregated dataset
agg <- ind %>%
  group_by(x, y) %>%
  summarize(freq = n())

models <- list( 
  "True"                = lm(y ~ x, data = ind),
  "Aggregated"          = lm(y ~ x, data = agg),
  "Aggregated & W"      = lm(y ~ x, data = agg, weights=freq),
  "Aggregated & W & DF" = lm(y ~ x, data = agg, weights=freq)
)

Now we want to correct the number of degrees of freedom of the last model in our list. We do this by taking the sum of our freq column. We could also use nrow(ind), since those are identical:

# correct degrees of freedom
models[[4]]$df.residual <- sum(agg$freq) - length(coef(models[[4]]))

Finally, we summarize all four models using the modelsummary package. Notice that the first and last models are exactly the same, even though the first was estimated using the full individual dataset and the last was estimated using the aggregated data:

modelsummary(models, fmt=5)
                    True         Aggregated   Aggregated & W   Aggregated & W & DF
(Intercept)         1.08446      5.51391      1.08446          1.08446
                    (0.00580)    (0.71710)    (0.22402)        (0.00580)
x                   1.00898      0.91001      1.00898          1.00898
                    (0.00558)    (0.30240)    (0.21564)        (0.00558)
Num.Obs.            1e+05        69           69               69
R2                  0.246        0.119        0.246            0.246
R2 Adj.             0.246        0.106        0.235            0.999
AIC                 405058.1     446.0        474.1            474.1
BIC                 405086.7     452.7        480.8            480.8
Log.Lik.            -202526.074  -219.977     -234.046         -234.046
F                   32676.664    9.056        21.894           32676.664
Vincent
  • Thank you very much for your answer Vincent. I see that your answer refers to the weights directly using `sum(agg$freq)`. Is that the difference from the more complicated solution proposed by IRTFM, namely that IRTFM corrects the df's from within the output? – Tom Dec 27 '20 at 08:50
  • If not, is there any way to produce the correct `df` from within the output (or simply by artificially adding the weights to the output)? I often use data which has NA's, which makes `sum(agg$freq)` provide the incorrect `df`'s (because it gives the sum of all weights instead of just the weights for the selected vars). I am writing a function now to correct for this, but I was hoping that there might be an easier solution. – Tom Dec 27 '20 at 08:51
  • I figured it out: I simply adapted the `lm` function. I added `sum_of_weights <<- sum(as.vector(model.weights(mf)))` underneath `w <- as.vector(model.weights(mf))`, and then added `z$df.residual <- sum_of_weights - length(coef(z))` before the last line `z`. – Tom Dec 27 '20 at 11:06
  • Yep, that should work. The idea is that you need to identify the number of observations used by the `lm` function to calculate your regression coefficients *after* list-wise deletion. Sounds like you found a way to do that. – Vincent Dec 27 '20 at 13:25
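
Following up on the last two comments, a less invasive variant of the same idea is to correct the fitted object after the fact instead of editing lm itself (a sketch; fix_freq_df is just a name made up here). It relies on the fact that fit$weights contains the weights only for the rows that were actually used in the fit, i.e. after list-wise deletion of NAs, so summing it gives the right count:

fix_freq_df <- function(fit) {
  # fit$weights holds the prior (frequency) weights for the rows kept
  # after NA handling, so their sum is the implied number of observations
  stopifnot(!is.null(fit$weights))
  fit$df.residual <- sum(fit$weights) - length(coef(fit))
  fit
}

summary(fix_freq_df(df.lm))  # corrected standard errors and p-values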