I was trying to figure out how weighting in lm actually works, and I came across this 7.5-year-old question, which gives some insight into how the weights work. The data from that question is partly copied and expanded on below.
I also posted this related question on Cross Validated.
library(plyr)  # for ddply()

# Base data: 200 observations, each with an integer weight from 1 to 10.
set.seed(100)
df <- data.frame(uid=1:200,
                 bp=sample(x=c(100:200),size=200,replace=TRUE),
                 age=sample(x=c(30:65),size=200,replace=TRUE),
                 weight=sample(c(1:10),size=200,replace=TRUE),
                 stringsAsFactors=FALSE)

# Same seed, so bp and age are identical to df; only the weights are doubled.
set.seed(100)
df.double_weights <- data.frame(uid=1:200,
                                bp=sample(x=c(100:200),size=200,replace=TRUE),
                                age=sample(x=c(30:65),size=200,replace=TRUE),
                                weight=2*df$weight,
                                stringsAsFactors=FALSE)

# Expanded data: repeat each row of df `weight` times, with no weights column.
df.expand <- ddply(df,
                   c("uid"),
                   function(df) {
                     data.frame(bp=rep(df[,"bp"],df[,"weight"]),
                                age=rep(df[,"age"],df[,"weight"]),
                                stringsAsFactors=FALSE)})

# Fit the same model three ways.
df.lm <- lm(bp~age,data=df,weights=weight)
df.double_weights.lm <- lm(bp~age,data=df.double_weights,weights=weight)
df.expand.lm <- lm(bp~age,data=df.expand)

summary(df.lm)
summary(df.double_weights.lm)
summary(df.expand.lm)
These three data.frames contain exactly the same data. However:
In df, there are 200 observations whose weights add up to 1178 (sum(df$weight) == 1178).
In df.double_weights, the weights are simply doubled (sum(df.double_weights$weight) == 2356).
In df.expand, there are 1178 unweighted observations instead of 200 weighted ones.
The coefficients for summary(df.lm) and summary(df.double_weights.lm) are the same, and so is the significance (which means that, IF THE WEIGHTING WORKS PROPERLY, the absolute size of the weights is irrelevant). EDIT: It seems, however, that the absolute size does matter; see the result at the bottom.
For summary(df.lm) and summary(df.expand.lm), however, the coefficients are the same but the significance differs.
summary(df.lm)
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 165.6545    10.3850  15.951   <2e-16 ***
age          -0.2852     0.2132  -1.338    0.183
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 98.84 on 198 degrees of freedom
Multiple R-squared: 0.008956, Adjusted R-squared: 0.003951
F-statistic: 1.789 on 1 and 198 DF, p-value: 0.1825
summary(df.expand.lm)
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 165.65446    4.26123   38.88  < 2e-16 ***
age          -0.28524    0.08749   -3.26  0.00115 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 28.68 on 1176 degrees of freedom
Multiple R-squared: 0.008956, Adjusted R-squared: 0.008114
F-statistic: 10.63 on 1 and 1176 DF, p-value: 0.001146
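As a quick check on where the gap comes from, here is a minimal sketch, assuming the weights are meant as frequency counts. Under that assumption the two fits share the same weighted residual sum of squares and differ only in residual degrees of freedom, so the coefficient standard errors should differ by a factor of sqrt(df_weighted / df_expanded). With the numbers above, 10.3850 * sqrt(198 / 1176) is about 4.2612, which matches the expanded fit. The block rebuilds comparable data in base R so that it runs without plyr; the names d, d_expanded, fit_w and fit_e are my own and not from the original question, and sample() may draw different values depending on the R version.

```r
# Rebuild comparable data in base R (no plyr needed).
set.seed(100)
d <- data.frame(bp  = sample(100:200, 200, replace = TRUE),
                age = sample(30:65,  200, replace = TRUE),
                w   = sample(1:10,   200, replace = TRUE))

# Expand: repeat each row w times, so the weights become literal row counts.
d_expanded <- d[rep(seq_len(nrow(d)), times = d$w), ]

fit_w <- lm(bp ~ age, data = d, weights = w)   # 200 weighted rows
fit_e <- lm(bp ~ age, data = d_expanded)       # sum(d$w) unweighted rows

se_w <- coef(summary(fit_w))[, "Std. Error"]
se_e <- coef(summary(fit_e))[, "Std. Error"]

# Ratio of residual degrees of freedom: (n - p) / (sum(w) - p).
ratio <- sqrt(df.residual(fit_w) / df.residual(fit_e))

# Point estimates agree; standard errors agree after rescaling by `ratio`.
same_coef <- all.equal(coef(fit_w), coef(fit_e))
same_se   <- all.equal(se_w * ratio, se_e)
print(c(same_coef = isTRUE(same_coef), same_se = isTRUE(same_se)))
```

If both checks hold, the difference in significance between the two summaries is purely a matter of which sample size (200 or the sum of the weights) is used for the residual degrees of freedom.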
According to @IRTFM, the degrees of freedom are not being added up properly. They provided this code to fix it:
df.lm.aov <- anova(df.lm)
df.lm.aov$Df[length(df.lm.aov$Df)] <-
    sum(df.lm$weights) -
    sum(df.lm.aov$Df[-length(df.lm.aov$Df)]) - 1
df.lm.aov$`Mean Sq` <- df.lm.aov$`Sum Sq` / df.lm.aov$Df
df.lm.aov$`F value`[1] <- df.lm.aov$`Mean Sq`[1] /
    df.lm.aov$`Mean Sq`[2]
df.lm.aov$`Pr(>F)`[1] <- pf(df.lm.aov$`F value`[1], 1,
                            df.lm.aov$Df, lower.tail=FALSE)[2]
df.lm.aov
Analysis of Variance Table

Response: bp
            Df Sum Sq Mean Sq F value   Pr(>F)
age          1   8741  8740.5  10.628 0.001146 **
Residuals 1176 967146   822.4
Now, almost 8 years later, this problem apparently still persists. (Does this not mean that almost all research that used weighted variables in combination with lm from R reports significance that is too low?) More practically, my problem is that I hardly understand what IRTFM's code is doing, or how it relates to multiple regression analysis (or even to other functions that use lm under the hood).
QUESTION: Is there a more general way to solve this issue that can be applied to multiple regression?
EDIT:
If we run IRTFM's solution on df.double_weights.lm, we get a different result, so apparently the absolute size of the weights DOES matter.
Analysis of Variance Table

Response: bp
            Df  Sum Sq Mean Sq F value    Pr(>F)
age          1   17481 17481.0  21.274 4.194e-06 ***
Residuals 2354 1934293   821.7
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
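To make the comparison concrete, here is a small sketch that recomputes both F statistics from the numbers in the two corrected ANOVA tables. As far as I can tell, the correction treats sum(weights) as the sample size, so doubling the weights doubles both sums of squares and roughly doubles the residual degrees of freedom: the residual mean square barely moves while the mean square for age doubles, and F roughly doubles with it. The list names and the helper f_stat are hypothetical and only for illustration.

```r
# Values copied from the two corrected ANOVA tables in this post.
single  <- list(ss_age = 8740.5,  ss_res = 967146,  df_res = 1176)  # weights as given
doubled <- list(ss_age = 17481.0, ss_res = 1934293, df_res = 2354)  # weights doubled

# F statistic for a one-predictor model: MS(age) / MS(residual), with 1 model df.
f_stat <- function(x) (x$ss_age / 1) / (x$ss_res / x$df_res)

f_single  <- f_stat(single)
f_doubled <- f_stat(doubled)

# p-values from the F distribution using the adjusted residual df.
p_single  <- pf(f_single,  1, single$df_res,  lower.tail = FALSE)
p_doubled <- pf(f_doubled, 1, doubled$df_res, lower.tail = FALSE)

print(c(F_single = f_single, F_doubled = f_doubled,
        ratio = f_doubled / f_single))
print(c(p_single = p_single, p_doubled = p_doubled))
```

So with this correction the result depends on the scale of the weights, which only makes sense if each weight really is a count of identical observations.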