
I have created a script like the one below to do what I call a "weighted" regression:

library(plyr)

set.seed(100)

temp.df <- data.frame(uid=1:200,
                      bp=sample(x=c(100:200),size=200,replace=TRUE),
                      age=sample(x=c(30:65),size=200,replace=TRUE),
                      weight=sample(c(1:10),size=200,replace=TRUE),
                      stringsAsFactors=FALSE)

# expand each row 'weight' times so every underlying observation gets its own row
temp.df.expand <- ddply(temp.df,
                        c("uid"),
                        function(df) {
                          data.frame(bp=rep(df[,"bp"],df[,"weight"]),
                                     age=rep(df[,"age"],df[,"weight"]),
                                     stringsAsFactors=FALSE)})

temp.df.lm <- lm(bp~age,data=temp.df,weights=weight)   # weighted fit, 200 rows
temp.df.expand.lm <- lm(bp~age,data=temp.df.expand)    # unweighted fit, 1178 rows

You can see that in `temp.df`, each row has its own weight. What I mean is that there are 1178 samples in total, but rows with the same bp and age are merged into one row, with the count kept in the weight column.
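To check the total, you can sum the weight column (with this seed it comes to the 1178 figure above):

sum(temp.df$weight)   # 1178 underlying observations in total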

I used the `weights` parameter of the `lm` function, then cross-checked the result against another data frame in which `temp.df` is "expanded". But I found the `lm` outputs differ between the two data frames.

Did I misinterpret the `weights` parameter of the `lm` function, and can anyone tell me how to run the regression properly (i.e. without expanding the data frame manually) for a dataset presented like `temp.df`? Thanks.

lokheart
  • The two regressions yield identical results for me. – Vincent Zoonekynd Apr 22 '12 at 14:25
  • See the `summary` output; they are different. – lokheart Apr 22 '12 at 14:26
  • The coefficients are the same, but the p-values are indeed different. I guess the following happens. When you expand the data, the observations are assumed to be independent: since there is a lot of data, you can be very confident in the estimates and the p-values are low. When using weights, the number of observations remains small, and the p-values are high. – Vincent Zoonekynd Apr 22 '12 at 14:32
  • But the two data frames refer to the same dataset (1178 rows of data), just presented differently: `temp.df` represents the 1178 rows of data using 200 rows. They should give the same p-value if the same regression is performed on both data frames. I need to get this fixed because in my case I may have more than 1 million rows; if I don't use the `weight` method, I may not have enough memory to store them all. – lokheart Apr 22 '12 at 14:40
  • @VincentZoonekynd - That does appear to be the case. The summaries show this directly (see the snippet after these comments). `summary(temp.df.lm)`: ... Residual standard error: 69.89 on 198 degrees of freedom ... `summary(temp.df.expand.lm)`: ... Residual standard error: 28.68 on 1176 degrees of freedom ... – Matthew Lundberg Apr 22 '12 at 17:14
  • This question should be moved to Cross Validated. – Carlos Cinelli Nov 12 '18 at 05:06
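To see in code what these comments describe (a quick sketch; the sigma and df values are the ones quoted above), the coefficients agree but the residual standard errors and degrees of freedom do not:

all.equal(coef(temp.df.lm), coef(temp.df.expand.lm))   # TRUE: same coefficients
summary(temp.df.lm)$sigma          # 69.89 (residual standard error, 198 df)
summary(temp.df.expand.lm)$sigma   # 28.68 (residual standard error, 1176 df)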

1 Answer


The problem here is that the degrees of freedom are not being added up properly, so the Df and mean-square statistics in the ANOVA table are wrong. This will correct the problem:

# Treat the weights as frequency counts: the residual Df should be the
# total number of underlying observations (the sum of the weights) minus
# the model degrees of freedom minus 1.
temp.df.lm.aov <- anova(temp.df.lm)
temp.df.lm.aov$Df[length(temp.df.lm.aov$Df)] <-
        sum(temp.df.lm$weights) -
        sum(temp.df.lm.aov$Df[-length(temp.df.lm.aov$Df)]) - 1
# Recompute the mean squares, the F statistic, and its p-value from the
# corrected Df (the trailing [2] picks the value computed with the
# residual Df).
temp.df.lm.aov$`Mean Sq` <- temp.df.lm.aov$`Sum Sq`/temp.df.lm.aov$Df
temp.df.lm.aov$`F value`[1] <- temp.df.lm.aov$`Mean Sq`[1]/
                                        temp.df.lm.aov$`Mean Sq`[2]
temp.df.lm.aov$`Pr(>F)`[1] <- pf(temp.df.lm.aov$`F value`[1], 1,
                                      temp.df.lm.aov$Df, lower.tail=FALSE)[2]
temp.df.lm.aov
Analysis of Variance Table

Response: bp
            Df Sum Sq Mean Sq F value   Pr(>F)   
age          1   8741  8740.5  10.628 0.001146 **
Residuals 1176 967146   822.4        

Compare with:

> anova(temp.df.expand.lm)
Analysis of Variance Table

Response: bp
            Df Sum Sq Mean Sq F value   Pr(>F)   
age          1   8741  8740.5  10.628 0.001146 **
Residuals 1176 967146   822.4                    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 
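For reuse, the same correction can be wrapped in a small helper. This is only a sketch following the code above, for a fit with a single model term and integer frequency weights; the name `anova_freq` is illustrative, not a standard function:

anova_freq <- function(fit) {
  aov.tab <- anova(fit)
  k <- length(aov.tab$Df)
  # residual Df = total frequency - model Df - 1
  aov.tab$Df[k] <- sum(fit$weights) - sum(aov.tab$Df[-k]) - 1
  aov.tab$`Mean Sq` <- aov.tab$`Sum Sq` / aov.tab$Df
  aov.tab$`F value`[1] <- aov.tab$`Mean Sq`[1] / aov.tab$`Mean Sq`[k]
  aov.tab$`Pr(>F)`[1] <- pf(aov.tab$`F value`[1], aov.tab$Df[1],
                            aov.tab$Df[k], lower.tail = FALSE)
  aov.tab
}

anova_freq(temp.df.lm)   # matches anova(temp.df.expand.lm)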

I am a bit surprised this has not come up more often on R-help. Either that or my search strategy development powers are weakening with old age.

IRTFM
  • There is an error in the upper block of code (`temp.df.lm.aov$'Mean Sq' <- temp.df.lm.aov$'Sum Sq'/temp.df.lm.aov$Df`). Note that the code did not correct the problem (the ANOVA tables differ). – gung - Reinstate Monica Jun 08 '13 at 02:28
  • I have attempted a correction. Please ensure you approve. Note that I used subsetting / indexing (ie, `[1]`), & it's not clear that's your style / as general as you may have wanted this to be. (However, the output does now match the output you wanted it to.) – gung - Reinstate Monica Jun 08 '13 at 02:43
  • There were syntactic errors (unmatched back-ticks) that I did not have the time to investigate. Thanks for trying to fix it. – IRTFM Jun 08 '13 at 03:42
  • You're welcome. The code, as fixed, doesn't match the correct version. E.g., p=.18 != p=.001; I thought you were trying to show that if the df were computed correctly, the `anova()` outputs would match. Was that not your point? – gung - Reinstate Monica Jun 08 '13 at 03:45
  • Yes. My point was actually that you needed to divide the sums of squares by the correct Df and then recalculate the F-statistic and the `Pr(>F)`. – IRTFM Jun 08 '13 at 04:02
  • Can someone share a reference on where the justification for this correction comes from? – Mark White Feb 07 '19 at 20:22
  • @MarkWhite You are asking about an answer that is almost 7 years old, and you don't say what part of it is unclear. Do you have a specific area of confusion? – IRTFM Feb 07 '19 at 22:30