
Any reason why the sum of the predicted values and the sum of the dependent variable are the same?

ctl <- c(4.17,5.58,5.18,6.11,4.50,4.61,5.17,4.53,5.33,5.14)
trt <- c(4.81,4.17,4.41,3.59,5.87,3.83,6.03,4.89,4.32,4.69)
group <- gl(2, 10, 20, labels = c("Ctl","Trt"))
weight <- c(ctl*100, trt*20)
lm.D9 <- glm(weight ~ group,family = gaussian())
summary(lm.D9)
y <- predict(lm.D9, newdata = data.frame(group), type = "response")

sum(weight)
sum(y)

Also, the dispersion is very high (in my actual data). Any leads on how to tackle this? My original data is for building a model of actual vs. expected. I have tried 2 different models:

  1. Ratio of Actual to Expected as the dependent variable, with a Gaussian GLM
  2. Actual minus Expected difference as the dependent variable.

But the dispersion in the second case is very high, and neither model is validating.

Help appreciated!


1 Answer


You have two groups; when you perform a linear regression on a single grouping factor, the predicted value for each observation is the mean of its group:

predict(lm.D9,newdata=data.frame(group=c("Ctl","Trt")))
     1      2 
503.20  93.22

You can check this:

tapply(weight,group,mean)
   Ctl    Trt 
503.20  93.22 

And if you sum up the predicted values, each group contributes (number of observations in the group) * (group mean), which adds back up to the sum of the values you started with.
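To make this concrete, here is a quick check using the objects defined in the question (the printed values below are what I would expect from that exact data):

sum(tapply(weight, group, mean) * table(group))  # 10*503.20 + 10*93.22
[1] 5964.2
sum(fitted(lm.D9))
[1] 5964.2
sum(weight)
[1] 5964.2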

We can check how the data look, and to me they look fine, with no crazy outliers:

boxplot(weight ~ group)

(boxplot of weight by group)

You can check out this post: the dispersion in a Gaussian GLM is the sum of squared residuals divided by the residual degrees of freedom, which is basically the squared deviation from your predicted values:

sum(residuals(lm.D9)^2)/lm.D9$df.residual
[1] 1825.962

Given that the mean of your data is 298.21, an average deviation of sqrt(1825.962) = 42.73128 is pretty reasonable.
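If you would rather read that dispersion straight off the model object instead of computing it by hand, summary.glm() stores the same estimate (again using the lm.D9 fit from above):

summary(lm.D9)$dispersion
[1] 1825.962
sqrt(summary(lm.D9)$dispersion)   # on the same scale as weight
[1] 42.73128
mean(weight)
[1] 298.21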

  • Thank you for the explanation. For my actual model the dispersion is way beyond the average. My average is around 500, while the dispersion is 10000. Any remedies for this? My data has a lot more outliers, but I can't remove or modify them. – user35655 Aug 25 '20 at 10:20
  • sqrt(10000) = 100, and this is still high? – StupidWolf Aug 25 '20 at 10:28
  • I meant sqrt(100000000), which is 10000 – user35655 Aug 27 '20 at 13:10
  • If your data is dispersed like this, there are 2 possibilities: should the values be logged, or are there some other factors you did not account for? Right now there's too little information; I can only speculate – StupidWolf Aug 27 '20 at 15:01
  • Thank you, yeah, I don't have the full data either. I just have sample data and was trying to build a model. Maybe you are right; I might just have partial information. – user35655 Aug 27 '20 at 17:29