I'm learning R and trying to understand how lm() handles factor variables & how to make sense of the ANOVA table. I'm fairly new to statistics, so please be gentle with me.

Here's some movie data from Rotten Tomatoes. I'm trying to model each movie's score using the mean score of its rating group. There are 4 groups: movies rated G, PG, PG-13, and R.

download.file("http://www.rossmanchance.com/iscam2/data/movies03RT.txt", destfile = "./movies.txt")
movies <- read.table("./movies.txt", sep = "\t", header = TRUE, quote = "")
lm1 <- lm(movies$score ~ as.factor(movies$rating))
anova(lm1)

and the ANOVA output:

## Analysis of Variance Table
## 
## Response: movies$score
##                           Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(movies$rating)   3    570     190    0.92   0.43
## Residuals                136  28149     207

I understand how to get all the numbers in this table, EXCEPT Sum Sq and Mean Sq for as.factor(movies$rating). Can someone please explain how that Sum Sq is calculated from my data? I know that Mean Sq is just Sum Sq divided by Df.

user7661

1 Answer

There are various ways to get that. One of them is to use the equation:

http://en.wikipedia.org/wiki/Sum_of_squares_(statistics)

SS_total = SS_reg + SS_error

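Here SS_reg is the Sum Sq for as.factor(movies$rating) and SS_error is the Sum Sq for Residuals, so you can get SS_reg by subtraction: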

y <- movies$score
# total sum of squares minus residual sum of squares
sum((y - mean(y))^2) - sum(residuals(lm1)^2)
liuminzhao

Please note that there is an ongoing quasi-religious war (at least a disagreement with vehemently argued opposing positions) between SAS and R authors regarding how to properly construct and partition sums-of-squares. – IRTFM Feb 13 '13 at 21:23