7

I have a data set that begins like this:

> d.weight
    R   N   P  C D.weight
1   1   0   0 GO     45.3
2   2   0   0 GO     34.0
3   3   0   0 GO     19.1
4   4   0   0 GO     26.6
5   5   0   0 GO     23.5
6   1  45   0 GO     22.1
7   2  45   0 GO     15.5
8   3  45   0 GO     23.4
9   4  45   0 GO     15.8
10  5  45   0 GO     42.9
...

and so on.

  • R is replicate and there are 5 of them (1-5).
  • N is nitrogen level, and there are 5 as well (0, 45, 90, 180, 360).
  • P is phosphorus level, and there are 5 as well (0, 35, 70, 140, 280).
  • C is plant combination, and there are 4 (GO, GB, LO, LB).
  • D.weight is dry weight in grams.

However, when I do an ANOVA I get the wrong degrees of freedom. I usually run my ANOVAs on subsets of that full set of data, but let's just do an analysis I wouldn't actually do otherwise, just so you can see that almost all of the Df (degrees of freedom) are wrong.

> example.aov=aov(D.weight ~ R+N+P+C, data=d.weight)
> summary(example.aov)
         Df Sum Sq Mean Sq F value  Pr(>F)    
R             1   1158    1158   9.484 0.00226 ** 
N             1    202     202   1.657 0.19900    
P             1  11040   11040  90.408 < 2e-16 ***
C             3  41032   13677 112.010 < 2e-16 ***
Residuals   313  38220     122

So, basically, the only one that's right is the C factor. Is it because it has letters instead of numbers?

I found somewhere that if I write interaction() with each term, I get the right Df, but I don't know if that's the right thing to do overall. For example:

> example.aov2=aov(D.weight ~ interaction(R)+interaction(N)+interaction(P)+interaction(C), data=d.weight)
> summary(example.aov2)
                Df Sum Sq Mean Sq F value   Pr(>F)    
interaction(R)   4   7423    1856  19.544 2.51e-14 ***
interaction(N)   4    543     136   1.429    0.224    
interaction(P)   4  13788    3447  36.301  < 2e-16 ***
interaction(C)   3  41032   13677 144.042  < 2e-16 ***
Residuals      304  28866      95

I tried it with the C factor only to see if it messed up anything:

> example.aov3=aov(D.weight ~ C, data=d.weight)
> summary(example.aov3)
             Df Sum Sq Mean Sq F value Pr(>F)    
C             3  41032   13677   85.38 <2e-16 ***
Residuals   316  50620     160                   
> 
> example.aov4=aov(D.weight ~ interaction(C), data=d.weight)
> summary(example.aov4)
                Df Sum Sq Mean Sq F value Pr(>F)    
interaction(C)   3  41032   13677   85.38 <2e-16 ***
Residuals      316  50620     160

And it looks the same. Should I be adding interaction() everywhere?

Ben Bolker
  • 211,554
  • 25
  • 370
  • 453
XGF
  • 73
  • 1
  • 5
  • 3
    convert your numeric variables to factors ... e.g. `facs <- c("R","N","P"); d_weight[facs] <- lapply(d.weight[facs],factor)` – Ben Bolker Oct 13 '14 at 16:04
  • WOW. I went crazy for like 1h trying to figure this out by myself, and you fixed it in less than 1 minute. THANK YOU. – XGF Oct 13 '14 at 16:10
  • Ok, that has fixed one problem but now it has caused another. When I try to make a line plot with N or P in the x-axis, since N and P are not continuous variables anymore, I get a bar plot instead... unsurprisingly enough. I do need the program to think of N and P as continuous in some instances, so... what should I do? – XGF Oct 13 '14 at 16:25
  • It depends; you could either leave the original data as numeric and use explicit `factor()` statements within your statistical model statement (e.g. `D.weight ~ factor(R)+factor(N)+factor(P)`), or make the new factor variables auxiliary variables (e.g. `fR`, `fN`, `fP`) instead of overwriting the old variables ... if you give a fully reproducible example of the kind of plot you're aiming for you might get more explicit help. – Ben Bolker Oct 13 '14 at 16:35
  • Seriously? Neither you nor your advisor knows how to write mathematical algorithms? Conversion to code is a minor point in the process. – Carl Witthoft Oct 13 '14 at 16:56
  • @BenBolker: thanks, either of those two options should do the trick! I would add them to your answer as extra choices. – XGF Oct 13 '14 at 17:09
  • 2
    @CarlWitthoft: I'm an agriculture student and I've only had one semester of stats. My advisor is 70 and almost blind, sooo... what can I say, I do what I can. – XGF Oct 13 '14 at 17:13
  • Actually, @BenBolker, what would be the code for making those auxiliary variables? – XGF Oct 13 '14 at 17:17

1 Answers1

6

R determines whether it should treat variables as categorical (ANOVA-type analysis) or continuous (regression-type analysis) by checking whether they are numeric or factor variables. Most simply, you can convert your predictor (independent) variables to factors via

facs <- c("R","N","P")
d.weight[facs] <- lapply(d.weight[facs],factor) 

If you want to create auxiliary variables instead of overwriting you could do something like

for (varname in facs) {
   d.weight[[paste0("f",varname)]] <- factor(d.weight[[varname]])
}

There might be a more compact way to do it but that should serve ...

Ben Bolker
  • 211,554
  • 25
  • 370
  • 453
  • follow up question: did you mean to create a new variable when you wrote "d_weight[facs]", or did you mean to write "d.weight[facs]"? – daOnlyBG Aug 18 '15 at 15:32