How to run ANOVA on a wide format data.frame?

Question

I've been taught to run an ANOVA with the formula: aov(dependent variable~independent variable, dataset)

but I am struggling with how to run an ANOVA for a particular dataset because it is broken up into three columns that each contain a value. The three columns are designated newborn, adolescent and adult (which is hamster age) and the values within each column represent blood pressure values. I need to run a test to determine if there is a relationship between blood pressure and age.

This is what the data looks like in R:

> hamster
   Newborn adolescent adult
1      108        110   105
2      110        105   100
3       90        100    95
4       80         90    85
5      100        102    97
6      120        110   105
7      125        105   100
8      130        115   110
9      120        100    95
10     130        120   115
11     145        130   125
12     150        125   120
13     130        135   130
14     155        130   125
15     140        120   115

I'm confused because the dependent variable is the set of values within each column, rather than a single column of its own.

Karolis Koncevičius · Answer 1 · 2021-03-30T11:51:54.513

10

R has a useful function called stack to convert your data format into the one needed for ANOVA.

aov(values ~ ind, stack(hamster))

# Call:
#
# aov(formula = values ~ ind, data = stack(hamster))
#
# Terms:
#                       ind Residuals
# Sum of Squares   1525.378 11429.867
# Deg. of Freedom         2        42
#
# Residual standard error: 16.49666
# Estimated effects may be unbalanced
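To see what stack() produces, and to get the F test that the printed aov object does not show, a sketch along these lines may help. It rebuilds the hamster data frame from the values in the question; the names long and fit are only illustrative.

```r
# Rebuild the wide data from the question (one column per age group)
hamster <- data.frame(
  Newborn    = c(108, 110, 90, 80, 100, 120, 125, 130, 120, 130, 145, 150, 130, 155, 140),
  adolescent = c(110, 105, 100, 90, 102, 110, 105, 115, 100, 120, 130, 125, 135, 130, 120),
  adult      = c(105, 100, 95, 85, 97, 105, 100, 110, 95, 115, 125, 120, 130, 125, 115)
)

# stack() returns two columns: values (the numbers) and ind (the source column name)
long <- stack(hamster)
str(long)

# Same model as above; summary() adds the F statistic and p-value
fit <- aov(values ~ ind, data = long)
summary(fit)
```

Note that this one-way model treats all 45 measurements as independent observations.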

edited Mar 30 '21 at 11:51

answered Apr 29 '18 at 23:34

score 8 · Answer 2 · answered Apr 29 '18 at 23:12

The first step is to rearrange your data so it's in a "long" format instead of a "wide" format. This can be done in base R using the reshape function, but it's much easier to use the gather function in the tidyr package:

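library(tidyr)
# Reshape wide to long: one row per measurement, with columns age and bp
result <- hamster %>%
  gather(age, bp) %>%
  aov(bp ~ age, .)  # "." passes the long data as the data argument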

Using tidyr also gives us the pipe operator (%>%), which lets you chain commands together in a readable way. By default, it takes the result of the previous function and inserts it as the first argument of the next function. In the aov call, we override this with the . placeholder, which explicitly passes the data set produced by gather as the second argument.
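For comparison, the base R reshape route mentioned above might look roughly like this. It is a sketch, not part of the original answer: the id column and the names long and fit are assumptions added for illustration, and the data frame is rebuilt from the values in the question.

```r
# Rebuild the wide data from the question, plus a hypothetical id column
hamster <- data.frame(
  id         = 1:15,
  Newborn    = c(108, 110, 90, 80, 100, 120, 125, 130, 120, 130, 145, 150, 130, 155, 140),
  adolescent = c(110, 105, 100, 90, 102, 110, 105, 115, 100, 120, 130, 125, 135, 130, 120),
  adult      = c(105, 100, 95, 85, 97, 105, 100, 110, 95, 115, 125, 120, 130, 125, 115)
)
ages <- c("Newborn", "adolescent", "adult")

# Wide to long: the three age columns become one bp column, labelled by age
long <- reshape(hamster, direction = "long",
                varying = ages, v.names = "bp",
                timevar = "age", times = ages, idvar = "id")

# Keep the age levels in chronological rather than alphabetical order
long$age <- factor(long$age, levels = ages)

fit <- aov(bp ~ age, data = long)
summary(fit)
```

The number of arguments reshape needs here is the main reason the tidyr version is usually considered easier to read.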

Melissa Key

Though understand that you are violating the assumption of independence by having repeated measures of the same hamster. – Elin Apr 29 '18 at 23:15
If that is the case, that's a different question entirely - and requires a different analysis tool (or at a minimum, an extra step or two to keep track of which animal is which). My understanding was that the OP wanted to rearrange the data set to make `aov` work. – Melissa Key Apr 29 '18 at 23:18
Yes that is what was asked, which you answered, and I upvoted the answer. However OP should know that what I said is an issue. – Elin Apr 29 '18 at 23:21

Len Greski · Answer 3 · 2018-04-30T01:06:19.543

Code to run a repeated measures analysis of variance with one within-subject variable and no between-subjects variables is as follows. Note that the hamster id number is kept when reshaping (the code calls group_by() from the dplyr package and gathers only the three age columns) so that we can use it in the error term of the ANOVA.

hamsterData <- "id   Newborn adolescent adult
1      108        110   105
2      110        105   100
3       90        100    95
4       80         90    85
5      100        102    97
6      120        110   105
7      125        105   100
8      130        115   110
9      120        100    95
10     130        120   115
11     145        130   125
12     150        125   120
13     130        135   130
14     155        130   125
15     140        120   115"

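hamster <- read.table(text = hamsterData, header = TRUE)
library(tidyr)
library(dplyr)
# Reshape to long format, keeping id so each row is one hamster at one age
result <- hamster %>% group_by(id) %>%
     gather(age, bp, Newborn, adolescent, adult)
# Order the age levels chronologically rather than alphabetically
result$age <- factor(result$age, levels = c("Newborn", "adolescent", "adult"))
options(contrasts = c("contr.sum", "contr.poly"))
# Error(factor(id)) adds an error stratum for hamster, so age is tested within hamsters
modelAOV <- aov(bp ~ age + Error(factor(id)), data = result)
summary(modelAOV)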

...and the output:

> modelAOV <- aov(bp ~ age + Error(factor(id)),data = result)
> summary(modelAOV)

Error: factor(id)
          Df Sum Sq Mean Sq F value Pr(>F)
Residuals 14  10013   715.2               

Error: Within
          Df Sum Sq Mean Sq F value  Pr(>F)    
age        2   1525   762.7   15.07 3.6e-05 ***
Residuals 28   1417    50.6                    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
>
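The same repeated measures model can probably be fitted without tidyr or dplyr. The sketch below swaps the reshaping step for base R stack() plus an id column built by hand; the names long, fit and within are illustrative only, and the data frame is rebuilt from the values above.

```r
# Wide data from the question (no id column yet)
hamster <- data.frame(
  Newborn    = c(108, 110, 90, 80, 100, 120, 125, 130, 120, 130, 145, 150, 130, 155, 140),
  adolescent = c(110, 105, 100, 90, 102, 110, 105, 115, 100, 120, 130, 125, 135, 130, 120),
  adult      = c(105, 100, 95, 85, 97, 105, 100, 110, 95, 115, 125, 120, 130, 125, 115)
)

# stack() appends the columns one after another, so the row index repeats once per column
long <- cbind(
  id = factor(rep(seq_len(nrow(hamster)), times = ncol(hamster))),
  stack(hamster)
)

# One within-subject factor (ind = age), with hamster id as the error stratum
fit <- aov(values ~ ind + Error(id), data = long)
summary(fit)

# Pull out the within-hamster table that holds the test for age
within <- summary(fit)[["Error: Within"]][[1]]
within
```

Comparing the outputs above, the one-way residual sum of squares (11429.867) is split here into a between-hamster part (10013) and a within-hamster part (1417). Age is tested against the smaller within-hamster part only, which is why the F value is so much larger once the repeated measurements are accounted for.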
