Formatting data for a chi square test in R

Question

I am trying to reformat my data to run a chi-square test in R. My data is set up with my independent variable in one column and the counts for my independent variable groups in two other columns. Here is an example of my data format:

> example <- data.frame(category = c("x","y","x","y"), true = c(2,4,6,3), false = c(7,9,3,5))
> example
  category true false
1        x    2     7
2        y    4     9
3        x    6     3
4        y    3     5

As far as I can tell, the chisq.test function can't handle data in this format, so I think I need to reshape the data to look like the "good example" below before running the function. My problem is that I'm not sure of an easy way to do this pivoting for a large data set.

> good_example <- data.frame(category = c('x','x','y','y','x','x','y','y'),
                           variable = c('true','false','true','false','true','false','true','false'),
                           count = c(2,7,4,9,6,3,3,5))
> good_example
  category variable count
1        x     true     2
2        x    false     7
3        y     true     4
4        y    false     9
5        x     true     6
6        x    false     3
7        y     true     3
8        y    false     5
> tab <- tapply(good_example$count, list(good_example$category, good_example$variable), FUN=sum)
> chisq.test(tab, correct = FALSE)

    Pearson's Chi-squared test

data:  tab
X-squared = 0.50556, df = 1, p-value = 0.4771

With `tidyr`, you can use `pivot_longer`, something like: `pivot_longer(example, cols = c("true", "false"), names_to = "variable", values_to = "count")` - Ben, Dec 19 '19 at 20:24
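
To spell out where the `pivot_longer` suggestion leads, here is a minimal sketch, assuming `tidyr` is installed. The names `long`, `tab` and `res` are just illustrative, and `xtabs` is swapped in for the `tapply` call from the question to build the contingency table.

```r
library(tidyr)

# Data from the question
example <- data.frame(category = c("x", "y", "x", "y"),
                      true = c(2, 4, 6, 3),
                      false = c(7, 9, 3, 5))

# One row per (category, variable) pair, matching the "good example" layout
long <- pivot_longer(example, cols = c("true", "false"),
                     names_to = "variable", values_to = "count")

# Sum the counts into a category-by-variable contingency table
tab <- xtabs(count ~ category + variable, data = long)

# Pearson chi-squared test without the continuity correction
res <- chisq.test(tab, correct = FALSE)
print(res)
```

This should reproduce the X-squared value shown in the question, since the table holds the same four sums.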

StupidWolf · Accepted Answer · 2019-12-19T20:53:18.060

If you just need to sum the true and false counts within each category (x and y), then:

tab = do.call(rbind,by(example[,-1],example$category,colSums))
chisq.test(tab,correct=FALSE)
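
It can help to look at the collapsed table before testing it. The following is a small self-contained sketch using the example data from the question; `res` is just an illustrative name, and printing the expected counts is an extra sanity check rather than something the test requires.

```r
# Data from the question
example <- data.frame(category = c("x", "y", "x", "y"),
                      true = c(2, 4, 6, 3),
                      false = c(7, 9, 3, 5))

# Split the count columns by category and sum each column,
# then stack the per-category sums into a 2 x 2 matrix
tab <- do.call(rbind, by(example[, -1], example$category, colSums))
print(tab)   # x: 8 true, 10 false; y: 7 true, 14 false

# Pearson chi-squared test without the continuity correction
res <- chisq.test(tab, correct = FALSE)
print(res$expected)   # expected counts under independence
print(res)
```

The row names come from the category levels and the column names from the count columns, so the matrix can be passed to `chisq.test` as is.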

A more compact version (pointed out by @markus) splits the data by category and applies sum to every column except the one used to split. The result is a data frame that still contains the category column, so drop that column before testing:

tab = aggregate(.~category, example, sum)
chisq.test(tab[,-1],correct=FALSE)

Or a dplyr version:

library(dplyr)
tab = example %>% group_by(category) %>% summarise_all(sum)
chisq.test(tab[,-1],correct=FALSE)
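
Another base R option, offered as a sketch rather than as part of the original answer: `xtabs` accepts a matrix of counts on the left-hand side of its formula, so the two count columns can be tabulated by category without reshaping first. The names `tab` and `res` are illustrative.

```r
# Data from the question
example <- data.frame(category = c("x", "y", "x", "y"),
                      true = c(2, 4, 6, 3),
                      false = c(7, 9, 3, 5))

# Each column of the cbind() matrix is treated as one level of a second
# factor, and the counts are summed within each category
tab <- xtabs(cbind(true, false) ~ category, data = example)
print(tab)

# Pearson chi-squared test without the continuity correction
res <- chisq.test(tab, correct = FALSE)
print(res)
```

Because the result is already a two-way table, there is no category column to drop before calling `chisq.test`.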
