I'm building an employee survey with two waves, and I want to make sure that each wave is balanced in terms of some demographic variables, such as ethnicity and gender. Here is a fictitious sample of the data:

library(tidyverse)
sample_data <- tibble(demographics = c("White / Female", "Non-White / Female", "White / Male", "Non-White / Male", "White / Transgender", "Non-White / Transgender"),
                      wave_1 = c(40, 38, 60, 56, 0, 2),
                      wave_2 = c(38, 39, 62, 58, 1, 0))

If I run chisq.test() on sample_data, I get an error:

library(stats)
chisq.test(sample_data)

Error in chisq.test(sample_data) : 
  all entries of 'x' must be nonnegative and finite

But I don't get the error if I just use the two count columns:

sample_data_count <- sample_data %>%
  dplyr::select(wave_1, wave_2)
chisq.test(sample_data_count)

    Pearson's Chi-squared test

data:  sample_data_count
X-squared = 3.1221, df = 5, p-value = 0.6812

Warning message:
In chisq.test(sample_data_count) :
  Chi-squared approximation may be incorrect

I understand that R objects to the character demographics column in sample_data, but I need to keep it if I want to look at the observed values by demographic group. Is there a way to run the chi-square test while keeping those labels as row names?

I saw an example at http://www.sthda.com/english/wiki/chi-square-test-of-independence-in-r using this dataset (file_path <- "http://www.sthda.com/sthda/RDoc/data/housetasks.txt") that runs a chi-square test in R with the row names still attached.

Any help would be appreciated!

aynber
J.Sabree

2 Answers

The error occurs because sample_data also includes a character column (demographics). According to ?chisq.test:

x - a numeric vector or matrix. x and y can also both be factors.

y - a numeric vector; ignored if x is a matrix. If x is a factor, y should be a factor of the same length.

To pass a numeric matrix, either select only the numeric columns, or move 'demographics' into the row names, convert the result to a matrix, and apply the test:

library(dplyr)
library(tibble)
sample_data %>% 
   column_to_rownames('demographics') %>%  # move the labels into row names
   as.matrix() %>%                         # numeric matrix that keeps the row names
   chisq.test()
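
As a rough sketch of why keeping the row names is useful: the object returned by chisq.test() carries the matrix's dimnames through to its observed and expected components, so the counts can still be read by demographic group. The data below is rebuilt from the question using base R only, and the names counts and res are illustrative, not from the original post.

```r
# Rebuild the question's counts as a named numeric matrix (base R only).
# matrix() fills column by column, so the first six values are wave_1.
counts <- matrix(
  c(40, 38, 60, 56, 0, 2,
    38, 39, 62, 58, 1, 0),
  ncol = 2,
  dimnames = list(
    c("White / Female", "Non-White / Female", "White / Male",
      "Non-White / Male", "White / Transgender", "Non-White / Transgender"),
    c("wave_1", "wave_2")
  )
)

# suppressWarnings() hides the small-expected-count warning for this sketch
res <- suppressWarnings(chisq.test(counts))

res$observed   # observed counts, labelled by demographic group
res$expected   # expected counts under independence, same labels
```

The same components are available on the result of the piped version above, since as.matrix() keeps the row names set by column_to_rownames().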
akrun

You can define your own function that runs the chi square on numeric columns only:

 my_chi <- function(df) chisq.test(as.matrix(df[, sapply(df, is.numeric)]))

So now you can do

my_chi(sample_data)
#> 
#>  Pearson's Chi-squared test
#> 
#> data:  as.matrix(df[, sapply(df, is.numeric)])
#> X-squared = 3.1221, df = 5, p-value = 0.6812
#> 
#> Warning message:
#> In chisq.test(as.matrix(df[, sapply(df, is.numeric)])) :
#>   Chi-squared approximation may be incorrect
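
The warning in this output most likely comes from the two Transgender rows, whose expected counts are very small. One possible way to deal with it, which is an addition here and not part of the original answer, is to ask chisq.test() for a Monte Carlo p-value through its simulate.p.value and B arguments. The sketch below rebuilds the counts in base R; the names counts and sim and the seed value are arbitrary choices.

```r
# Same counts as in the question, as a plain numeric matrix
counts <- matrix(
  c(40, 38, 60, 56, 0, 2,
    38, 39, 62, 58, 1, 0),
  ncol = 2,
  dimnames = list(NULL, c("wave_1", "wave_2"))
)

set.seed(1)  # arbitrary seed so the simulated p-value is reproducible

# Monte Carlo p-value based on B simulated tables; it does not rely on
# the chi-squared approximation, so df is reported as NA
sim <- chisq.test(counts, simulate.p.value = TRUE, B = 10000)

sim$statistic  # same X-squared statistic as the asymptotic test
sim$p.value    # simulated p-value
```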
Allan Cameron