I'm building an employee survey with two waves, and I want to make sure that each wave is balanced in terms of some demographic variables, such as ethnicity and gender. Here is a fictitious sample of the data:
library(tidyverse)
sample_data <- tibble(
  demographics = c("White / Female", "Non-White / Female", "White / Male",
                   "Non-White / Male", "White / Transgender", "Non-White / Transgender"),
  wave_1 = c(40, 38, 60, 56, 0, 2),
  wave_2 = c(38, 39, 62, 58, 1, 0)
)
If I run chisq.test() on sample_data, I get an error:
library(stats)
chisq.test(sample_data)
Error in chisq.test(sample_data) :
all entries of 'x' must be nonnegative and finite
But I don't get the error if I just use the two count columns:
sample_data_count <- sample_data %>%
  dplyr::select(wave_1, wave_2)
chisq.test(sample_data_count)
Pearson's Chi-squared test
data: sample_data_count
X-squared = 3.1221, df = 5, p-value = 0.6812
Warning message:
In chisq.test(sample_data_count) :
Chi-squared approximation may be incorrect
I understand that chisq.test() can't handle the character demographics column in sample_data, but dropping it makes it hard to look at the observed values by demographic group. Is there a way to run the chi-squared test while keeping those labels as row names?
I saw an example at http://www.sthda.com/english/wiki/chi-square-test-of-independence-in-r using this dataset (file_path <- "http://www.sthda.com/sthda/RDoc/data/housetasks.txt") that does run a chi-squared test in R with the row names still attached.
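For what it's worth, my reading of that tutorial is that the housetasks labels are stored as row names rather than as a regular column (I believe the file is loaded with row.names = 1, but I may be misreading it). Below is a minimal base R sketch of what I assume the equivalent would be for my data. The object names counts and result are made up for this illustration, and I rebuild the data as a plain data.frame so the snippet runs on its own:

```r
# Same fictitious data as above, as a plain data.frame so no packages are needed
sample_data <- data.frame(
  demographics = c("White / Female", "Non-White / Female", "White / Male",
                   "Non-White / Male", "White / Transgender", "Non-White / Transgender"),
  wave_1 = c(40, 38, 60, 56, 0, 2),
  wave_2 = c(38, 39, 62, 58, 1, 0)
)

# Keep only the numeric count columns and convert them to a matrix
counts <- as.matrix(sample_data[, c("wave_1", "wave_2")])

# Attach the demographic labels as row names instead of keeping them as a column
rownames(counts) <- sample_data$demographics

# The test now sees only numbers; the same approximation warning is still
# expected because some cells have very small counts
result <- chisq.test(counts)

# Observed and expected counts, labelled by demographic group
result$observed
result$expected
```

If this is right, the labels would travel with the matrix, so result$observed and result$expected can be read by demographic group without putting a character column into the test itself.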
Any help would be appreciated!