So let's say I have a one-column dataset: this column is a categorical variable with 5 levels (a, b, c, d, e). How can I compare the frequencies of the levels to each other? Is there a way to do so? Thank you.
I tried but couldn't work it out.
The table() function gives you counts. You can convert the table to a data.frame, if you want, and get proportions by dividing each count by the total number. Here is some dummy data, where group e is overrepresented:
# dummy data: 23 observations, with level e overrepresented
df <- data.frame(var = ordered(c(rep('a', 2), rep('b', 4),
                                 rep('c', 4), rep('d', 3), rep('e', 10))))
table(df$var)
 a  b  c  d  e
 2  4  4  3 10
Then we can calculate the proportion (relative frequency) of each group:
df_counts <- as.data.frame(table(df$var))
df_counts$prop <- df_counts$Freq/sum(df_counts$Freq)
print(df_counts)
  Var1 Freq       prop
1    a    2 0.08695652
2    b    4 0.17391304
3    c    4 0.17391304
4    d    3 0.13043478
5    e   10 0.43478261
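As a shortcut, base R's prop.table() does the division by the total in one step. Here is a minimal sketch that rebuilds the same dummy counts so it runs on its own (the names var, counts and props are just for illustration):

```r
# Same dummy data as above: 2, 4, 4, 3 and 10 observations for levels a-e
var <- factor(rep(c("a", "b", "c", "d", "e"), times = c(2, 4, 4, 3, 10)))

counts <- table(var)          # counts per level
props  <- prop.table(counts)  # each count divided by the total (23)

print(round(props, 3))
```

The proportions sum to 1, so they can be compared directly across levels regardless of sample size.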
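For statistical analysis, we can use a chi-squared goodness-of-fit test to check whether the observed counts are consistent with a null distribution in which every level is equally likely (the default when chisq.test is given a single vector of counts):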
chisq.test(df_counts$Freq)
    Chi-squared test for given probabilities

data:  df_counts$Freq
X-squared = 8.5217, df = 4, p-value = 0.0742
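To make the null hypothesis explicit, here is a small sketch that reproduces the test statistic by hand. With 23 observations and 5 equally likely levels, each expected count is 23/5 = 4.6; since that is below 5, R will typically warn that the chi-squared approximation may be inaccurate. The variable names are illustrative.

```r
observed <- c(a = 2, b = 4, c = 4, d = 3, e = 10)

# Under the null, every level is equally likely
expected <- rep(sum(observed) / length(observed), length(observed))  # 4.6 each

# Pearson's chi-squared statistic and its p-value on k - 1 degrees of freedom
x2 <- sum((observed - expected)^2 / expected)
p  <- pchisq(x2, df = length(observed) - 1, lower.tail = FALSE)

print(c(statistic = x2, p_value = p))
```

This should agree with the chisq.test output above (X-squared = 8.5217, p-value = 0.0742). Given the small expected counts, chisq.test(observed, simulate.p.value = TRUE) is an option for a Monte Carlo p-value instead of the asymptotic one.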
Not quite significant at the usual 0.05 level! Also, this doesn't tell us which group is overrepresented. For that, we can do a very stupid, brute-force permutation-style test (strictly speaking, a Monte Carlo simulation under the null that all levels are equally likely): randomly draw as many group labels as there are observations in the original data, repeat this 1000 times, and count how often the simulated count of each group is greater than, or less than, the observed count. If the simulations frequently give a larger count for a given group than is seen in your real data, that group is probably not overrepresented.
# initialize permutation count columns
df_counts$n_greater <- rep(0, nrow(df_counts))
df_counts$n_lesser <- rep(0, nrow(df_counts))
set.seed(123) # for reproducible "randomness"
# simulate 1000 random apportionments of group memberships to the observed number of trials
n_permut <- 1000
for (i in seq_len(n_permut)) {
  # random "draw" of group variables
  sim <- sample(df_counts$Var1, nrow(df), replace = TRUE)
  sim_df <- as.data.frame(table(sim))
  # for each group, was the number of randomized calls greater or lesser than observed?
  # increment counters accordingly
  df_counts$n_greater <- df_counts$n_greater + as.numeric(sim_df$Freq > df_counts$Freq)
  df_counts$n_lesser <- df_counts$n_lesser + as.numeric(sim_df$Freq < df_counts$Freq)
}
# the permutation test p-values are simply the proportion of simulations with greater or lesser counts
df_counts$p_greater <- df_counts$n_greater/n_permut
df_counts$p_lesser <- df_counts$n_lesser/n_permut
# we will use Bonferroni correction on the p-values, because of the multiple comparisons that we've performed
df_counts$p_greater <- p.adjust(df_counts$p_greater, method='bonferroni', n=nrow(df_counts) * 2)
df_counts$p_lesser <- p.adjust(df_counts$p_lesser, method='bonferroni', n=nrow(df_counts) * 2)
print(df_counts)
  Var1 Freq       prop n_greater n_lesser p_greater p_lesser
1    a    2 0.08695652       867       49      1.00     0.49
2    b    4 0.17391304       521      287      1.00     1.00
3    c    4 0.17391304       514      292      1.00     1.00
4    d    3 0.13043478       672      157      1.00     1.00
5    e   10 0.43478261         1      990      0.01     1.00
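The same simulation can also be sketched without an explicit loop, drawing all the simulated count vectors at once with rmultinom(). This is an alternative formulation, not the code that produced the table above, so the exact counts will differ slightly because the random numbers are consumed differently. The names observed, sims, p_greater and p_lesser are illustrative.

```r
set.seed(123)

observed <- c(a = 2, b = 4, c = 4, d = 3, e = 10)
n_sim <- 1000
k <- length(observed)

# Each column is one simulated data set: 23 draws spread over 5 equally likely levels
sims <- rmultinom(n_sim, size = sum(observed), prob = rep(1 / k, k))

# Comparing the 5 x n_sim matrix with the length-5 vector recycles `observed`
# down each column, so row i is always compared with observed[i]
p_greater <- rowMeans(sims > observed)
p_lesser  <- rowMeans(sims < observed)

# Bonferroni correction over the 2 * k one-sided comparisons, as in the loop version
result <- data.frame(
  level     = names(observed),
  p_greater = p.adjust(p_greater, method = "bonferroni", n = 2 * k),
  p_lesser  = p.adjust(p_lesser,  method = "bonferroni", n = 2 * k)
)
print(result)
```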
So by this rather basic method, group e has a significant Bonferroni-adjusted p-value for overrepresentation (0.01), and none of the other groups are significant either way.
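As a cross-check that avoids simulation, one option is an exact one-sided binomial test per level against the null probability 1/5. This is a different technique from the simulation above, offered as a sketch under the same equal-probability assumption. Note that binom.test counts outcomes at least as extreme as the observed one, whereas the loop above uses strict inequalities, so the p-values will not match exactly. The names observed, p_over and p_under are illustrative.

```r
observed <- c(a = 2, b = 4, c = 4, d = 3, e = 10)
n <- sum(observed)
k <- length(observed)

# One-sided exact binomial tests: is each level over- or underrepresented
# relative to a null probability of 1/k?
p_over <- sapply(observed, function(x) {
  binom.test(x, n, p = 1 / k, alternative = "greater")$p.value
})
p_under <- sapply(observed, function(x) {
  binom.test(x, n, p = 1 / k, alternative = "less")$p.value
})

# Bonferroni correction over all 2 * k one-sided tests
result <- data.frame(
  level   = names(observed),
  p_over  = p.adjust(p_over,  method = "bonferroni", n = 2 * k),
  p_under = p.adjust(p_under, method = "bonferroni", n = 2 * k)
)
print(result)
```

Because the level counts are not independent of each other, treating them as separate binomial tests is an approximation; the Bonferroni correction is conservative enough to remain valid under that dependence.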