So let's say I have a one-column dataset: the column is a categorical variable with 5 levels (a, b, c, d, e). How can I compare the frequency of each level with the others? Is there a way to do so? Thank you.

I tried but couldn't work it out.

learningr

1 Answer

The `table()` function gives you counts. You can convert the table to a data.frame, if you want, and get proportions by dividing each count by the total number of observations. Here is some dummy data, where group e is overrepresented:

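# dummy data: 23 observations of a 5-level factor, with level e overrepresented
df <- data.frame(var = ordered(c(rep('a', 2), rep('b', 4),
                                 rep('c', 4), rep('d', 3), rep('e', 10))))
# count the observations in each level
table(df$var)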
 a  b  c  d  e 
 2  4  4  3 10

Then we can calculate the proportion of each group:

df_counts <- as.data.frame(table(df$var))
df_counts$prop <- df_counts$Freq/sum(df_counts$Freq)
print(df_counts)
  Var1 Freq       prop
1    a    2 0.08695652
2    b    4 0.17391304
3    c    4 0.17391304
4    d    3 0.13043478
5    e   10 0.43478261
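
As a more compact alternative, counts and proportions can be put side by side with base R's `prop.table()`, along the lines of the one-liner suggested in the comments. This is only a sketch: the vector `x` and the name `summary_tab` are made up for illustration and simply reproduce the dummy counts above.

```r
# illustrative data: the same 23 observations as the dummy example
x <- rep(c("a", "b", "c", "d", "e"), times = c(2, 4, 4, 3, 10))

# counts per level
counts <- table(x)

# counts and proportions (rounded to 3 decimals) in one matrix
summary_tab <- cbind(Freq = counts, prop = round(prop.table(counts), 3))
print(summary_tab)
```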

For statistical analysis, we can use a chi-square goodness-of-fit test to ask whether the observed counts are consistent with a null distribution in which all groups are equally likely:

chisq.test(df_counts$Freq)
Chi-squared test for given probabilities

data:  df_counts$Freq
X-squared = 8.5217, df = 4, p-value = 0.0742

Not quite significant at the conventional 0.05 level! Also, this doesn't tell us which group is overrepresented. For that, we can do a very stupid, brute-force permutation test (strictly speaking, a Monte Carlo simulation under the equal-probability null): randomly draw group labels with equal probability, as many as there are observations in the original data, repeat this 1000 times, and count how often the simulated count of each group is greater than the observed count. If the randomization frequently gives a larger count for a given group than is seen in your real data, that group is probably not overrepresented.

# initialize permutation count columns
df_counts$n_greater <- rep(0, nrow(df_counts))
df_counts$n_lesser <- rep(0, nrow(df_counts))
set.seed(123)  # for reproducible "randomness"
# simulate 1000 random apportionments of group memberships to the observed number of trials
n_permut <- 1000
for (i in seq_len(n_permut)) {
  # random "draw" of group labels, each level equally likely
  sim <- sample(df_counts$Var1, nrow(df), replace = TRUE)
  sim_df <- as.data.frame(table(sim))
  # for each group, was the simulated count greater or lesser than observed?
  # increment counters accordingly (ties with the observed count are counted in neither column)
  df_counts$n_greater <- df_counts$n_greater + as.numeric(sim_df$Freq > df_counts$Freq)
  df_counts$n_lesser <- df_counts$n_lesser + as.numeric(sim_df$Freq < df_counts$Freq)
}
# the permutation test p-values are simply the proportion of simulations with greater or lesser counts
df_counts$p_greater <- df_counts$n_greater/n_permut
df_counts$p_lesser <- df_counts$n_lesser/n_permut
# we will use Bonferroni correction on the p-values, because of the multiple comparisons that we've performed
df_counts$p_greater <- p.adjust(df_counts$p_greater, method='bonferroni', n=nrow(df_counts) * 2)
df_counts$p_lesser <- p.adjust(df_counts$p_lesser, method='bonferroni', n=nrow(df_counts) * 2)
print(df_counts)
  Var1 Freq       prop n_greater n_lesser p_greater p_lesser
1    a    2 0.08695652       867       49      1.00     0.49
2    b    4 0.17391304       521      287      1.00     1.00
3    c    4 0.17391304       514      292      1.00     1.00
4    d    3 0.13043478       672      157      1.00     1.00
5    e   10 0.43478261         1      990      0.01     1.00

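So by this rather basic method, group e has a significant adjusted p-value for overrepresentation (0.01), and none of the other groups are significant either way. Because the loop uses strict comparisons, simulations that tie the observed count are ignored; counting them as well can only make the p-values larger.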
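
The same simulation can be sketched without an explicit loop by drawing all simulated count vectors at once with `rmultinom()`. This is a variant under stated assumptions, not the code that produced the table above: the names `observed`, `n_sim`, `sims`, and `result` are illustrative, the number of simulations is an arbitrary choice, and ties are counted as "at least as extreme" (`>=` and `<=`), which is the more conservative convention. Its adjusted p-values may therefore come out larger than the ones shown above.

```r
# observed counts from the dummy example
observed <- c(a = 2, b = 4, c = 4, d = 3, e = 10)
n_obs <- sum(observed)   # 23 observations
k <- length(observed)    # 5 levels

set.seed(123)            # for reproducible "randomness"
n_sim <- 10000           # arbitrary number of simulations (assumption)

# each column is one simulated set of counts with all levels equally likely
sims <- rmultinom(n_sim, size = n_obs, prob = rep(1 / k, k))

# per level: share of simulations at least as large / at least as small as observed
p_over <- rowMeans(sims >= observed)
p_under <- rowMeans(sims <= observed)

# Bonferroni correction over all 2 * k one-sided comparisons
p_adj <- p.adjust(c(p_over, p_under), method = "bonferroni")

result <- data.frame(level = names(observed),
                     count = as.vector(observed),
                     p_over = unname(p_adj[1:k]),
                     p_under = unname(p_adj[(k + 1):(2 * k)]))
print(result)
```

Working on the whole matrix of simulated counts avoids updating the data frame inside a loop, and handing all ten p-values to `p.adjust()` in one call makes the number of comparisons explicit.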

C. Murtaugh
  • `cbind(Freq=table(df), prop=prop.table(table(df)))` – Onyambu Jun 29 '23 at 23:05
  • Thanks! But can I get a p-value to see if any of these proportions is higher or lower than the others? – learningr Jun 30 '23 at 10:19
  • You can call `chisq.test(df_counts$Freq)` to perform a Chi-square test, which compares your distribution to a theoretical equal distribution, but that will just tell you that your distribution is uneven (or, if you were comparing it to another known distribution, it would tell you that the two were likely not the same). It won't tell you which group is driving the difference, and I'm not 100% sure what test *would* tell you that. This might be a question for the Stack Exchange [Cross Validated](https://stats.stackexchange.com/) site, which is stats-focused. – C. Murtaugh Jun 30 '23 at 15:22
  • Actually, one could do a permutation test, to ask how likely it is that a given group was scored as often as it was in a given number of trials. This is very stupid and simplistic, like me, but also doesn't make a lot of assumptions about the distribution of your data. I will amend my answer accordingly. – C. Murtaugh Jun 30 '23 at 15:50
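
The comments mention comparing the counts to another known distribution and ask which group drives the difference. A minimal sketch of both ideas with `chisq.test()` follows, assuming the dummy counts from the answer. The null probabilities are passed explicitly through the `p` argument (here equal, but any known distribution summing to 1 could be substituted), and the standardized residuals in the `stdres` component give a rough, informal indication of which groups deviate most. The names `observed`, `null_probs`, and `fit` are illustrative. Since the expected counts (23/5 = 4.6) are below 5, the plain test warns that the chi-squared approximation may be incorrect, so this sketch uses a simulated p-value instead.

```r
# observed counts from the dummy example
observed <- c(a = 2, b = 4, c = 4, d = 3, e = 10)

# null distribution: equal probabilities (swap in a known distribution if you have one)
null_probs <- rep(1 / length(observed), length(observed))

set.seed(123)  # the simulated p-value depends on random draws
fit <- chisq.test(observed, p = null_probs,
                  simulate.p.value = TRUE, B = 10000)

# overall test of the counts against the null distribution
print(fit$statistic)
print(fit$p.value)

# standardized residuals: large positive values point to overrepresented groups
print(round(fit$stdres, 2))
```

Residuals with an absolute value well above 2 are commonly read as notable deviations, but this is a descriptive aid and not a formal per-group test.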