Chi-Square Test of Independence in r

Question

I have a technical question related to my df structure. It looks like this:

    Month District   Age Gender Education Disability Religion                          Occupation JobSeekers GMI
1 2020-01      Dan   U17   Male      None       None   Jewish              Unprofessional workers          2   0
2 2020-01      Dan   U17   Male      None       None  Muslims          Sales and costumer service          1   0
3 2020-01      Dan   U17 Female      None       None    Other                           Undefined          1   0
4 2020-01      Dan 18-24   Male      None       None   Jewish         Production and construction          1   0
5 2020-01      Dan 18-24   Male      None       None   Jewish                     Academic degree          1   0
6 2020-01      Dan 18-24   Male      None       None   Jewish Practical engineers and technicians          1   0
  ACU NACU NewSeekers NewFiredSeekers
1   0    2          0               0
2   0    1          0               0
3   0    1          0               0
4   0    1          0               0
5   0    1          0               0
6   0    1          1               1

And I'm looking for a way to make an Chi-Square Test of Independence between 2 variables like district and JobSeekers so i can tell if Northern district related to jobseekers more than the southern for example. As far as i can tell, something is wrong with the data structure (District is a char and jobseekers is an integer which indicate how many jobseekers I have based on District, Gender, Occupation etc) I tried to subset it to district and jobseekers like this:

  Month   District  JobSeekers   GMI   ACU  NACU NewSeekers NewFiredSeekers
  <chr>   <chr>          <int> <int> <int> <int>      <int>           <int>
1 2020-01 Dan            33071  4694  9548 18829       6551            4682
2 2020-01 Jerusalem      21973  7665  3395 10913       3589            2260
3 2020-01 North          47589 22917  4318 20354       6154            3845
4 2020-01 Sharon         25403  6925  4633 13845       4131            2727
5 2020-01 South          37089 18874  2810 15405       4469            2342
6 2020-02 Dan            32660  4554  9615 18491       5529            3689

But it makes it harder to handle I'll accept any other test that will work of course.

Please help and let me know if there's more information that you need,

Moshe

Update

# t test for district vs new seekers

# sorting

dist.newseek <- Cdata %>% 
  group_by(Month,District) %>% 
  summarise(NewSeekers=sum(NewSeekers))

# performing a t test on the mini table we created

t.test(NewSeekers ~ District,data=subset(dist.newseek,District %in% c("Dan","South")))

# results

Welch Two Sample t-test

data:  NewSeekers by District
t = 0.68883, df = 4.1617, p-value = 0.5274
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
  -119952.3  200737.3
sample estimates:
  mean in group Dan mean in group South 
74608.25            34215.75 

#wilcoxon test 

# filtering Cdata to New seekers based on month and age

age.newseek <- Cdata %>% 
  group_by(Month,Age) %>% 
  summarise(NewSeekers=sum(NewSeekers))

#performing a wilcoxon test on the subset 

wilcox.test(NewSeekers ~ Age,data=subset(age.newseek,Age %in% c("25-34","45-54")))

# Results

Wilcoxon rank sum exact test

data:  NewSeekers by Age
W = 11, p-value = 0.4857
alternative hypothesis: true location shift is not equal to 0

ANOVA test

# Sorting occupation and month by new seekers

occu.newseek <- Cdata %>% 
  group_by(Month,Occupation) %>% 
  summarise(NewSeekers=sum(NewSeekers))

## Make the Occupation as a factor

occu.newseek$District <- as.factor(occu.newseek$Occupation)

## Get the occupation group means and standart deviations

group.mean.sd <- aggregate(
  x = occu.newseek$NewSeekers, # Specify data column
  by = list(occu.newseek$Occupation), # Specify group indicator
  FUN = function(x) c('mean'=mean(x),'sd'= sd(x))
)

## Run one way ANOVA test
anova_one_way <- aov(NewSeekers~ Occupation, data = occu.newseek)
summary(anova_one_way)

## Run the Tukey Test to compare the groups 
TukeyHSD(anova_one_way)

## Check the mean differences across the groups 

library(ggplot2)
ggplot(occu.newseek, aes(x = Occupation, y = NewSeekers, fill = Occupation)) +
  geom_boxplot() +
  geom_jitter(shape = 15,
              color = "steelblue",
              position = position_jitter(0.21)) +
  theme_classic()

Plot

you can't do a chi-square. JobSeekers is continuous. If you want to know whether North or South is associated with higher number of JobSeekers, do a t.test or Manney-U ? — StupidWolf, Jun 22 '20 at 21:49
Thanks so much for your answer @StupidWolf, Can you explain me how can i do this specific t test with the current table? If you find it hard to do, even a source with relavant example would be great. Thanks again for your attention! — Moshep, Jun 22 '20 at 22:37
I did. In the main dataframe each row is based on multiple sectors like age,religion and more, and than sum by job seekers, new seekers and so on, so i tried to redude sectors so it'll be easier. The whole dataframe contains details from January to April during the crisis caused by the coronsvirus, thats the main subject of the project - increase in unemployment and the link to occupation, districts and so on. Hope it's not too much worthless information and thanks again! — Moshep, Jun 22 '20 at 22:53
alright.. i think you needa think more about what is your hypothesis. I tried to provide a quick answer about what you can do. You can also check the other answer, it uses an anova. In theory it's the same, but comes with more assumptions, and yeah see whether it makes sense for your question — StupidWolf, Jun 22 '20 at 22:59

score 1 · Answer 1 · answered Jun 22 '20 at 22:56

You can't do a chi-square because JobSeekers is continuous, so if you want to know there is difference between North and South district, you can use a wilcoxon or t.test. It depends on your data. wilcoxon is ranked based and does not require your data to be normally distributed.

Assuming you have tallied the number of job seekers per district and month:

df = data.frame(Month=rep(c("2020-01","2020-02","2020-03","2020-04","2020-05","2020-06"),3),
District=rep(c("Dan","North","South"),each=6),JobSeekers=rpois(18,20))

A t.test is done like below, however if your samples are paired, for example you have 12 values for each month for North, and a corresponding 12 for south, then you need to set paired=FALSE, see this tutorial:

t.test(JobSeekers ~ District,data=subset(df,District %in% c("North","South")))

    Welch Two Sample t-test

data:  JobSeekers by District
t = 0.27455, df = 9.9435, p-value = 0.7893
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -3.560951  4.560951
sample estimates:
mean in group North mean in group South 
               21.5                21.0

If you are not sure whether your samples are normally distributed, use a wilcoxon:

wilcox.test(JobSeekers ~ District,data=subset(df,District %in% c("North","South")))

    Wilcoxon rank sum test with continuity correction

data:  JobSeekers by District
W = 19.5, p-value = 0.8721
alternative hypothesis: true location shift is not equal to 0

Hi @StupidWolf, thanks again for your attention and time. So I edited the original message with the tests I performed. I did two test based on your theory - the t test is using District and New seekers, and the second one using wilcoxon to check Age and New seekers. Is there any way i can show conclusions on plot? I tried to name it and call it by plot() but didn't succeed due to the characters. Let me thank you one last time, you literally saved me cause my deadline for present those info is for tommorow. — Moshep, Jun 23 '20 at 10:29
Hi @Moshep, don't edit your original question. The other user has answered the question, you have accepted it and these two should tally! If you have a new issue, please post a separate question. The purpose of SO is also to have a good repository of questions / answers. What you are doing totally defeats the purpose — StupidWolf, Jun 23 '20 at 10:51
I didn't mean to do that. in fact, everything remain the same and i didn't delete any info, just added the results. I prevent opening new question because i don't want to spam for no reason so i thought you can help in the current question but thanks anyway! — Moshep, Jun 23 '20 at 11:20

mustafaakben · Accepted Answer · 2020-06-23T14:06:48.293

You can use an ANOVA test to compare multiple groups. If you find any statistically significant results with the omnibus ANOVA test, then you can check which district is different better or worse.

You can also refer to UCLA's website that shows which tests one should use to test their data. The link is here.

As a quick example, let me put here how to run an ANOVA test.

Here is your data:

head(df)

r$> head(df)
    Month  District   Age Gender Education Disability Religion                          Occupation JobSeekers GMI ACU NACU NewSeekers NewFiredSeekers
1 2020-01       Dan 18-24   Male      None       Hard   Jewish Practical engineers and technicians          1   0   0    1          1               1
2 2020-01     North 18-24   Male      None       Hard   Jewish Practical engineers and technicians          1   0   0    1          1               1
3 2020-01     North 18-24   Male      None       Hard   Jewish Practical engineers and technicians          1   0   0    1          1               1
4 2020-01     South 18-24   Male      None       Hard   Jewish Practical engineers and technicians          1   0   0    1          1               1
5 2020-01       Dan 18-24   Male      None       Hard   Jewish Practical engineers and technicians          1   0   0    1          1               1
6 2020-01 Jerusalem 18-24   Male      None       Hard   Jewish Practical engineers and technicians          1   0   0    1          1               1

Because I need more data points to carry out the test, I replicated your data by bootstrapping. I also increased the number of job seekers in the South and North regions. You do not need to do the following steps in your data. But this is how I did.

# For the sake of this example, I increased the number of observation by bootstrapping the example data

for(i in 1:20) df <- rbind(df[sample(6, 5), ],df)
rownames(df) <- 1:nrow(df)
df$District <- sample(c("Jerusalem", "North", "Sharon", "South", "Dan"), nrow(df),replace = T)
df$JobSeekers[df$District == "North"] <- sample(1:3,length(df$JobSeekers[df$District == "North"]),replace=T,p=c(0.1,0.5,0.4))
df$JobSeekers[df$District == "South"] <- sample(4:6,length(df$JobSeekers[df$District == "South"]),replace=T,p=c(0.1,0.5,0.4))

It is preferable to make the character as a factor when you are analyzing categorical variables. By doing this, you can control the level of factors.

## Make the District as a factor

df$District <- as.factor(df$District)

Next, get the group means and standard deviations to see whether there are any meaningful differences across the group. As you see, I changed the south and north districts, so they have the highest mean scores compared to the rest districts.

## Get the group means and standart deviations
    group.mean.sd <- aggregate(
        x = df$JobSeekers, # Specify data column
        by = list(df$District), # Specify group indicator
        FUN = function(x) c('mean'=mean(x),'sd'= sd(x))
    )

r$> group.mean.sd
    Group.1    x.mean      x.sd
1       Dan 1.1000000 0.3077935
2 Jerusalem 1.0000000 0.0000000
3     North 2.3225806 0.5992827
4    Sharon 1.1363636 0.3512501
5     South 5.2380952 0.4364358

Finally, you can run an ANOVA test and Tukey test as follow.

## Run one way ANOVA test
    anova_one_way <- aov(JobSeekers~ District, data = df)
    summary(anova_one_way)

r$> summary(anova_one_way)
             Df Sum Sq Mean Sq F value Pr(>F)
District      4 260.09   65.02   346.1 <2e-16 ***
Residuals   101  18.97    0.19
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

## Run the Tukey Test to compare the groups 
    TukeyHSD(anova_one_way)

r$> Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = JobSeekers ~ District, data = df)

$District
                        diff        lwr        upr     p adj
Jerusalem-Dan    -0.10000000 -0.5396190  0.3396190 0.9695592
North-Dan         1.22258065  0.8772809  1.5678804 0.0000000
Sharon-Dan        0.03636364 -0.3356042  0.4083315 0.9987878
South-Dan         4.13809524  3.7619337  4.5142567 0.0000000
North-Jerusalem   1.32258065  0.9132542  1.7319071 0.0000000
Sharon-Jerusalem  0.13636364 -0.2956969  0.5684241 0.9048406
South-Jerusalem   4.23809524  3.8024191  4.6737714 0.0000000
Sharon-North     -1.18621701 -1.5218409 -0.8505932 0.0000000
South-North       2.91551459  2.5752488  3.2557803 0.0000000
South-Sharon      4.10173160  3.7344321  4.4690311 0.0000000

Lastly, you can plot which district has the highest job seekers with a bar plot.

## Check the mean differences across the groups 

library(ggplot2)
ggplot(df, aes(x = District, y = JobSeekers, fill = District)) +
    geom_boxplot() +
    geom_jitter(shape = 15,
        color = "steelblue",
        position = position_jitter(0.21)) +
    theme_classic()

Update

Based on your update, you can use the following syntax to abbreviate the x labels and change the legends.

library(stringr)
library(ggplot2)
ggplot(occu.newseek, aes(x = Occupation, y = NewSeekers, fill = str_wrap(Occupation,10))) +
    geom_boxplot() +
    geom_jitter(
        shape = 19,
        color = "black",
        position = position_jitter(0.21)
    ) +
     scale_x_discrete(
        labels =
            c(
                "Academic degree" = "Academic",
                "Practical engineers and technicians" = "Engineering",
                'Production and construction'='Production',
                "Sales and costumer service" = "Sales",
                "Unprofessional workers" = "Unprofessional",
                "Undefined" = "Undefined"
            )
    ) +
    labs(fill = "Occupation") +
        theme_classic()+
        theme(
            axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1), legend.key.height=unit(2, "cm")
            #legend.position = "top",
            
        )

You should get a graph like that.

Hi @mustafaakben, many thanks for your detailed answer brother. I understand all the process and follow through it. As i perform all the relevant steps (ignoring the first ones as i have the dataset) and tried to plot it, i get weird results. Can you take a look in the original message and tell me what's wrong? Thank you so much! — Moshep, Jun 23 '20 at 10:33
Hi @Moshep. In terms of the syntax, I think one of the problematic parts is the code in which you sorted the occupation and month by new seekers ('# Sorting occupation and month by new seekers'). Omit this part and run the rest of the analysis. It might work. In terms of the graph, the 'x' labels and the legends might create a problem, because your labels are so lengthy. Try to use abbreviations for the labels and change the arrangement a little bit. For more detail see ggplot2 website [link](https://ggplot2.tidyverse.org/index.html) — mustafaakben, Jun 23 '20 at 14:00
For now, try to run the updated code for the graph. As I said, do **not** use the code ``` occu.newseek <- Cdata %>% group_by(Month,Occupation) %>% summarise(NewSeekers=sum(NewSeekers))``` — mustafaakben, Jun 23 '20 at 14:03

Chi-Square Test of Independence in r

ANOVA test

2 Answers2

Update