I have a technical question related to my df structure. It looks like this:
Month District Age Gender Education Disability Religion Occupation JobSeekers GMI
1 2020-01 Dan U17 Male None None Jewish Unprofessional workers 2 0
2 2020-01 Dan U17 Male None None Muslims Sales and costumer service 1 0
3 2020-01 Dan U17 Female None None Other Undefined 1 0
4 2020-01 Dan 18-24 Male None None Jewish Production and construction 1 0
5 2020-01 Dan 18-24 Male None None Jewish Academic degree 1 0
6 2020-01 Dan 18-24 Male None None Jewish Practical engineers and technicians 1 0
ACU NACU NewSeekers NewFiredSeekers
1 0 2 0 0
2 0 1 0 0
3 0 1 0 0
4 0 1 0 0
5 0 1 0 0
6 0 1 1 1
I'm looking for a way to run a chi-square test of independence between two variables, such as District and JobSeekers, so I can tell whether, for example, the Northern district is more strongly associated with job seekers than the Southern one. As far as I can tell, something is wrong with the data structure: District is a character column, while JobSeekers is an integer count of the job seekers in each combination of District, Gender, Occupation, etc. I tried reducing the data to district-level totals like this:
Month District JobSeekers GMI ACU NACU NewSeekers NewFiredSeekers
<chr> <chr> <int> <int> <int> <int> <int> <int>
1 2020-01 Dan 33071 4694 9548 18829 6551 4682
2 2020-01 Jerusalem 21973 7665 3395 10913 3589 2260
3 2020-01 North 47589 22917 4318 20354 6154 3845
4 2020-01 Sharon 25403 6925 4633 13845 4131 2727
5 2020-01 South 37089 18874 2810 15405 4469 2342
6 2020-02 Dan 32660 4554 9615 18491 5529 3689
But that makes it harder to handle. Of course, I'll accept any other test that works.
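To make the goal concrete: a chi-square test of independence needs a contingency table of counts for two categorical variables, and JobSeekers is itself a count, so it can serve as the cell frequency rather than as one of the two variables. The following is only a minimal sketch under that assumption. The `toy` data frame and its values are made up for illustration, and pairing District with Gender is an arbitrary choice of second categorical variable, not something fixed by the data.

```r
# Toy data in the same long format as Cdata.
# All values here are invented purely for illustration.
toy <- data.frame(
  District   = c("North", "North", "South", "South", "North", "South"),
  Gender     = c("Male", "Female", "Male", "Female", "Male", "Female"),
  JobSeekers = c(30L, 20L, 10L, 40L, 5L, 15L)
)

# xtabs() sums the JobSeekers counts within each District x Gender cell,
# which yields the contingency table that chisq.test() expects.
tab <- xtabs(JobSeekers ~ District + Gender, data = toy)
print(tab)

# Chi-square test of independence between District and Gender,
# treating every job seeker as one observation.
res <- chisq.test(tab)
print(res)
```

With the real data, the same idea would presumably be `xtabs(JobSeekers ~ District + Gender, data = Cdata)`, possibly restricted to a single month so that the same person is not counted several times.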
Please help and let me know if there's more information that you need,
Moshe
Update
# t-test for district vs. new seekers
library(dplyr)
# aggregating new seekers by month and district
dist.newseek <- Cdata %>%
  group_by(Month, District) %>%
  summarise(NewSeekers = sum(NewSeekers))
# performing a t-test on the summary table we created
t.test(NewSeekers ~ District, data = subset(dist.newseek, District %in% c("Dan", "South")))
# results
Welch Two Sample t-test
data: NewSeekers by District
t = 0.68883, df = 4.1617, p-value = 0.5274
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-119952.3 200737.3
sample estimates:
mean in group Dan mean in group South
74608.25 34215.75
# Wilcoxon test
# aggregating new seekers in Cdata by month and age
age.newseek <- Cdata %>%
  group_by(Month, Age) %>%
  summarise(NewSeekers = sum(NewSeekers))
# performing a Wilcoxon test on the summary table
wilcox.test(NewSeekers ~ Age, data = subset(age.newseek, Age %in% c("25-34", "45-54")))
# Results
Wilcoxon rank sum exact test
data: NewSeekers by Age
W = 11, p-value = 0.4857
alternative hypothesis: true location shift is not equal to 0
# ANOVA test
# Aggregating new seekers by month and occupation
occu.newseek <- Cdata %>%
  group_by(Month, Occupation) %>%
  summarise(NewSeekers = sum(NewSeekers))
## Make Occupation a factor
occu.newseek$Occupation <- as.factor(occu.newseek$Occupation)
## Get the occupation group means and standard deviations
group.mean.sd <- aggregate(
  x = occu.newseek$NewSeekers,         # Specify data column
  by = list(occu.newseek$Occupation),  # Specify group indicator
  FUN = function(x) c('mean' = mean(x), 'sd' = sd(x))
)
## Run one way ANOVA test
anova_one_way <- aov(NewSeekers~ Occupation, data = occu.newseek)
summary(anova_one_way)
## Run the Tukey Test to compare the groups
TukeyHSD(anova_one_way)
## Check the mean differences across the groups
library(ggplot2)
ggplot(occu.newseek, aes(x = Occupation, y = NewSeekers, fill = Occupation)) +
  geom_boxplot() +
  geom_jitter(shape = 15,
              color = "steelblue",
              position = position_jitter(0.21)) +
  theme_classic()