
OK, straight to the question. I have a database with lots and lots of categorical variables.

A sample database with a few of the variables is below:

gender <- as.factor(sample(letters[6:7], 100, replace = TRUE, prob = c(0.2, 0.8)))
smoking <- as.factor(sample(c(0, 1), size = 100, replace = TRUE, prob = c(0.6, 0.4)))
alcohol <- as.factor(sample(c(0, 1), size = 100, replace = TRUE, prob = c(0.3, 0.7)))
htn <- as.factor(sample(c(0, 1), size = 100, replace = TRUE, prob = c(0.2, 0.8)))
tertile <- as.factor(sample(c(1, 2, 3), size = 100, replace = TRUE, prob = c(0.3, 0.3, 0.4)))
# data.frame() keeps the columns as factors; as.data.frame(cbind(...)) would coerce them to integer codes
df <- data.frame(gender, smoking, alcohol, htn, tertile)

I want to test the hypothesis, using a chi-square test, that the proportion of smokers, alcohol users, people with hypertension (htn), etc. differs by tertile (a factor with 3 levels). I then want to extract the p-value for each variable.

I know I can test each variable individually using a 2-by-3 cross-tabulation, but is there more efficient code that derives the test statistic and p-value for all variables in one go, so that I can extract the p-value for each variable?

Thanks in advance

Anoop

user3919790
  • Exactly what type of statistical test do you want to perform here? Knowing that should help us tell you how to implement it. There are many, many ways to compute p-values depending on the test you want to use (and some are more statistically appropriate than others). If you're not sure which test you ought to perform, you may wish to seek statistical advice on [stats.se] first. – MrFlick Sep 29 '14 at 19:33
  • Hi there, sorry for the lack of clarification. It's a chi-square test. I have updated the question. – user3919790 Sep 29 '14 at 21:01
  • So you want 4 different 2-way chi-square test p-values in this example? – MrFlick Sep 29 '14 at 21:16
  • Yes, exactly, but rather than running the test four times, is there a way to loop the code so that R does it automatically across all the categorical variables? – user3919790 Sep 29 '14 at 21:24

2 Answers

If you want to do all the comparisons in one statement, you can do

mapply(function(x, y) chisq.test(x, y)$p.value, df[, -5], MoreArgs=list(df[,5]))
#    gender   smoking   alcohol       htn 
# 0.4967724 0.8251178 0.5008898 0.3775083 

Of course, running several tests this way inflates the overall chance of a false positive, so some correction is required to maintain an appropriate type 1 error rate.
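As a rough illustration of such a correction (this is not part of the original answer), base R's `p.adjust()` can adjust a vector of p-values. The sketch below assumes the four p-values from the example output above; the choice of the Bonferroni and Holm methods is an assumption made for illustration, not a recommendation from the answer.

```r
# p-values copied from the example output above; real values will differ
# because the sample data are random.
p_raw <- c(gender = 0.4967724, smoking = 0.8251178,
           alcohol = 0.5008898, htn = 0.3775083)

# Bonferroni: multiply each p-value by the number of tests, capped at 1.
p_bonf <- p.adjust(p_raw, method = "bonferroni")

# Holm: a step-down procedure that also controls the family-wise error rate
# and is never less powerful than Bonferroni.
p_holm <- p.adjust(p_raw, method = "holm")

print(cbind(raw = p_raw, bonferroni = p_bonf, holm = p_holm))
```

With these particular inputs every adjusted value is capped at 1, since even the smallest raw p-value exceeds 0.25.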

MrFlick
  • "some correction is required to maintain an appropriate type 1 error rate." - could one divide the alpha level by the number of tests performed in order to mitigate this issue? – orrymr Feb 22 '19 at 11:55
  • @orrymr That’s called a Bonferroni correction. It is pretty common but it is also very conservative. You can google it to find out more. – MrFlick Feb 22 '19 at 14:48

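You can run the following code chunk to get the full test result for each variable (with simulate.p.value = TRUE, the p-value is computed by Monte Carlo simulation rather than from the asymptotic chi-square distribution):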

lapply(df[,-5], function(x) chisq.test(table(x,df$tertile), simulate.p.value = TRUE))

To get just the p-values:

lapply(df[,-5], function(x) chisq.test(table(x,df$tertile), simulate.p.value = TRUE)$p.value)

To collect the p-values in a data frame:

data.frame(lapply(df[,-5], function(x) chisq.test(table(x,df$tertile), simulate.p.value = TRUE)$p.value))
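Since the question also asks for the test statistic, here is a minimal sketch that collects the statistic, degrees of freedom, and p-value for each variable in one data frame. It assumes the asymptotic test (no `simulate.p.value`), because the simulated version reports no degrees of freedom; the seed, the `predictors` and `results` names, and the column names are hypothetical choices for illustration.

```r
set.seed(42)  # assumed seed, only so the sketch is reproducible
n <- 100
df <- data.frame(
  gender  = factor(sample(c("f", "g"), n, replace = TRUE, prob = c(0.2, 0.8))),
  smoking = factor(sample(c(0, 1), n, replace = TRUE, prob = c(0.6, 0.4))),
  alcohol = factor(sample(c(0, 1), n, replace = TRUE, prob = c(0.3, 0.7))),
  htn     = factor(sample(c(0, 1), n, replace = TRUE, prob = c(0.2, 0.8))),
  tertile = factor(sample(1:3, n, replace = TRUE, prob = c(0.3, 0.3, 0.4)))
)

# Every column except the grouping variable is tested against tertile.
predictors <- setdiff(names(df), "tertile")

# One row per variable: chi-square statistic, degrees of freedom, p-value.
results <- do.call(rbind, lapply(predictors, function(v) {
  test <- chisq.test(table(df[[v]], df$tertile))
  data.frame(
    variable  = v,
    statistic = unname(test$statistic),
    df        = unname(test$parameter),
    p_value   = test$p.value
  )
}))

print(results)
```

Selecting the predictors by name rather than by position (`df[, -5]`) keeps the code working if the column order changes.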

Thanks to RPubs for the inspiration: http://www.rpubs.com/kaz_yos/1204

Mehmet Yildirim