1

I am fairly new to R and I am trying to run a kruskal wallis test to see if there is a difference between three groups when looking at different genes. I have 3 groups and 127 proteins. I have been able to create a code that will do this,

sample_data"

    groups <- c("control","control","control","control","control","group1","group1","group1","group1","group1","group1","group1","group1","group1","group1","group1","group1","group1","group2","group2","group2","group2","group2","group2","group2","group2")
gene1 <- c(8,7,4,5,0,2,8,5,6,4,4,6,5,4,6,4,7,4,8,1,6,3,5,6,3,1)
gene2 <- c(8,10,10,9,7,5,8,10,8,9,10,9,6,9,8,7,8,7,8,9,9,7,7,6,9,8)
gene3 <- c(10,11,10,11,5,6,9,11,10,11,12,8,4,7,7,10,10,3,2,11,9,10,9,3,10,10)
gene4 <- c(4,4,3,2,0,2,4,4,3,3,4,1,1,1,4,4,3,2,3,4,4,1,4,3,2,2)
gene5 <- c(8,10,11,10,7,6,8,8,8,12,11,8,7,8,8,10,10,9,10,8,10,7,8,7,10,7)
mydata <- data.frame(groups,gene1,gene2,gene3,gene4,gene5)

    i <- 2  #ignore 1st column as this is not a "protein"
pval <-NULL
repeat{
    K <- kruskal.test(df[,i], df[,1], data = df, paired=FALSE, p.adjust.methods="none")
    pval <- c(as.matrix(sapply(K[3],as.numeric)),pval)
    i <- i+1
    if(i>ncol(df)){break}
}

unfortunately the pvalue obtained is different than what I get doing a kruskal wallis test on just one gene at a time. For example:

For Gene1, the pvalue obtained from the loop was 0.0389 but when I run kruskal.test(Gene1,group, data=df) I get a pvalue of 0.84.

I came across this because after doing the kruskal wallist test I proceeded with a pairwise Mann Whitney test and noticed that the "significant" pvalues for Kruskal wallis did not correlate with the "significant" pvalues for Mann Whitney.

Furthermore, I went on VassarStats and minitab and got a p-value of 0.84(adjustment for ties). I would like to know how I can run this Kruskal wallis test in a loop without the p-values being affected. Is there something I am not seeing that I am doing incorrectly?

Also, I have used getAnywhere(kruskal.test.default) that I saw in a previous post, but I can't find what would cause this to occur when performing the test over and over.

RRookie
  • 11
  • 4
  • Welcome to stackoverflow. You might get better answers to this question on https://stats.stackexchange.com/ – Simon.S.A. Apr 14 '20 at 21:02
  • You need to provide a sample of the data in usable form. Use `dput()` on a sample of the data, not a picture of the data. Any non-parametric test will have difficulty computing p-values when there are ties in the ranks (as there are in your data). Finally, your call to `kruskal.test` is wrong and should produce an error message unless you have used a version from a different package instead of the base function. – dcarlson Apr 14 '20 at 21:14
  • @dcarlson Hello, thank you for taking the time to look at my question. I have edited it and created a more usable sample. Also, as you pointed out, there are ties in my data but I don't understand why that effects the output when looping if I got the same answer using other online calculations. I used base R to run the test, this is the last version of the code that I used, I had added the ("paired=FALSE, p.adjust.methods="none") because I wanted to see if the pvalues would be similar to when the kruskal test is done one gene at a time. Is that the part that is wrong or my whole code? Thanks – RRookie Apr 15 '20 at 00:56

1 Answers1

0

I don't get your results when I isolate the kruskal test line.

df <- mydata
i <- 2
kruskal.test(df[,i], df[,1], data = df, paired=FALSE, p.adjust.methods="none")
# 
#   Kruskal-Wallis rank sum test
# 
# data:  df[, i] and df[, 1]
# Kruskal-Wallis chi-squared = 0.66988, df = 2, p-value = 0.7154

I think your pval assignment line is causing the problem. But you are not taking full advantage of R. First of all, always read the manual page for a function. The kruskal.test function does not take a data= argument unless you are specifying a formula, nor does it take a paired= argument (relevant only to tests between two groups), nor does it take a p.adjust.methods= argument. You are just guessing and wasting your time. Get an R tutorial and spend a day or so learning the basics. Your entire code can be expressed as follows:

pval <- sapply(2:6, function(x) kruskal.test(mydata[,x], mydata[,1])$p.value)
pval
# [1] 0.7153797 0.4424115 0.5360940 0.9816007 0.6118471
dcarlson
  • 10,936
  • 2
  • 15
  • 18