R: t-test between rows within each factor level

Question

This is the data frame I'm trying to work on:

m <- matrix(rnorm(108, mean = 5000, sd = 1000), nrow = 36) 
colnames(m) <- paste('V', 1:3, sep = '') 
df <- data.frame(type = factor(rep(c('T1', 'T2', 'T3', 'T4', 'T5', 
            'T6', 'T7', 'T8', 'T9'), each = 4)), 
            treatment = factor(rep(rep(c('C','P', 'N', 'S'), each = 1), 
            9)), 
            as.data.frame(m))

I want to know how I can perform a t-test between the rows within each "type". Here's an example of the t-tests I want for type T1:

t.test(df[1,3:5], df[2, 3:5])
t.test(df[1,3:5], df[3, 3:5])
t.test(df[1,3:5], df[4, 3:5])

I'm trying to figure out how to loop through all rows and collect the p-values from every t-test (along with the type and treatment for identification), instead of running each comparison manually. Any help or suggestion would be greatly appreciated.

The dimensions of `m` and the `colnames(m) <-` assignment do not match, i.e. there are only 3 columns. – akrun Sep 22 '17 at 20:14
Sorry about that, just fixed it. It's supposed to be 3 columns. – Tpg333 Sep 22 '17 at 20:16
You could try `library(data.table); setDT(df)[, {d1 <- .SD[treatment == "C"][rep(1, 3)]; d2 <- .SD[treatment != "C"]; unlist(lapply(seq_len(nrow(d1)), function(i) t.test(d1[i], d2[i])$p.value))}, type, .SDcols = V1:V3]`. Make sure to apply some adjustment to the p-values. – akrun Sep 22 '17 at 20:26
@akrun This doesn't give you all combinations of `treatment` within each `type`, but it is a very good solution nonetheless. – acylam Sep 22 '17 at 21:18

acylam · Accepted Answer · 2017-09-29T20:09:13.653

Something like this:

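library(dplyr)

t_tests <- df %>%
  split(.$type) %>%                        # one data frame per type
  lapply(function(x) {
    t(x[3:5]) %>%                          # transpose so each treatment becomes a column
      data.frame() %>%
      setNames(x$treatment) %>%
      combn(2, simplify = FALSE) %>%       # every pair of treatment columns
      lapply(function(pair) {
        data.frame(treatment = paste0(names(pair), collapse = ", "),
                   p_value = t.test(pair[, 1], pair[, 2])$p.value)
      }) %>%
      do.call(rbind, .)
  }) %>%
  do.call(rbind, .) %>%                    # row names become "T1.1", "T1.2", ...
  mutate(type = sub("[.].+", "", row.names(.)))   # recover the type from the row names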

Result:

> head(t_tests, 10)
   treatment   p_value type
1       C, P 0.6112274   T1
2       C, N 0.6630060   T1
3       C, S 0.5945135   T1
4       P, N 0.9388568   T1
5       P, S 0.8349370   T1
6       N, S 0.9049995   T1
7       C, P 0.3274583   T2
8       C, N 0.9755364   T2
9       C, S 0.7391661   T2
10      P, N 0.3177871   T2
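
As akrun's comment on the question points out, running this many pairwise tests calls for some p-value adjustment. Here is a minimal sketch using base R's `p.adjust`, applied to the six T1 p-values shown above. Adjusting within each `type` and using Holm's method are assumptions made for illustration, not something this answer prescribes:

```r
# The six T1 rows from the result above, retyped by hand
res <- data.frame(
  type      = "T1",
  treatment = c("C, P", "C, N", "C, S", "P, N", "P, S", "N, S"),
  p_value   = c(0.6112274, 0.6630060, 0.5945135, 0.9388568, 0.8349370, 0.9049995)
)

# Adjust the p-values separately within each type (assumed grouping)
res$p_adj <- ave(res$p_value, res$type,
                 FUN = function(p) p.adjust(p, method = "holm"))
res
```

Passing `method = "BH"` instead would control the false discovery rate rather than the family-wise error rate.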

Edits (Added an extra level "file" to the dataset):

library(dplyr)

t_tests <- df %>%
  split(.$file) %>%                        # outer level: one data frame per file
  lapply(function(y) {
    split(y, y$type) %>%                   # inner level: one data frame per type
      lapply(function(x) {
        t(x[4:6]) %>%                      # V1..V3 are now columns 4 to 6
          data.frame() %>%
          setNames(x$treatment) %>%
          combn(2, simplify = FALSE) %>%
          lapply(function(pair) {
            data.frame(treatment = paste0(names(pair), collapse = ", "),
                       p_value = t.test(pair[, 1], pair[, 2])$p.value)
          }) %>%
          do.call(rbind, .)
      }) %>%
      do.call(rbind, .) %>%
      mutate(type = sub("[.].+", "", row.names(.)))
  }) %>%
  do.call(rbind, .) %>%
  mutate(file = sub("[.].+", "", row.names(.)))

Result:

   treatment   p_value type  file
1       C, P 0.3903450   T1 file1
2       C, N 0.3288727   T1 file1
3       C, S 0.0638599   T1 file1
4       P, N 0.6927599   T1 file1
5       P, S 0.1159615   T1 file1
6       N, S 0.2184015   T1 file1
7       C, P 0.1147805   T2 file1
8       C, N 0.4961888   T2 file1
9       C, S 0.9048607   T2 file1
10      P, N 0.4203666   T2 file1
11      P, S 0.3425908   T2 file1
12      N, S 0.7262478   T2 file1
13      C, P 0.6300293   T3 file1
14      C, N 0.8255837   T3 file1
15      C, S 0.7140522   T3 file1
16      P, N 0.4768694   T3 file1
17      P, S 0.3992130   T3 file1
18      N, S 0.8740219   T3 file1
19      C, P 0.2434270   T4 file1
20      C, N 0.2713622   T4 file1

Note about edit:

The OP wanted an extra top-level grouping, `file`, added to the data. This only requires wrapping the code in another `split` + `lapply` and adding one more `do.call` at the end.

New Data:

m <- matrix(rnorm(324, mean = 5000, sd = 1000), nrow = 108) 
colnames(m) <- paste('V', 1:3, sep = '') 
df <- data.frame(type = factor(rep(c('T1', 'T2', 'T3', 'T4', 'T5', 'T6', 'T7', 'T8', 'T9'), each = 4)), 
                 treatment = factor(rep(rep(c('C','P', 'N', 'S'), each = 1), 9)), 
                 file = factor(rep(c("file1", "file2", "file3"), each = 36)), 
                 as.data.frame(m))
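
The nested `split` + `lapply` calls can also be flattened into a single split over every `type`/`file` combination. The following is only a sketch in base R, not the method used in this answer. The helper `pairwise_p` and the rule that value columns are those named `V1`, `V2`, ... are assumptions made for illustration, and the data is rebuilt with the same layout as above so the block runs on its own:

```r
set.seed(1)  # only so the sketch is reproducible

# Same layout as the data above: 3 files x 9 types x 4 treatments
m <- matrix(rnorm(324, mean = 5000, sd = 1000), nrow = 108)
colnames(m) <- paste0("V", 1:3)
df <- data.frame(type      = rep(rep(paste0("T", 1:9), each = 4), times = 3),
                 treatment = rep(c("C", "P", "N", "S"), times = 27),
                 file      = rep(paste0("file", 1:3), each = 36),
                 as.data.frame(m))

# Assumption: value columns are exactly those named V1, V2, ...
value_cols <- grep("^V[0-9]+$", names(df), value = TRUE)

# Hypothetical helper: all pairwise t-tests within one type/file group
pairwise_p <- function(g) {
  trts <- unique(as.character(g$treatment))
  if (length(trts) < 2) return(NULL)       # combn() needs at least two treatments
  do.call(rbind, lapply(combn(trts, 2, simplify = FALSE), function(pr) {
    a <- unlist(g[g$treatment == pr[1], value_cols])
    b <- unlist(g[g$treatment == pr[2], value_cols])
    data.frame(file      = as.character(g$file[1]),
               type      = as.character(g$type[1]),
               treatment = paste(pr, collapse = ", "),
               p_value   = t.test(a, b)$p.value)
  }))
}

# One split over both grouping columns replaces the nested lapply calls
groups  <- split(df, list(df$type, df$file), drop = TRUE)
t_tests <- do.call(rbind, lapply(groups, pairwise_p))
rownames(t_tests) <- NULL
head(t_tests)
```

Adding a further grouping level would then only mean adding another column to the `list(...)` passed to `split` and carrying it through in `pairwise_p`, rather than another layer of nesting.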

edited Sep 29 '17 at 20:09

answered Sep 22 '17 at 21:03

acylam


Thank you so much, the code works beautifully! As a beginner in R, I'm wondering whether there is a way to modify the code so it works dynamically with any number of types and treatments? – Tpg333 Sep 28 '17 at 21:28
@Tpg333 This is already generalized to work for any number of type and treatment. I only showed 10 rows of the resulting dataframe to not clutter my answer. – acylam Sep 29 '17 at 03:40
Thanks for getting back to me, and sorry for the confusion. I meant to ask: what if not every `type` gets the same `treatment`s? Let's say T3 only gets treatments C, P and N, T8 gets only the 2 treatments P and S, and the rest of the `type`s get all 4 treatments. Also, the v1, v2, v3 columns might change, meaning there could be just 3 columns or 5 columns (v1, v2, v3, v4, v5). So I'm wondering whether there is a way to make the code work dynamically based on the data frame structure? – Tpg333 Sep 29 '17 at 16:17
@Tpg333 In theory, this should work for what you specified above. `t.test` only takes two vectors, and `combn` automatically expands your treatments to all combinations possible. So it doesn't matter how many treatments each type gets and how many types there are. Are you asking because you have tried a different dataset and my code didn't work? Maybe you can post that new dataset you have in mind and I can see if it works. – acylam Sep 29 '17 at 16:42
You're right, your code still works with a different number of types and treatments. It was my fault, I wasn't testing it on a correct dataset. The actual dataset looks like this. – Tpg333 Sep 29 '17 at 19:03
`m <- matrix(rnorm(324, mean = 5000, sd = 1000), nrow = 108); colnames(m) <- paste('V', 1:3, sep = ''); df <- data.frame(type = factor(rep(c('T1', 'T2', 'T3', 'T4', 'T5', 'T6', 'T7', 'T8', 'T9'), each = 4)), treatment = factor(rep(rep(c('C','P', 'N', 'S'), each = 1), 9)), file = factor(rep(c("file1", "file2", "file3"), each = 36)), as.data.frame(m))` Each file has 36 rows, and the three files are combined in one dataset. How can I modify your code so it also splits the output by the 3 files? – Tpg333 Sep 29 '17 at 19:05
@Tpg333 See my edits. Since `file` is at the top level, you can just wrap the entire code in another `split` + `lapply` and add a `do.call` at the end. This becomes complicated if you add even more layers, though. I'm sure there are better ways of doing this than multiple nested `lapply` calls. – acylam Sep 29 '17 at 20:11
Thank you so much, you're a life-saver! I've been working on this for days. The code works perfectly. Btw, I tried to upvote your answer, but my reputation is not high enough for the vote to show up. – Tpg333 Sep 29 '17 at 20:35
@Tpg333 Accepting my answer would be perfectly fine :) You can do that by clicking on the grey check mark under the down arrow. – acylam Sep 29 '17 at 20:46
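
To illustrate the point from the comments that `combn` adapts to however many treatments a type has, here is a small sketch. The treatment sets are hypothetical, following the scenario Tpg333 describes (T3 with three treatments, T8 with two):

```r
# Hypothetical treatment sets per type (unbalanced on purpose)
treatments <- list(T1 = c("C", "P", "N", "S"),
                   T3 = c("C", "P", "N"),
                   T8 = c("P", "S"))

# Build the "A, B" label for every pair of treatments within each type
pairs <- lapply(treatments, function(trt) {
  vapply(combn(trt, 2, simplify = FALSE), paste, character(1), collapse = ", ")
})

pairs
lengths(pairs)   # 6, 3 and 1 pairs respectively
```

Two caveats apply to the answer's code: a type with only one treatment would make `combn(..., 2)` stop with an `n < m` error, and `x[3:5]` selects the value columns by position, so a data frame with five value columns would need that index widened or the columns selected by name.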
