Creating a t-test loop over a dataframe using an index

Question

Say I have a 1000-row, 6-column dataframe with columns a1, a2, b1, b2, c1, c2. I want to run t-tests on the a's, b's, and c's and get an output df with three columns for the t-values of a, b, and c and another three for the significance of those values, six columns in total. The problem is with the rows: I want to loop over chunks of 20 rows, which would make the output a 50-row (1000/20), 6-column df.

I have already tried creating an index column for my initial df that repeats 1 for the first 20 rows, 2 for the next 20 rows, and so on.

    convert_n <- function(df) {
    df <- df %T>% {.$n_for_t_tests = rep(c(1:(nrow(df)/20)), each = 20)}
    }
    df <- convert_n(df)

However, I can't seem to find a way to properly utilize the items in this column as indices for a "for" or any kind of loop.

Below is the relevant code, which creates a 1-row, 6-column df. I need to modify the [0:20] parts and create a loop that does this for each group of 20 rows and binds the results.

    t_test_a <- t.test(df$a1[0:20], df$a2[0:20], paired = TRUE, conf.level = 0.95)
    t_test_b <- t.test(df$b1[0:20], df$b2[0:20], paired = TRUE, conf.level = 0.95)
    t_test_c <- t.test(df$c1[0:20], df$c2[0:20], paired = TRUE, conf.level = 0.95)
    t_tests_df <- data.frame(t_a = t_test_a$statistic[["t"]], 
                             t_b = t_test_b$statistic[["t"]],
                             t_c = t_test_c$statistic[["t"]])

    t_tests_df <- t_tests_df %T>% {.$dif_significance_a = ifelse(.$t_a > 
                                   2, "YES", "NO")} %T>% 
                                  {.$dif_significance_b = ifelse(.$t_b > 
                                   2, "YES", "NO")} %T>% 
                                  {.$dif_significance_c = ifelse(.$t_c > 
                                   2, "YES", "NO")} %>% 
                                  dplyr::select(t_a, dif_significance_a, 
                                                t_b, dif_significance_b,
                                                t_c, dif_significance_c)

Thank you in advance for your help.

Indexing starts at 1, so use `df$a1[1:20]` and so on. — jogo, Aug 29 '19 at 12:24
Actually, this worked, too. — arthur_fleck, Aug 29 '19 at 17:14
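
The indexing point in the comment above can be checked directly. The following is a small sketch using a made-up vector `x` (not data from the question): R silently drops an index of 0, so `[0:20]` happens to return the first 20 elements, but carrying the same pattern into later chunks makes them overlap.

```r
x <- 101:200                    # hypothetical data, only for illustration

length(x[0:20])                 # 20: the 0 index selects nothing
identical(x[0:20], x[1:20])     # TRUE: same elements as 1:20

# The second chunk is where the off-by-one shows up
length(x[20:40])                # 21 elements, element 20 is included again
length(x[21:40])                # 20 elements, the intended second chunk
```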

score 1 · Answer 1 · answered Aug 29 '19 at 13:32

This is not the prettiest solution, but I wrote a for loop like this:

df <- data.frame(a1 = sample(1000, 1000),
                 a2 = sample(1000, 1000),
                 b1 = sample(1000, 1000),
                 b2 = sample(1000, 1000),
                 c1 = sample(1000, 1000),
                 c2 = sample(1000, 1000))

# Preallocate one result row per 20-row chunk
df_ttest <- data.frame(p_a = numeric(50),
                       t_a = numeric(50),
                       p_b = numeric(50),
                       t_b = numeric(50),
                       p_c = numeric(50),
                       t_c = numeric(50))

# Chunk boundaries: 0, 20, 40, ..., 1000
index <- 0:50 * 20

# Stop one short of the last boundary so index[i + 1] always exists
for (i in seq_len(length(index) - 1)) {
    # Chunk i covers rows index[i] + 1 through index[i + 1] (20 rows, no overlap)
    rows <- (index[i] + 1):index[i + 1]

    df_ttest$p_a[i] <- t.test(df$a1[rows])$p.value
    df_ttest$p_b[i] <- t.test(df$b1[rows])$p.value
    df_ttest$p_c[i] <- t.test(df$c1[rows])$p.value

    df_ttest$t_a[i] <- t.test(df$a1[rows])$statistic
    df_ttest$t_b[i] <- t.test(df$b1[rows])$statistic
    df_ttest$t_c[i] <- t.test(df$c1[rows])$statistic
}

This gives a 50x6 dataframe with separate columns of p and t values for every 20-row chunk of a, b and c.

You could even go further and make a nested for loop to cycle through each row in df_ttest to make this a bit prettier.

The second part of this was exactly what I was looking for. I am sure it will work for me with some small modifications, thanks! — arthur_fleck, Aug 29 '19 at 17:25
I actually couldn't make it work, since the small modifications turned out to be "not that small"... I see the logic in your index usage, but my t-tests need to be paired, and when I introduce another argument inside t.test, such as "df$a2[index[i] : index[i+1]]", I run into problems. — arthur_fleck, Aug 29 '19 at 18:44
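
For the paired case raised in the comment above, one possible adaptation of the loop is sketched below. It assumes the same six-column layout as in the question; the names `chunk_size`, `starts`, and `rows`, and the seed, are introduced here only for illustration. Each chunk's row range is computed once and reused for both columns of a pair, so the two vectors passed to `t.test()` always have the same length.

```r
set.seed(1)  # arbitrary seed so the sketch is reproducible
df <- data.frame(a1 = sample(1000, 1000), a2 = sample(1000, 1000),
                 b1 = sample(1000, 1000), b2 = sample(1000, 1000),
                 c1 = sample(1000, 1000), c2 = sample(1000, 1000))

chunk_size <- 20
starts <- seq(1, nrow(df), by = chunk_size)   # 1, 21, 41, ..., 981
n_chunks <- length(starts)

# Preallocate one result row per chunk
df_ttest <- data.frame(t_a = numeric(n_chunks), p_a = numeric(n_chunks),
                       t_b = numeric(n_chunks), p_b = numeric(n_chunks),
                       t_c = numeric(n_chunks), p_c = numeric(n_chunks))

for (i in seq_along(starts)) {
  # Same row range for both members of each pair
  rows <- starts[i]:(starts[i] + chunk_size - 1)

  # Inner loop over the three column pairs (a1/a2, b1/b2, c1/c2)
  for (g in c("a", "b", "c")) {
    x <- df[[paste0(g, "1")]][rows]
    y <- df[[paste0(g, "2")]][rows]
    res <- t.test(x, y, paired = TRUE, conf.level = 0.95)

    df_ttest[i, paste0("t_", g)] <- unname(res$statistic)
    df_ttest[i, paste0("p_", g)] <- res$p.value
  }
}

dim(df_ttest)
```

Running each test once and storing the result in `res` also avoids calling `t.test()` twice per column pair.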

score 1 · Accepted Answer · answered Aug 29 '19 at 13:55

You can use split() and sapply():

set.seed(42)

df <- data.frame(a1 = sample(1000, 1000), a2 = sample(1000, 1000),
                 b1 = sample(1000, 1000), b2 = sample(1000, 1000),
                 c1 = sample(1000, 1000), c2 = sample(1000, 1000))

group <- gl(50, 20)

D <- split(df, group)

myt <- function(Di) 
  with(Di, c(at=t.test(a1, a2)$statistic, ap=t.test(a1, a2)$p.value,
    bt=t.test(b1, b2)$statistic, bp=t.test(b1, b2)$p.value,
    ct=t.test(c1, c2)$statistic, cp=t.test(c1, c2)$p.value))

sapply(D, FUN=myt) ### or
t(sapply(D, FUN=myt))
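
Since the question asks for paired tests and significance flags, a variant of the same split() and sapply() approach might look like the sketch below. The helper name `paired_t`, the `.sig` column names, and the 0.05 cutoff on the p-value are assumptions made for this example; the question itself flags significance with `t > 2`.

```r
set.seed(42)

df <- data.frame(a1 = sample(1000, 1000), a2 = sample(1000, 1000),
                 b1 = sample(1000, 1000), b2 = sample(1000, 1000),
                 c1 = sample(1000, 1000), c2 = sample(1000, 1000))

group <- gl(50, 20)   # factor with 50 levels, each repeated 20 times

# Hypothetical helper: one paired t-test per column pair, run once each
paired_t <- function(Di) {
  unlist(lapply(c(a = "a", b = "b", c = "c"), function(g) {
    res <- t.test(Di[[paste0(g, "1")]], Di[[paste0(g, "2")]], paired = TRUE)
    c(t = unname(res$statistic), p = res$p.value)
  }))
}

# One row per chunk; columns a.t, a.p, b.t, b.p, c.t, c.p
out <- as.data.frame(t(sapply(split(df, group), paired_t)))

# Significance flags from the p-values (0.05 is an assumed threshold)
for (g in c("a", "b", "c")) {
  out[[paste0(g, ".sig")]] <- ifelse(out[[paste0(g, ".p")]] < 0.05, "YES", "NO")
}

head(out)
```

If only the t-values and flags are wanted, the `.p` columns can be dropped afterwards to get the six-column layout described in the question.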
