How can I do a t-test between subsets of data in a data frame in R?

Question

I have a df1 like this:

    Stabr                    Area_name Score1 Score2     POVALL_2018          Score3 
3      AL               Autauga County        2       2       7,587           13.8
4      AL               Baldwin County        2       2      21,069            9.8
7      AL                Blount County        2       1       7,527           13.2
8      AL               Bullock County        3       6       3,610           42.5
9      AL                Butler County        3       6       4,731           24.5
10     AL               Calhoun County        3       2      21,719           19.5
11     AL              Chambers County        6       5       6,181           18.7
12     AL              Cherokee County        2       6       4,180           16.3
13     AL               Chilton County        2       1       7,542           17.3
14     AL               Choctaw County        3      10       2,806           22.1
16     AL                  Clay County        9      10       2,285           17.6
17     AL              Cleburne County        8       4       2,356           16.0

I only care about columns Score1 and Score3. I would like to perform a simple t-test to see whether the counties with a Score1 of 2 have a different Score3 from the counties with a Score1 of 3.

Very concretely, I would like to see if the mean of 13.8, 9.8, 13.2, 16.3, 17.3 is significantly different from the mean of 42.5, 24.5, 19.5, 22.1. How can I do this? I would like to ignore all rows that have a Score1 other than 2 or 3.

How is this done?

dc37 · Accepted Answer · 2020-03-30T04:25:17.247

You can subset your dataframe and perform the t.test:

df1 <- subset(df, Score1 %in% 2:3)

   Stabr      Area_name Score1 Score2 POVALL_2018 Score3
1:    AL  AutaugaCounty      2      2       7,587   13.8
2:    AL  BaldwinCounty      2      2      21,069    9.8
3:    AL   BlountCounty      2      1       7,527   13.2
4:    AL  BullockCounty      3      6       3,610   42.5
5:    AL   ButlerCounty      3      6       4,731   24.5
6:    AL  CalhounCounty      3      2      21,719   19.5
7:    AL CherokeeCounty      2      6       4,180   16.3
8:    AL  ChiltonCounty      2      1       7,542   17.3
9:    AL  ChoctawCounty      3     10       2,806   22.1

And then perform the t.test:

t.test(Score3~Score1,data = df1)


    Welch Two Sample t-test

data:  Score3 by Score1
t = -2.4293, df = 3.3817, p-value = 0.08372
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -29.148945   3.008945
sample estimates:
mean in group 2 mean in group 3 
          14.08           27.15
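If you would rather not create an intermediate data frame, the same Welch test can be run on two plain vectors. This is only a minimal sketch: the small `df` below is rebuilt from the two relevant columns of the question's table, and `g2`, `g3` and `res` are names I made up for illustration.

```r
# Two columns rebuilt from the question's table (illustrative stand-in for the real data)
df <- data.frame(
  Score1 = c(2, 2, 2, 3, 3, 3, 6, 2, 2, 3, 9, 8),
  Score3 = c(13.8, 9.8, 13.2, 42.5, 24.5, 19.5, 18.7, 16.3, 17.3, 22.1, 17.6, 16.0)
)

# Extract Score3 for each group with logical indexing;
# rows with any other Score1 value are simply never selected
g2 <- df$Score3[df$Score1 == 2]
g3 <- df$Score3[df$Score1 == 3]

# Two-vector form of t.test(); the Welch (unequal variance) test is the default
res <- t.test(g2, g3)

res$estimate  # the two group means
res$p.value   # two-sided p-value
```

The returned object is a list, so `res$p.value`, `res$statistic` and `res$conf.int` can be pulled out directly if you need them later in a script.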

As you do not have many samples per group, I would (personally) prefer a non-parametric test such as the Mann-Whitney test (with the function wilcox.test):

wilcox.test(Score3~Score1,data = df1)

    Wilcoxon rank sum test

data:  Score3 by Score1
W = 0, p-value = 0.01587
alternative hypothesis: true location shift is not equal to 0
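To see where that p-value comes from: W = 0 means every Score3 value in group 2 is lower than every value in group 3. With 5 and 4 observations and no ties, only one of the choose(9, 4) = 126 possible rank assignments is that extreme in each direction, so the exact two-sided p-value is 2/126. A small sketch to check this (the vectors are the values listed in the question; `p_exact` and `res` are illustrative names):

```r
# Score3 values for the two groups, as listed in the question
g2 <- c(13.8, 9.8, 13.2, 16.3, 17.3)  # Score1 == 2
g3 <- c(42.5, 24.5, 19.5, 22.1)       # Score1 == 3

# Complete separation: the largest value in g2 is below the smallest in g3
max(g2) < min(g3)

# Exact two-sided p-value for the most extreme ranking
p_exact <- 2 / choose(length(g2) + length(g3), length(g3))

# wilcox.test() uses the exact distribution here (small samples, no ties)
res <- wilcox.test(g2, g3)

c(by_hand = p_exact, from_test = res$p.value)
```

This also shows that, with groups this small, 2/126 is the smallest two-sided p-value the test can return.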

EDIT: t.test based on a value of Score1 (OP's comment)

If you want to compare all values < 3 against all values >= 3, you can add a grouping variable with an ifelse statement such as:

df$Group <- ifelse(df$Score1 <3,"A","B")
t.test(Score3~Group,data = df)

    Welch Two Sample t-test

data:  Score3 by Group
t = -2.429, df = 7.6464, p-value = 0.04262
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -17.429041  -0.382388
sample estimates:
mean in group A mean in group B 
       14.08000        22.98571
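Before trusting a threshold split like this, it can help to check how many rows end up in each group. The sketch below is one possible variation, not the only way to do it: it builds the group as a factor directly from the logical condition instead of using ifelse, and the small `df` is again rebuilt from the question's table for illustration.

```r
# Two columns rebuilt from the question's table (illustrative stand-in for the real data)
df <- data.frame(
  Score1 = c(2, 2, 2, 3, 3, 3, 6, 2, 2, 3, 9, 8),
  Score3 = c(13.8, 9.8, 13.2, 42.5, 24.5, 19.5, 18.7, 16.3, 17.3, 22.1, 17.6, 16.0)
)

# "A" when Score1 < 3, "B" otherwise; fixing the level order
# keeps A as the first group in the t.test output
df$Group <- factor(df$Score1 < 3, levels = c(TRUE, FALSE), labels = c("A", "B"))

# Group sizes, worth a look before running the test
table(df$Group)

res <- t.test(Score3 ~ Group, data = df)
res$estimate
res$p.value
```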

Does that answer your question?

Reproducible example:

df <- structure(list(Stabr = c("AL", "AL", "AL", "AL", "AL", "AL", 
"AL", "AL", "AL", "AL", "AL", "AL"), Area_name = c("AutaugaCounty", 
"BaldwinCounty", "BlountCounty", "BullockCounty", "ButlerCounty", 
"CalhounCounty", "ChambersCounty", "CherokeeCounty", "ChiltonCounty", 
"ChoctawCounty", "ClayCounty", "CleburneCounty"), Score1 = c(2L, 
2L, 2L, 3L, 3L, 3L, 6L, 2L, 2L, 3L, 9L, 8L), Score2 = c(2L, 2L, 
1L, 6L, 6L, 2L, 5L, 6L, 1L, 10L, 10L, 4L), POVALL_2018 = c("7,587", 
"21,069", "7,527", "3,610", "4,731", "21,719", "6,181", "4,180", 
"7,542", "2,806", "2,285", "2,356"), Score3 = c(13.8, 9.8, 13.2, 
42.5, 24.5, 19.5, 18.7, 16.3, 17.3, 22.1, 17.6, 16)), row.names = c(NA, 
-12L), class = c("data.table", "data.frame"))

Thanks for the help. If I have about 50 in each group is it fair to use a t.test? — Evan, Mar 30 '20 at 04:19
OK, I see. I was proposing the `wilcox.test` as an alternative if you have a smaller dataset. Glad that it works for you. — dc37, Mar 30 '20 at 04:19
If, instead of comparing 2 against 3, I want to compare all values < 3 against all values >= 3, how would I do that? — Evan, Mar 30 '20 at 04:21
You can create a variable that splits your data into two groups with an `ifelse` statement. I edited my answer to show you how. — dc37, Mar 30 '20 at 04:26
It is just two groups that I create: A is for values of Score1 < 3 and B is for values of Score1 >= 3. You can read more about the `ifelse` statement by typing ?ifelse in the R console. — dc37, Mar 30 '20 at 04:30
