
I'm interested in how results differ when means are compared with a two-sample t-test as opposed to a one-sample t-test on the same data. Two assessments, in which "counts" are sampled randomly from multiple populations ("Group_ID") using two different techniques ("source"), are compared. A result of p > 0.05 is taken to indicate that the means are not statistically different (i.e. both assessment techniques give similar results). Raw data for source=A is reliable and always available. Raw data for source=B may not be available, but a mean will always be provided. Using a test data set, I want to examine how the t.test outcomes differ between the one-sample and two-sample versions.
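For a single group, the two forms I have in mind look something like this (a minimal sketch with illustrative object names and made-up numbers, not my real data):

# Illustrative only: 'a_counts' are raw counts from source A for one Group_ID,
# 'b_counts' are the matching raw counts from source B (when available),
# and 'b_mean' is the mean supplied for source B when its raw data are missing
a_counts <- c(7, 8, 5, 5, 7, 3, 4)
b_counts <- c(6, 9, 4, 7, 8, 5, 6)
b_mean   <- mean(b_counts)

# Two-sample (Welch) t-test: uses the raw data, and hence the variance, of both sources
t.test(a_counts, b_counts, alternative = "two.sided", var.equal = FALSE)

# One-sample t-test: treats the source-B mean as a fixed reference value
t.test(a_counts, mu = b_mean, alternative = "two.sided")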

Using dplyr and broom functions, I have worked out how to run multiple two-sample t-tests across the cases ("Group_ID") where both assessments were conducted. Data from the two sources are merged into a three-column data frame containing the group ("Group_ID"), the raw counts ("count"), and the source of each count ("source").

glimpse(Data)

Observations: 2,552
Variables: 3
$ Group_ID    <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
$ count       <dbl> 7, 8, 5, 5, 7, 3, 4, 2, 8, 11, 12, 1, 3, 5, 5, 12, 1, 5...
$ source      <chr> "B", "B", "B", "B", "B", "B", "B", "B",...
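For reference, a merge like that could be built along these lines (a sketch only; source_A_data and source_B_data are assumed names, not objects from my actual workflow):

library(dplyr)

# Assumed inputs: one data frame per source, each with Group_ID and count columns
Data <- bind_rows(
    source_A_data %>% mutate(source = "A"),
    source_B_data %>% mutate(source = "B")
) %>%
    mutate(Group_ID = factor(Group_ID))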

The two-sample test compares the means from both sources and considers variation in both data sets using the following:

Data_Stats <- Data %>% 
    group_by(Group_ID) %>%
    do(tidy(t.test(count ~ source, alt="two.sided", conf=0.95, var.eq=FALSE, paired=FALSE, data = .)))

Results in

Observations: 38
Variables: 11
Groups: Group_ID [38]
$ Group_ID <fct> 1, 1029, 1032, 1033, 1041, 1044, 1064, 1065, 1067, 1080, 1081, 1083, 1084, 117, 127, 180, 2...
$ estimate    <dbl> -0.4250000, -6.5000000, -1.1944444, 0.3437500, -5.2250000, -1.4375000, -1.6250000, -1.48387...
$ estimate1   <dbl> 5.250000, 9.166667, 5.833333, 6.156250, 5.375000, 3.937500, 2.075000, 6.000000, 4.108108, 9...
$ estimate2   <dbl> 5.675000, 15.666667, 7.027778, 5.812500, 10.600000, 5.375000, 3.700000, 7.483871, 6.540541,...
$ statistic   <dbl> -0.42469044, -3.42643903, -1.19922603, 0.32509809, -3.36599817, -1.94947775, -2.47005992, -...
$ p.value     <dbl> 6.723526e-01, 1.480949e-03, 2.355386e-01, 7.463614e-01, 1.509467e-03, 5.649284e-02, 1.57612...
$ parameter   <dbl> 70.66653, 38.05593, 55.46167, 54.07284, 47.94418, 53.45682, 75.67387, 44.38453, 49.87894, 6...
$ conf.low    <dbl> -2.420560, -10.340117, -3.190125, -1.776090, -8.346179, -2.916196, -2.935370, -3.969856, -5...
$ conf.high   <dbl> 1.570559944, -2.659882919, 0.801236360, 2.463590202, -2.103821060, 0.041195928, -0.31462953...
$ method      <chr> "Welch Two Sample t-test", "Welch Two Sample t-test", "Welch Two Sample t-test", "Welch Two...
$ alternative <chr> "two.sided", "two.sided", "two.sided", "two.sided", "two.sided", "two.sided", "two.sided", ...
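As an aside, do() has been superseded in more recent versions of dplyr; an equivalent formulation using group_modify() (my sketch, assuming the same Data object) would be:

library(dplyr)
library(broom)

Data_Stats <- Data %>%
    group_by(Group_ID) %>%
    group_modify(~ tidy(t.test(count ~ source, alternative = "two.sided",
                               conf.level = 0.95, var.equal = FALSE,
                               paired = FALSE, data = .x))) %>%
    ungroup()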

I know I can get means for each case using:

Data_means <- Data %>% 
    group_by(Group_ID) %>% 
    summarize(count_mean = mean(count))
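Presumably a per-source variant of the same summary (a sketch I have not run against my real data) would give the source=B means that I want to feed into mu:

Data_means_by_source <- Data %>% 
    group_by(Group_ID, source) %>% 
    summarize(count_mean = mean(count)) %>% 
    ungroup()

B_means <- Data_means_by_source %>% filter(source == "B")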

I'm looking for a suggestion for how best to use the mean for each Group_ID from source=B as the value of the "mu=" argument in the t.test function, so that it is compared against the source=A counts in a one-sample t-test for each of the 38 different Group_IDs.

1 Answer


This is a problem that doesn't lend itself particularly well to data wrangling with dplyr; it's probably easier to build the tests using split.data.frame and friends (see the base R approach in the edit below), but here's how you might go about it without splicing different data frames together.

First, I need some reproducible data along the same lines as the data in your question:

library(tidyverse)

set.seed(69)
df <- data.frame(Group_ID = factor(rep(1:4, each = 40)),
                 count = sample(10, 160, T) + rep(1:2, 80) + rep(1:4, each = 40),
                 source = factor(rep(c("A", "B"), 80)))

Now we can get the two-sample p values in much the same way you did. The trick is then to take the mean count per source within each Group_ID and shift those means with lead() so that each "B" average sits in the corresponding "A" rows. That value can then be used as mu in a one-sample t-test on the "A" counts. When we summarise, we end up with both the one-sample and the two-sample p values.

df                                                                      %>%
group_by(Group_ID)                                                      %>%
mutate(two_group_pval = t.test(count ~ source)$p.value)                 %>%
group_by(Group_ID, source)                                              %>%
mutate(mean_A = mean(count))                                            %>%
arrange(source, .by_group = T)                                          %>%
group_by(Group_ID)                                                      %>%
mutate(mean_B = lead(mean_A, length(which(source == "B"))))             %>%
filter(source == "A")                                                   %>%
group_by(Group_ID)                                                      %>%
mutate(one_group = t.test(count, mu = mean(mean_B, na.rm = T))$p.value) %>%
summarise(observations = length(count),
          mean_A = mean(mean_A, na.rm = T),
          mean_B = mean(mean_B, na.rm = T),
          one_sample_p_value = mean(one_group),
          two_sample_p_value = mean(two_group_pval))

#> # A tibble: 4 x 6
#>   Group_ID observations mean_A mean_B one_sample_p_value two_sample_p_value
#>   <fct>           <int>  <dbl>  <dbl>              <dbl>              <dbl>
#> 1 1                  20    6.6   8.65            0.00341             0.0201
#> 2 2                  20    8.3   9.85            0.0364              0.0999
#> 3 3                  20    9.9  11.5             0.0103              0.0600
#> 4 4                  20   10.4  11.4             0.122               0.290 

You'll notice that the p values are higher for the two-sample test in my data. That is largely because the two-sample test accounts for sampling variation in both groups, whereas the one-sample test treats the source-B mean as a fixed, error-free value, so the one-sample test will tend to give smaller p values. Note also that these samples were drawn from a uniform distribution rather than a bell curve, so the t-test's normality assumption isn't strictly met; you should check that your own data are approximately normal before relying on either the one-sample or two-sample t-test. If they aren't, consider wilcox.test instead.
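For example, a grouped Wilcoxon version of the two-sample comparison (a sketch using the same df as above, not part of the pipeline you'd need) might look like:

df %>%
    group_by(Group_ID) %>%
    summarise(wilcox_p = wilcox.test(count ~ source)$p.value)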


EDIT

Based on the OP's requirements being made clearer in the comments, the above code doesn't work on their data. Here's how you could tackle the problem in base R for an arbitrary number of Group_ID levels, using fairly self-explanatory code:

multi_ss_t_test <- function(x, y) as.numeric(t.test(x$count, mu = y)$p.value)
multi_ts_t_test <- function(x, y) as.numeric(t.test(x$count, y$count)$p.value)

source_dfs <- split.data.frame(df, df$source)

A_groups   <- split.data.frame(source_dfs$A, source_dfs$A$Group_ID)
B_groups   <- split.data.frame(source_dfs$B, source_dfs$B$Group_ID)

B_means    <- tapply(source_dfs$B$count, source_dfs$B$Group_ID, mean)
A_means    <- tapply(source_dfs$A$count, source_dfs$A$Group_ID, mean)

ss_pvals   <- mapply(multi_ss_t_test, A_groups, B_means)
ts_pvals   <- mapply(multi_ts_t_test, A_groups, B_groups)

result     <- data.frame(group = levels(df$Group_ID),
                         source_A_mean = A_means,
                         source_B_means = B_means,
                         one_sample_pval = ss_pvals,
                         two_sample_pval = ts_pvals)

Now if I supply a data frame with 38 levels, you can see that it will output both one-sample and two-sample p values for each Group_ID:

set.seed(69)
df <- data.frame(Group_ID = factor(rep(rep(1:38, each = 20), 2)),
                 count = c(sample(10:40, 760, T), sample(12:42, 760, T)),
                 source = rep(c("A", "B"), each = 760))

Running this through the code above gives:

as_tibble(result)
#> # A tibble: 38 x 5
#>    group source_A_mean source_B_means one_sample_pval two_sample_pval
#>    <fct>         <dbl>          <dbl>           <dbl>           <dbl>
#>  1 1              24.5           25.6          0.620           0.724 
#>  2 2              23.6           25.0          0.407           0.608 
#>  3 3              24.2           29.7          0.0154          0.0438
#>  4 4              26             22.6          0.123           0.264 
#>  5 5              26.6           25.6          0.531           0.683 
#>  6 6              26.8           24.7          0.303           0.414 
#>  7 7              25.5           26            0.807           0.852 
#>  8 8              24.3           28.9          0.0167          0.0887
#>  9 9              23.6           26.6          0.137           0.255 
#> 10 10             25.8           27.1          0.533           0.651 
#> # ... with 28 more rows

A more efficient approach if you just want the p values would be:

get_pvals <- function(x)
{
  # Note: the formula interface runs the two-sample (Welch) test, while
  # supplying mu runs the one-sample test against the source-B mean
  c(two_sample_p_value = t.test(x$count ~ x$source)$p.value,
    one_sample_p_value = t.test(x$count[x$source == "A"], 
                                mu = mean(x$count[x$source == "B"]))$p.value) 
}

split(df, list(group = df$Group_ID)) %>% sapply(get_pvals) %>% t %>% as.data.frame()
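If you want the Group_ID as a proper column rather than as row names, one small addition (my sketch, assuming the tibble package is available) would be:

split(df, df$Group_ID) %>% 
  sapply(get_pvals) %>% 
  t() %>% 
  as.data.frame() %>% 
  tibble::rownames_to_column("Group_ID")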
Allan Cameron
  • Thank you so much. This code works well as long as I use these generic variable names and source = A or B. Unfortunately when I try to use the same code with the variable names in my dataframe that have meaning to me, the code hangs up and will not complete. I don't get any error message but the 4 x 6 Tibble does not output. – D. Reid Jan 27 '20 at 19:02
  • Are you definitely replacing every instance of "Group_ID", "source" and "count" with your own column names? Or do your actual factors have more than 2 levels? – Allan Cameron Jan 27 '20 at 19:15
  • Group_ID has 38 levels; there are 38 stands where we collected the "A" data. The variable 'source' has only two levels. – D. Reid Jan 27 '20 at 20:16
  • OK @D.Reid my new base R solution should work for you. – Allan Cameron Jan 27 '20 at 21:20