
R's t.test() throws an error when a group has too few observations. However, I need to compare two surveys: one has data on all items, while the other is missing some variables entirely. As a result, the t.test for e.g. q1 compares only NAs (group 1) with actual values (group 2).

Basically, I need the t.test to run for every item anyway and to report an error (rather than abort) where the requirements are not met. I need to perform several t.tests at once (q1-q4), with group as the grouping variable, and write the p-values to an output file.

Thanks for your help!

#create data
          surveydata <- as.data.frame(replicate(1,sample(1:5,1000,rep=TRUE)))
          colnames(surveydata)[1] <- "q1"
          surveydata$q2 <- sample(6, size = nrow(surveydata), replace = TRUE)
          surveydata$q3 <- sample(6, size = nrow(surveydata), replace = TRUE)
          surveydata$q4 <- sample(6, size = nrow(surveydata), replace = TRUE)
          surveydata$group <- c(1,2)

#replace all values of "6" with NA
          surveydata[surveydata == 6] <- NA

#add NAs to group 1 in q1 (q1 only holds 1-5, so every group-1 value becomes NA)
          surveydata$q1[surveydata$group == 1] <- NA

#perform t.test (my attempt; it fails for q1, where group 1 contains only NAs)
library(magrittr) #provides %>%
svy_sel <- c("q1", "q2", "q3", "q4", "group") #vector for selection
temp    <-    surveydata %>% 
              dplyr::select(svy_sel) %>% 
              tidyr::gather(key = variable, value = value, -group) %>%
              dplyr::mutate(value = as.numeric(value)) %>%
              dplyr::group_by(group, variable) %>% 
              dplyr::summarise(value = list(value)) %>%
              tidyr::spread(group, value) %>%     #convert from “long” to “wide” format
              dplyr::group_by(variable) %>%       #t-test will be applied to each member of this group (ie., each variable).
              #the spread columns are named 1 and 2, so they need backticks
              dplyr::mutate(p_value = t.test(unlist(`1`), unlist(`2`))$p.value, na.action = na.exclude)
Boombardeiro
  • Are you sure a T-test is the appropriate statistical test, considering your data? Either way, you should look into the `na.action` argument for `t.test()` – mhovd Sep 07 '20 at 19:28
  • I already tried to do so, but without success. I updated the data provided with "my" approach. – Boombardeiro Sep 07 '20 at 19:39

2 Answers


Here's a base R way to get a tidy data frame of your results:

do.call(rbind, lapply(names(surveydata)[1:4], function(i) {
  tryCatch({
     test <- t.test(as.formula(paste(i, "~ group")), data = surveydata)
     data.frame(question = i, 
                group1 = test$estimate[1], 
                group2 = test$estimate[2], 
                difference = diff(test$estimate),
                p.value = test$p.value, row.names = 1)
     }, error = function(e) {
            data.frame(question = i, 
                group1 = NA,
                group2 = NA, 
                difference = NA,
                p.value = NA, row.names = 1)
     })
  }))

#>    question   group1   group2 difference    p.value
#> 1        q1       NA       NA         NA         NA
#> 11       q2 2.893720 3.128878 0.23515847 0.01573623
#> 12       q3 3.020930 3.038278 0.01734728 0.85905665
#> 13       q4 3.024213 3.066998 0.04278444 0.65910949

I'm not going to get into the debate about whether t-tests are appropriate for Likert-type data. I think the consensus is that with decent-sized samples this should be OK.
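
The question also asks for the failure to be reported and for the p-values to end up in a file. Here is a minimal sketch that extends the same tryCatch() pattern, assuming a CSV file is acceptable. The toy data, the run_test() helper, the note column and the file name are all illustrative and not part of the answer above:

```r
# Toy data (illustrative): q1 is entirely missing in group 1
set.seed(42)
toy <- data.frame(
  q1    = sample(1:5, 30, replace = TRUE),
  q2    = sample(1:5, 30, replace = TRUE),
  group = rep(c(1, 2), 15)
)
toy$q1[toy$group == 1] <- NA

# Hypothetical helper: one row per question, with the error text kept in `note`
run_test <- function(q) {
  tryCatch({
    test <- t.test(as.formula(paste(q, "~ group")), data = toy)
    data.frame(question = q, p.value = test$p.value, note = NA_character_)
  }, error = function(e) {
    data.frame(question = q, p.value = NA_real_, note = conditionMessage(e))
  })
}

report <- do.call(rbind, lapply(c("q1", "q2"), run_test))

# Write the table to a CSV file (the path is a placeholder)
out_file <- file.path(tempdir(), "ttest_pvalues.csv")
write.csv(report, out_file, row.names = FALSE)
```

Keeping the message from conditionMessage() next to the NA makes it visible in the output file why a particular question has no p-value.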

Allan Cameron

You could still do this with dplyr by writing a small function that runs the test only when there is enough data. Here's a function that takes the entries for the two groups and returns the p-value:

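ttfun <- function(v1, v2, ...){
  # the groups are independent samples, so drop NAs per group, not row-wise
  x <- as.numeric(na.omit(unlist(v1)))
  y <- as.numeric(na.omit(unlist(v2)))
  if(length(x) < 2 || length(y) < 2){
    pv <- NA_real_   # too few observations for a t-test
  }
  else{
    pv <- t.test(x, y, ...)$p.value
  }
  pv
}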

Then, you could just call that in your last call to mutate():

svy_sel <- c("q1", "q2", "q3", "q4", "group") #vector for selection
temp    <-    surveydata %>% 
  dplyr::select(dplyr::all_of(svy_sel)) %>%   #all_of() avoids the ambiguity warning for an external vector
  tidyr::gather(key = variable, value = value, -group) %>%
  dplyr::mutate(value = as.numeric(value)) %>%
  dplyr::group_by(group, variable) %>% 
  dplyr::summarise(value = list(value)) %>%      
  tidyr::spread(group, value) %>%     #convert from “long” to “wide” format
  dplyr::group_by(variable) %>%       #t-test will be applied to each member of this group (ie., each variable).
  dplyr::rename('v1'= '1', 'v2' = '2') %>% 
  dplyr::mutate(p_value = ttfun(v1, v2))

> temp
# # A tibble: 4 x 4
# # Groups:   variable [4]
#   variable v1          v2          p_value
#   <chr>    <list>      <list>        <dbl>
# 1 q1       <dbl [500]> <dbl [500]>  NA    
# 2 q2       <dbl [500]> <dbl [500]>   0.724
# 3 q3       <dbl [500]> <dbl [500]>   0.549
# 4 q4       <dbl [500]> <dbl [500]>   0.355
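
For independent samples it matters whether missing values are dropped per group or row-wise across both groups. A small sketch with made-up vectors (not the survey data) shows the difference. t.test() itself removes NAs separately from x and y in the unpaired case:

```r
# Made-up vectors: each group has 5 observed values, at different positions
x <- c(1, 2, NA, 4, 5, 3)
y <- c(NA, 3, 4, 5, 2, 4)

# Row-wise deletion keeps only positions that are complete in BOTH vectors
rowwise <- na.omit(data.frame(x = x, y = y))
n_rowwise  <- nrow(rowwise)     # 4 rows survive
n_observed <- sum(!is.na(x))    # 5 values are actually observed in x

# Compare the p-value after row-wise deletion with the per-group default
p_rowwise  <- t.test(rowwise$x, rowwise$y)$p.value
p_pergroup <- t.test(x, y)$p.value
```

The two p-values differ because row-wise deletion discards an observed value from each group purely because the other group happens to be missing at the same position.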
DaveArmstrong