
There are several Stack Overflow posts about situations where t.test() in R produces an error saying "data are essentially constant". As I understand it, this happens because there is not enough difference between the groups (there is no variation) to run t.test(). (Correct me if the cause is something else.)

I'm in this situation, and I would like to fix it by altering my data in a way that doesn't drastically change its statistical properties, so the t-test stays correct. Could I add a very small amount of variation to the data (e.g. change 0.301029995663981 to 0.301029995663990), or what else can I do?

For example, this is my data:

# Create the data frame
data <- data.frame(Date = c("2021.08","2021.08","2021.09","2021.09","2021.09","2021.10","2021.10","2021.10","2021.11","2021.11","2021.11","2021.11","2021.11","2021.12","2021.12","2022.01","2022.01","2022.01","2022.01","2022.08","2022.08","2022.08","2022.08","2022.08","2022.09","2022.09","2022.10","2022.10","2022.10","2022.11","2022.11","2022.11","2022.11","2022.11","2022.12","2022.12","2022.12","2022.12","2023.01","2023.01","2023.01","2023.01","2021.08","2021.08","2021.09","2021.09","2021.09","2021.10","2021.10","2021.10","2021.11","2021.11","2021.11","2021.11","2021.11","2021.12","2021.12","2022.01","2022.01","2022.01","2022.01","2022.08","2022.08","2022.08","2022.08","2022.08","2022.09","2022.09","2022.09","2022.09","2022.10","2022.10","2022.10","2022.10","2022.11","2022.11","2022.11","2022.11","2022.11","2022.12","2022.12","2022.12","2022.12","2023.01","2023.01","2023.01","2023.01"),
Species = c("A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A",
"A","A","A","A","A","A","A","A","A","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B",
"B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B"),
Site = c("Something","Something","Something","Something","Something","Something","Something","Something","Something","Something","Something",
"Something","Something","Something","Something","Something","Something","Something","Something","Something","Something","Something","Something",
"Something","Something","Something","Something","Something","Something","Something","Something","Something","Something","Something","Something",
"Something","Something","Something","Something","Something","Something","Something","Something","Something","Something","Something","Something",
"Something","Something","Something","Something","Something","Something","Something","Something","Something","Something","Something","Something",
"Something","Something","Something","Something","Something","Something","Something","Something","Something","Something","Something","Something",
"Something","Something","Something","Something","Something","Something","Something","Something","Something","Something","Something","Something",
"Something","Something","Something","Something"),
Mean = c("0.301029995663981","1.07918124604762","0.698970004336019","1.23044892137827","1.53147891704226","1.41497334797082","1.7160033436348",
         "0.698970004336019","1.39794000867204","1","0.301029995663981","0.301029995663981","0.477121254719662","0.301029995663981","0.301029995663981",
         "0.301029995663981","0.477121254719662","0.301029995663981","0.301029995663981","0.845098040014257","0.301029995663981","0.301029995663981",
         "0.477121254719662","0.698970004336019","1.23044892137827","1.41497334797082","1.95904139232109","1.5910646070265","1.53147891704226",
         "1.14612803567824","1.57978359661681","1.34242268082221","0.778151250383644","0.301029995663981","0.301029995663981","0.477121254719662",
         "0.301029995663981","1.20411998265592","0.845098040014257","1.17609125905568","1.20411998265592","0.698970004336019","0.301029995663981",
         "0.698970004336019","0.698970004336019","0.903089986991944","1.14612803567824","0.301029995663981","0.602059991327962","0.301029995663981",
         "0.845098040014257","0.698970004336019","0.698970004336019","0.301029995663981","0.698970004336019","0.301029995663981","0.301029995663981",
         "0.301029995663981","0.477121254719662","0.301029995663981","0.301029995663981","0.301029995663981","0.301029995663981","0.301029995663981",
         "0.602059991327962","0.301029995663981","0.845098040014257","1.92941892571429","1.27875360095283","0.698970004336019","1.38021124171161",
         "1.20411998265592","1.38021124171161","1.14612803567824","1","1.07918124604762","1.17609125905568","0.845098040014257","0.698970004336019",
         "0.778151250383644","0.301029995663981","0.845098040014257","1.64345267648619","1.46239799789896","1.34242268082221","1.34242268082221",
         "0.778151250383644"))

Next, I set the factors:

# Set factors
str(data)
data$Date<-as.factor(data$Date)
data$Site<-as.factor(data$Site)
data$Species<-as.factor(data$Species)
data$Mean<-as.numeric(data$Mean)
str(data)

When I try t.test():

compare_means(Mean ~ Species, data = data, group.by = "Date", method = "t.test")

This is the error:
Error in `mutate()`:
ℹ In argument: `p = purrr::map(...)`.
Caused by error in `purrr::map()`:
ℹ In index: 5.
ℹ With name: Date.2021.12.
Caused by error in `t.test.default()`:
! data are essentially constant
Run `rlang::last_trace()` to see where the error occurred.

Similarly, when I use this in ggplot:

ggplot(data, aes(x = Date, y = Mean, fill=Species)) +
  geom_boxplot()+
  stat_compare_means(data=data,method="t.test", label = "p.signif") +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))

Warning message:
Computation failed in `stat_compare_means()`
Caused by error in `mutate()`:
ℹ In argument: `p = purrr::map(...)`.
Caused by error in `purrr::map()`:
ℹ In index: 5.
ℹ With name: x.5.
Caused by error in `t.test.default()`:
! data are essentially constant 

What is the best solution that keeps the data usable in a t-test?

jpsmith
Thend
  • 1
    Can you explain why this would be useful? I think most people would accept that data that is the same is the same, that the t.test results are not worth having, and move on. If you are looking for a way to automate tests and merely want to avoid script failures, there are well-established ways of doing so, such as `tryCatch` or the `safely` wrapper provided by library(purrr). You should probably look at those instead. – Nir Graham Jun 01 '23 at 13:30
  • 1
    At first read this seems to violate fundamentals of data analysis and would be a very concerning practice. Why do you want to do this? – jpsmith Jun 01 '23 at 13:31
  • 4
    This isn't about no difference between the groups. The problem is that when `Date` is `2021.12`, there is no variation on `Mean` within either `Species` value. That means there is no information in the data to estimate the sampling variability of the mean within the species. As the other two comments suggest, arbitrarily inducing such variation isn't a good idea. – DaveArmstrong Jun 01 '23 at 13:35
  • 1
    Emphasizing DaveArmstrong's point, t-tests run fine when there is no difference between groups. We can run a t-test where the data for each group is identical: `x = rnorm(10); t.test(x, x)` will compute just fine and predictably give a p-value of 1: no evidence for a difference in means. – Gregor Thomas Jun 01 '23 at 13:46
  • @DaveArmstrong Is there a way, I mean an R function, to find the groups where there is no variation? In my real situation, if I have a data frame with 1 million rows, how can I find these specific values? (If there is a way to find them, I can just alter those.) – Thend Jun 01 '23 at 13:47
  • 2
    @Thend sure you can find them. Calculate the standard deviation or the variance and the groups with 0 standard deviation (or variance) are the ones with no variation. But **do not alter the data**, that is fraud. Instead, just don't do t-tests on those groups. – Gregor Thomas Jun 01 '23 at 13:52
  • 2
    Ultimately this is a statistical issue, not a specific R programming question. For statistical advice, you should ask for help at [stats.se]. Don't think about it as looking for the "right R function"; you need to find the "appropriate statistical method". – MrFlick Jun 01 '23 at 14:07
  • 1
    `data %>% group_by(Date, Species) %>% summarise(s=sd(Mean)) %>% filter(s == 0)` will find the groups with no variance. – DaveArmstrong Jun 01 '23 at 14:13
  • @DaveArmstrong This is a great idea, I will do accordingly. I really appreciate the help! – Thend Jun 01 '23 at 16:38
  • @GregorThomas I agree, I won't change the data; that would greatly affect the outcome of my downstream analysis. Thanks for the help! – Thend Jun 01 '23 at 16:41
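
The distinction drawn in the comments (no difference between groups versus no variation within groups) can be checked with a small base R sketch. The values below are made up for illustration and are not taken from the question's data; `same_groups` and `constant_msg` are hypothetical names.

```r
# Illustrative values only; not taken from the question's data.
set.seed(1)
x <- rnorm(10)

# Identical groups, but each has within-group spread: the test runs
# and finds no evidence of a difference in means.
same_groups <- t.test(x, x)
same_groups$p.value

# Both groups constant: the estimated standard error is 0, so t.test()
# stops. tryCatch() captures the error message instead of aborting.
constant_msg <- tryCatch(
  t.test(c(0.3, 0.3), c(0.3, 0.3))$p.value,
  error = function(e) conditionMessage(e)
)
constant_msg
```

The second call mirrors the `2021.12` situation described above, where every value within each species is the same.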

1 Answer

1

Finding the sd of Mean for each Date-Species combination and then filtering out any Dates where any sd is 0 will do the trick. You could even just pipe the filtered data to compare_means():

library(dplyr)
library(ggpubr)
data <- data.frame(Date = c("2021.08","2021.08","2021.09","2021.09","2021.09","2021.10","2021.10","2021.10","2021.11","2021.11","2021.11","2021.11","2021.11","2021.12","2021.12","2022.01","2022.01","2022.01","2022.01","2022.08","2022.08","2022.08","2022.08","2022.08","2022.09","2022.09","2022.10","2022.10","2022.10","2022.11","2022.11","2022.11","2022.11","2022.11","2022.12","2022.12","2022.12","2022.12","2023.01","2023.01","2023.01","2023.01","2021.08","2021.08","2021.09","2021.09","2021.09","2021.10","2021.10","2021.10","2021.11","2021.11","2021.11","2021.11","2021.11","2021.12","2021.12","2022.01","2022.01","2022.01","2022.01","2022.08","2022.08","2022.08","2022.08","2022.08","2022.09","2022.09","2022.09","2022.09","2022.10","2022.10","2022.10","2022.10","2022.11","2022.11","2022.11","2022.11","2022.11","2022.12","2022.12","2022.12","2022.12","2023.01","2023.01","2023.01","2023.01"),
                   Species = c("A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A",
                               "A","A","A","A","A","A","A","A","A","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B",
                               "B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B"),
                   Site = c("Something","Something","Something","Something","Something","Something","Something","Something","Something","Something","Something",
                            "Something","Something","Something","Something","Something","Something","Something","Something","Something","Something","Something","Something",
                            "Something","Something","Something","Something","Something","Something","Something","Something","Something","Something","Something","Something",
                            "Something","Something","Something","Something","Something","Something","Something","Something","Something","Something","Something","Something",
                            "Something","Something","Something","Something","Something","Something","Something","Something","Something","Something","Something","Something",
                            "Something","Something","Something","Something","Something","Something","Something","Something","Something","Something","Something","Something",
                            "Something","Something","Something","Something","Something","Something","Something","Something","Something","Something","Something","Something",
                            "Something","Something","Something","Something"),
                   Mean = c("0.301029995663981","1.07918124604762","0.698970004336019","1.23044892137827","1.53147891704226","1.41497334797082","1.7160033436348",
                            "0.698970004336019","1.39794000867204","1","0.301029995663981","0.301029995663981","0.477121254719662","0.301029995663981","0.301029995663981",
                            "0.301029995663981","0.477121254719662","0.301029995663981","0.301029995663981","0.845098040014257","0.301029995663981","0.301029995663981",
                            "0.477121254719662","0.698970004336019","1.23044892137827","1.41497334797082","1.95904139232109","1.5910646070265","1.53147891704226",
                            "1.14612803567824","1.57978359661681","1.34242268082221","0.778151250383644","0.301029995663981","0.301029995663981","0.477121254719662",
                            "0.301029995663981","1.20411998265592","0.845098040014257","1.17609125905568","1.20411998265592","0.698970004336019","0.301029995663981",
                            "0.698970004336019","0.698970004336019","0.903089986991944","1.14612803567824","0.301029995663981","0.602059991327962","0.301029995663981",
                            "0.845098040014257","0.698970004336019","0.698970004336019","0.301029995663981","0.698970004336019","0.301029995663981","0.301029995663981",
                            "0.301029995663981","0.477121254719662","0.301029995663981","0.301029995663981","0.301029995663981","0.301029995663981","0.301029995663981",
                            "0.602059991327962","0.301029995663981","0.845098040014257","1.92941892571429","1.27875360095283","0.698970004336019","1.38021124171161",
                            "1.20411998265592","1.38021124171161","1.14612803567824","1","1.07918124604762","1.17609125905568","0.845098040014257","0.698970004336019",
                            "0.778151250383644","0.301029995663981","0.845098040014257","1.64345267648619","1.46239799789896","1.34242268082221","1.34242268082221",
                            "0.778151250383644"))
data$Date<-as.factor(data$Date)
data$Site<-as.factor(data$Site)
data$Species<-as.factor(data$Species)
data$Mean<-as.numeric(data$Mean)

data %>% 
  group_by(Date, Species) %>% 
  mutate(s = sd(Mean)) %>% 
  group_by(Date) %>%
  filter(!any(s == 0)) %>% 
  compare_means(Mean ~ Species, data = ., group.by = "Date", method = "t.test")
#> # A tibble: 11 × 9
#>    Date    .y.   group1 group2      p p.adj p.format p.signif method
#>    <fct>   <chr> <chr>  <chr>   <dbl> <dbl> <chr>    <chr>    <chr> 
#>  1 2021.08 Mean  A      B      0.718   1    0.718    ns       T-test
#>  2 2021.09 Mean  A      B      0.451   1    0.451    ns       T-test
#>  3 2021.10 Mean  A      B      0.0889  0.89 0.089    ns       T-test
#>  4 2021.11 Mean  A      B      0.850   1    0.850    ns       T-test
#>  5 2022.01 Mean  A      B      1       1    1.000    ns       T-test
#>  6 2022.08 Mean  A      B      0.234   1    0.234    ns       T-test
#>  7 2022.09 Mean  A      B      0.670   1    0.670    ns       T-test
#>  8 2022.10 Mean  A      B      0.0707  0.78 0.071    ns       T-test
#>  9 2022.11 Mean  A      B      0.783   1    0.783    ns       T-test
#> 10 2022.12 Mean  A      B      0.399   1    0.399    ns       T-test
#> 11 2023.01 Mean  A      B      0.255   1    0.255    ns       T-test

Created on 2023-06-01 with reprex v2.0.2
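
As an aside that is not part of the reprex above: the `tryCatch` route mentioned in the comments could look roughly like the base R sketch below. It leaves the data untouched and records `NA` for any date where `t.test()` fails. The helper `safe_t_p` and the `toy` data frame are hypothetical, invented only for illustration, and the p-values are unadjusted (unlike the `p.adj` column from `compare_means()`).

```r
# Hypothetical helper: Welch t-test p-value for one date's rows,
# or NA if t.test() raises an error (e.g. "data are essentially constant").
safe_t_p <- function(d) {
  tryCatch(
    t.test(Mean ~ Species, data = d)$p.value,
    error = function(e) NA_real_
  )
}

# Toy data (made up): date "d2" is constant within both species.
toy <- data.frame(
  Date    = rep(c("d1", "d2"), each = 6),
  Species = rep(rep(c("A", "B"), each = 3), times = 2),
  Mean    = c(0.3, 0.7, 1.2, 0.5, 0.9, 1.4,
              0.3, 0.3, 0.3, 0.3, 0.3, 0.3)
)

# One unadjusted p-value per date; the constant date gives NA, not an error.
p_by_date <- sapply(split(toy, toy$Date), safe_t_p)
p_by_date
```

Compared with filtering on `sd(Mean) == 0` beforehand, this keeps the failing dates visible in the result, which may be useful when reporting which comparisons could not be run.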

DaveArmstrong