How to remove columns that contain all the same value

Question

I have presence/absence data (1/0, one column per gene) for various genes in different samples, and each sample belongs to one of two categories. I am running Fisher's exact test (fisher.test) for each gene, but I get an error whenever a gene is present (1) in all samples or absent (0) from all samples. How can I remove or skip these columns, or have fisher.test ignore these genes and keep going?

Here is my sample data:

mydata <- data.frame(sampleID = c("A", "B", "C", "D", "E", "F", "G"),
                     category = c("high", "low", "high", "high", "low", "high", "low"),
                     Gene1 = c(1, 1, 0, 0, 0, 1, 1),
                     Gene2 = c(0, 1, 1, 1, 1, 1, 0),
                     Gene3 = c(0, 0, 0, 1, 1, 1, 1),
                     Gene4 = c(1, 1, 1, 1, 1, 1, 1))

Here is the loop code that someone helped me design, which applies the fisher.test to each gene:

library(dplyr)
library(tidyr)
library(broom)

mydata %>%
  select(-sampleID) %>%
  pivot_longer(cols = -category, names_to = "gene") %>%
  group_by(gene) %>%
  summarise(fisher_test = list(tidy(fisher.test(table(category, value))))) %>%
  unnest(fisher_test) %>%
  mutate(odds_ratio = exp(estimate)) %>% 
  select(-method, -alternative)

The error message I get when it encounters a gene that is present or absent from all samples:

Caused by error in `fisher.test()`:
! 'x' must have at least 2 rows and columns
Run `rlang::last_error()` to see where the error occurred.

Where can I insert this step into the loop above?

Note: It is not feasible to omit the genes manually, as there are hundreds of them.

With your example, I couldn't reproduce the error with that code. For your large data, maybe try `summarise(fisher_test = if(n() >2) list(tidy(fisher.test(table(category, value)))) else list(NA))`. (It is better to provide an example that shows the error so that we can test it.) — akrun, Mar 14 '23 at 16:36
@akrun, try the data set with Gene4 (added). Your solution gave me the same error. — ABee, Mar 14 '23 at 16:39

akrun · Accepted Answer · 2023-03-14T17:00:40.747

We could add a select() step at the top to remove any numeric column that has only a single unique value (n_distinct(.x) == 1):

library(dplyr)
library(tidyr)
library(broom)  # provides tidy()

mydata %>%
  # drop numeric columns that hold a single unique value, plus the sample ID
  select(!where(~ is.numeric(.x) && n_distinct(.x) == 1), -sampleID) %>%
  pivot_longer(cols = -category, names_to = "gene") %>%
  group_by(gene) %>%
  summarise(fisher_test = list(tidy(fisher.test(table(category, value))))) %>%
  unnest(fisher_test) %>%
  mutate(odds_ratio = exp(estimate)) %>%
  select(-method, -alternative)

Output:

# A tibble: 3 × 6
  gene  estimate p.value conf.low conf.high odds_ratio
  <chr>    <dbl>   <dbl>    <dbl>     <dbl>      <dbl>
1 Gene1    1.81        1  0.0469      176.        6.11
2 Gene2    0.707       1  0.00640      78.2       2.03
3 Gene3    1.81        1  0.0469      176.        6.11
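
If you would rather leave the columns in the data frame and have the pipeline skip constant genes, one option is to filter inside each gene group before fisher.test is called. This is only a minimal sketch of that idea using the sample data from the question; the object name `result` is my own, and I have left out the exp() step because fisher.test already reports the odds ratio in `estimate` for 2 x 2 tables.

```r
library(dplyr)
library(tidyr)
library(broom)

# Sample data from the question; Gene4 is 1 in every sample
mydata <- data.frame(sampleID = c("A", "B", "C", "D", "E", "F", "G"),
                     category = c("high", "low", "high", "high", "low", "high", "low"),
                     Gene1 = c(1, 1, 0, 0, 0, 1, 1),
                     Gene2 = c(0, 1, 1, 1, 1, 1, 0),
                     Gene3 = c(0, 0, 0, 1, 1, 1, 1),
                     Gene4 = c(1, 1, 1, 1, 1, 1, 1))

result <- mydata %>%
  select(-sampleID) %>%
  pivot_longer(cols = -category, names_to = "gene") %>%
  group_by(gene) %>%
  # keep only genes whose values vary, so table() has two columns
  filter(n_distinct(value) > 1) %>%
  summarise(fisher_test = list(tidy(fisher.test(table(category, value))))) %>%
  unnest(fisher_test) %>%
  select(-method, -alternative)

# Gene4 is skipped; Gene1, Gene2 and Gene3 remain
result$gene
```

The filter runs per gene, so it scales to hundreds of genes without listing any of them by hand. It drops the constant genes from the result rather than returning NA rows for them.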
