6

I want to subset my df to include only columns that include a certain value in any row.

for example, if I have:

year = c(1990,1991,1992,1993,1994,1995,1996,1997,1998,1999)
apple = c(1,4,6,8,9,9,2,4,7,4)
orange = c(7,1,5,5,2,1,7,1,3,8)
banana = c(9,9,4,8,1,3,6,7,5,9)
lemon = c(8,3,3,3,2,5,6,7,2,4)
df = data.frame(year,apple,orange,banana,lemon)

df

I want to select only the columns that have a 9 anywhere in the column so that my df would become only include the apple and banana columns.

Is this possible? All the answers I've found so far only enable selecting columns based on the column name, but I want to select based on cell values within the column. Thank you!

Emily
  • 75
  • 5

4 Answers4

6

We can pass a function in select within where - check whether the column is numeric and if that is numeric, check whether there are any value equal to 9. In addition can change the any(.x ==9) to 9 %in% .x.

library(dplyr)
df %>% 
  select(where(~is.numeric(.x) && any(.x == 9)))

-output

 apple banana
1      1      9
2      4      9
3      6      4
4      8      8
5      9      1
6      9      3
7      2      6
8      4      7
9      7      5
10     4      9
akrun
  • 874,273
  • 37
  • 540
  • 662
  • So `where` is necessary with select to quasi "filter" columns. Interesting that there is no `filter` for columns?! – TarJae Jun 21 '22 at 16:52
  • 1
    @TarJae in `dplyr`, `select` does the selection of columns, whereas `filter` does the subsetting of rows. In `subset` from `base R`, also, we have `subset` and `select` for row/column. whereas `Filter` in `base R`, does the selection of columns if it is a data.frame – akrun Jun 21 '22 at 16:54
  • 1
    @TarJae according to `?where` `where() takes a function and returns all variables for which the function returns TRUE:` – akrun Jun 21 '22 at 16:56
3

base R option using Filter:

Filter(function(x) any(x == 9), df)

Output:

   apple banana
1      1      9
2      4      9
3      6      4
4      8      8
5      9      1
6      9      3
7      2      6
8      4      7
9      7      5
10     4      9
Quinten
  • 35,235
  • 5
  • 20
  • 53
3

This one is long and too verbose and the perfect answer is already provided by @akrun. Therefore here an alternative approach: What we do here is mutate across each column except year check if they contain a 9 if so then put the name of column to a new column called x,

then use any_of with select

library(dplyr)
library(tidyr)

df %>% 
  mutate(across(-year, ~case_when(. == 9 ~ cur_column()), .names = 'new_{col}')) %>%
  unite(x, starts_with('new'), na.rm = TRUE, sep = ' ') %>% 
  select(any_of(x))


   banana apple
1       9     1
2       9     4
3       4     6
4       8     8
5       1     9
6       3     9
7       6     2
8       7     4
9       5     7
10      9     4
TarJae
  • 72,363
  • 6
  • 19
  • 66
3

Probably we can use colMeans like below, where the mean value should be non-zero if there exists at least one 9 in the column

> df[colMeans(df == 9) > 0]
   apple banana
1      1      9
2      4      9
3      6      4
4      8      8
5      9      1
6      9      3
7      2      6
8      4      7
9      7      5
10     4      9
ThomasIsCoding
  • 96,636
  • 9
  • 24
  • 81