2

I have a data frame with many categorical columns. I would like to count the number of distinct categories not equal to "bla". So for example:

> d1
# A tibble: 5 x 2
    x      y    
  <chr>  <chr>
1 yellow A    
2 green  A    
3 green  bla  
4 blue   B    
5 bla    bla  

How can I modify dplyr's

d1 %>% summarise_all(n_distinct)

to exclude the category "bla"? In this case, the answer should be 3 for column x and 2 for column y.

Jaap
  • 81,064
  • 34
  • 182
  • 193
Omry Atia
  • 2,411
  • 2
  • 14
  • 27

3 Answers3

3

We can use filter_all to filter the rows from all the columns and then use n_distinct to get length of unique values.

library(dplyr)

d1 %>% 
   filter_all(all_vars(. != "bla")) %>% 
    summarise_all(n_distinct)

#  x y
#1 3 2
zx8754
  • 52,746
  • 12
  • 114
  • 209
Ronak Shah
  • 377,200
  • 20
  • 156
  • 213
1

Using base::lengths():

lengths(lapply(d1, function(i) unique(i[ i != "bla" ])))
# x y 
# 3 2 
zx8754
  • 52,746
  • 12
  • 114
  • 209
0

Using data table

library(data.table)

d1 <- data.table(d1)

d1[!y %like% 'bla', .(count = .N, distinct = uniqueN(x)), by = .(y)]
fidelin
  • 310
  • 1
  • 8