R: Is there a method for finding all possible values of a categorical variable in a data frame when rows can belong to more than one category?

Question

I'm trying to inspect a dataset to understand all of the different categorical values a column can take on.

The actual dataset I'm using has 100,000+ rows, and I have no idea what's in it.

To illustrate, for the following df:

a <- 1:10
b <- c("a,b", "c,d", "c", "c", "a", "a,d", "b,d", "c", "c", "a")
example_df <- data.frame(a, b)
example_df

I would like a function that will return: a,b,c,d

I have tried using the "unique" function, but this doesn't work because it returns the combinations rather than the individual categories:

uni <- unique(example_df$b)
uni
[1] a,b c,d c   a   a,d b,d
Levels: a a,b a,d b,d c c,d

Does anyone know of a solution for this?

akrun · Answer 1 · 2020-11-21T23:36:30.923

1

We can split the 'b' column on "," into a list, unlist it to a vector, and get the unique elements. The as.character call is needed when 'b' is a factor:

unique(unlist(strsplit(as.character(example_df$b), ",")))
#[1] "a" "b" "c" "d"
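If it also helps to know how often each category occurs, the same split can feed table(). The following is only a minimal sketch under stated assumptions: it rebuilds the question's toy data with stringsAsFactors = TRUE to mimic a factor column, and the trimws() step is an extra precaution (not part of the original answer) in case the real data has spaces after the commas.

```r
# Toy data from the question; stringsAsFactors = TRUE is an assumption
# made here to mimic the factor column shown in the question's output.
b <- c("a,b", "c,d", "c", "c", "a", "a,d", "b,d", "c", "c", "a")
example_df <- data.frame(a = 1:10, b = b, stringsAsFactors = TRUE)

# Split every entry on literal commas and flatten to one character vector
parts <- unlist(strsplit(as.character(example_df$b), ",", fixed = TRUE))

# Drop stray whitespace around each piece (e.g. "a, b" in real data)
parts <- trimws(parts)

# Distinct categories, sorted for readability
categories <- sort(unique(parts))

# Number of rows in which each category appears
counts <- table(parts)
```

Here `categories` holds the individual labels and `counts` gives a per-label tally, which may be a quicker way to get a feel for a 100,000+ row column than the labels alone.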

edited Nov 21 '20 at 23:36

answered Nov 21 '20 at 23:26


Thanks! I tried this, but received this error: "non-character argument". Do you have any insight about this? – PortMadeleineCrumpet Nov 21 '20 at 23:35
@PortMadeleineCrumpet Maybe you have a `factor` column. You can change it to `character` with `as.character`. Updated the post. From R 4.0, `stringsAsFactors = FALSE` is the default. – akrun Nov 21 '20 at 23:37
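
The error discussed in these comments can be reproduced on a small factor. This is a hedged illustration only; the variable names are made up for the example, and tryCatch() is used simply so the script keeps running after the error.

```r
# A tiny factor standing in for a column read with stringsAsFactors = TRUE
b_factor <- factor(c("a,b", "c"))

# strsplit() only accepts character input, so a factor triggers an error;
# tryCatch() captures the message instead of stopping the script
msg <- tryCatch(
  strsplit(b_factor, ","),
  error = function(e) conditionMessage(e)
)

# Converting to character first makes the split work
fixed <- unique(unlist(strsplit(as.character(b_factor), ",")))
```

Checking `is.factor(example_df$b)` on the real data is a quick way to confirm whether this is the cause.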

score 0 · Answer 2 · answered Nov 22 '20 at 01:34

You can use separate_rows (from tidyr) to split the data into separate rows, then use distinct to get the unique values.

library(dplyr)

example_df %>%
  mutate(b = as.character(b)) %>%
  tidyr::separate_rows(b,sep = ',') %>%
  distinct(b)

#   b    
#  <chr>
#1 a    
#2 b    
#3 c    
#4 d
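
If a plain character vector is more convenient than a one-column data frame, the pipeline can end with pull(). This is a sketch that assumes dplyr and tidyr are installed; it rebuilds the question's data with stringsAsFactors = FALSE, so the as.character step is not needed here.

```r
library(dplyr)
library(tidyr)

# Toy data from the question, kept as character (an assumption for this sketch)
example_df <- data.frame(
  a = 1:10,
  b = c("a,b", "c,d", "c", "c", "a", "a,d", "b,d", "c", "c", "a"),
  stringsAsFactors = FALSE
)

cats <- example_df %>%
  separate_rows(b, sep = ",") %>%  # one row per individual category
  distinct(b) %>%                  # keep each category once
  pull(b)                          # extract the column as a vector
```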
