Splitting a character column containing multiple logical features of different length into separate logical columns

Question

I have a dataset I'm working on, and one of the columns contains multiple features that are separated by a comma. The number of features in each observation varies.

df <- data.frame(x=c("a", "a,b,c", "a,c", "b,c", "", "b"))

 x
1       a
2   a,b,c
3     a,c
4     b,c
5      
6       b

I want to split this into multiple logical columns like this:

where each column would represent if the observation contained that string in the original column. How can this be achieved? Is there a way to do it without specifying the output columns? For example, what if an observation contains:

"a,b,d"

How can I do it in a way that captures all unique features of the original column?

G. Grothendieck · Answer 1 · 2018-03-22T00:30:35.970

1

First split each item into the list s and compute the unique levels levs. Then use outer to create the desired matrix tab and add column names.

s <- strsplit(as.character(df$x), ",")
levs <- unique(unlist(s))
tab <- outer(s, levs, Vectorize(function(x, y) y %in% x)) + 0
colnames(tab) <- levs

giving:

> tab
     a b c
[1,] 1 0 0
[2,] 1 1 1
[3,] 1 0 1
[4,] 0 1 1
[5,] 0 0 0
[6,] 0 1 0

edited Mar 22 '18 at 00:30

answered Mar 22 '18 at 00:18

G. Grothendieck

254,981
17
203
341

Wow that is exactly what I was trying to accomplish, thank you so much! – 4HTP Mar 22 '18 at 00:29

score 0 · Answer 2 · answered Mar 22 '18 at 00:43

0

d=strsplit(as.character(df$x),",")
> m=xtabs(z~x+y,data.frame(x=rep(df$x,lengths(d)),y=unlist(d),z=1))
> as.data.frame.matrix(m)
      a b c
      0 0 0
a     1 0 0
a,b,c 1 1 1
a,c   1 0 1
b     0 1 0
b,c   0 1 1

answered Mar 22 '18 at 00:43

Onyambu

67,392
3
24
53

Splitting a character column containing multiple logical features of different length into separate logical columns

2 Answers2