2

I have a dataset I'm working on, and one of the columns contains multiple features that are separated by a comma. The number of features in each observation varies.

df <- data.frame(x=c("a", "a,b,c", "a,c", "b,c", "", "b"))

 x
1       a
2   a,b,c
3     a,c
4     b,c
5      
6       b

I want to split this into multiple logical columns like this:

  a b c
1 1 0 0
2 1 1 1
3 1 0 1
4 0 1 1
5 0 0 0
6 0 1 0

where each column would represent if the observation contained that string in the original column. How can this be achieved? Is there a way to do it without specifying the output columns? For example, what if an observation contains:

"a,b,d"

How can I do it in a way that captures all unique features of the original column?

4HTP
  • 21
  • 1

2 Answers2

1

First split each item into the list s and compute the unique levels levs. Then use outer to create the desired matrix tab and add column names.

s <- strsplit(as.character(df$x), ",")
levs <- unique(unlist(s))
tab <- outer(s, levs, Vectorize(function(x, y) y %in% x)) + 0
colnames(tab) <- levs

giving:

> tab
     a b c
[1,] 1 0 0
[2,] 1 1 1
[3,] 1 0 1
[4,] 0 1 1
[5,] 0 0 0
[6,] 0 1 0
G. Grothendieck
  • 254,981
  • 17
  • 203
  • 341
0
d=strsplit(as.character(df$x),",")
> m=xtabs(z~x+y,data.frame(x=rep(df$x,lengths(d)),y=unlist(d),z=1))
> as.data.frame.matrix(m)
      a b c
      0 0 0
a     1 0 0
a,b,c 1 1 1
a,c   1 0 1
b     0 1 0
b,c   0 1 1
Onyambu
  • 67,392
  • 3
  • 24
  • 53