5

I have a dataframe with the following structure

test <- data.frame(col = c('a; ff; cc; rr;', 'rr; a; cc; e;'))

Now I want to create a dataframe from this which contains a named column for each of the unique values in the test dataframe. A unique value is a value ended by the ';' character and starting with a space, not including the space. Then for each of the rows in the column I wish to fill the dummy columns with either a 1 or a 0. As given below

data.frame(a = c(1,1), ff = c(1,0), cc = c(1,1), rr = c(1,0), e = c(0,1))

  a ff cc rr e
1 1  1  1  1 0
2 1  0  1  1 1

I tried creating a df using for loops and the unique values in the column but it's getting to messy. I have a vector available containing the unique values of the column. The problem is how to create the ones and zeros. I tried some mutate_all() function with grep() but this did not work.

Sotos
  • 51,121
  • 6
  • 32
  • 66
Michael
  • 1,281
  • 1
  • 17
  • 32

7 Answers7

9

I'd use splitstackshape and mtabulate from qdapTools packages to get this as a one liner, i.e.

library(splitstackshape)
library(qdapTools)

mtabulate(as.data.frame(t(cSplit(test, 'col', sep = ';', 'wide'))))
#   a cc ff rr e
#V1 1  1  1  1 0
#V2 1  1  0  1 1

It can also be full splitstackshape as @A5C1D2H2I1M1N2O1R2T1 mentions in comments,

cSplit_e(test, "col", ";", mode = "binary", type = "character", fill = 0)
Sotos
  • 51,121
  • 6
  • 32
  • 66
6

Here's a possible data.table implementation. First we split the rows into columns, melt into a single column and the spread it wide while counting the events for each row

library(data.table)
test2 <- setDT(test)[, tstrsplit(col, "; |;")]
dcast(melt(test2, measure = names(test2)), rowid(variable) ~ value, length)
#    variable a cc e ff rr
# 1:        1 1  1 0  1  1
# 2:        2 1  1 1  0  1
David Arenburg
  • 91,361
  • 17
  • 137
  • 196
4

Here's a base R approach:

x   <- strsplit(as.character(test$col), ";\\s?") # split the strings
lvl <- unique(unlist(x))                         # get unique elements
x   <- lapply(x, factor, levels = lvl)           # convert to factor
t(sapply(x, table))                              # count elements and transpose
#     a ff cc rr e
#[1,] 1  1  1  1 0
#[2,] 1  0  1  1 1
talat
  • 68,970
  • 21
  • 126
  • 157
3

We can do this with tidyverse

library(tidyverse)
rownames_to_column(test, 'grp') %>%
        separate_rows(col) %>% 
        filter(col!="")  %>% 
        count( grp, col) %>%
        spread(col, n, fill = 0) %>%
        ungroup() %>% 
        select(-grp)
# A tibble: 2 × 5
#      a    cc     e    ff    rr
#* <dbl> <dbl> <dbl> <dbl> <dbl>
#1     1     1     0     1     1
#2     1     1     1     0     1
akrun
  • 874,273
  • 37
  • 540
  • 662
3

Here is a base R solution. First remove the space. Get all the unique combination. Split the actual data frame and then check presence of it in the cols which will have all the combo. Then you get a logical matrix which can be easily converted into numeric.

test=as.data.frame(apply(test,2,function(x)gsub('\\s+', '',x)))
cols=unique(unlist(strsplit(as.character(test$col), split = ';'))) 
yy=strsplit(as.character(test$col), split = ';') 
z=as.data.frame(do.call.rbind(lapply(yy, function(x) cols %in% x)))
names(z)=cols
z=as.data.frame(lapply(z, as.integer))
Chirayu Chamoli
  • 2,076
  • 1
  • 17
  • 32
3

Another approach with tidytext and tidyverse

library(tidyverse)
library(tidytext) #for unnest_tokens()
df <- test %>%
    unnest_tokens(word, col) %>%
    rownames_to_column(var="row") %>%
    mutate(row = floor(parse_number(row)),
           val = 1) %>%
    spread(word, val, fill = 0) %>%
    select(-row)
df
#    a cc e ff rr
#1   1  1 0  1  1
#2   1  1 1  0  1
Karthik Arumugham
  • 1,300
  • 1
  • 11
  • 18
1

Another simple solution without any extra packages:

x =  c('a; ff; cc; rr;', 'rr; a; cc; e;')
G = lapply(strsplit(x,';'), trimws)
dict = sort(unique(unlist(G)))
do.call(rbind, lapply(G, function(g) 1*sapply(dict, function(d) d %in% g)))
Oleg Melnikov
  • 3,080
  • 3
  • 34
  • 65