Dummify character column and find unique values

Question

I have a dataframe with the following structure

test <- data.frame(col = c('a; ff; cc; rr;', 'rr; a; cc; e;'))

Now I want to create a dataframe from this which contains a named column for each of the unique values in the test dataframe. A unique value is a value ended by the ';' character and starting with a space, not including the space. Then for each of the rows in the column I wish to fill the dummy columns with either a 1 or a 0. As given below

data.frame(a = c(1,1), ff = c(1,0), cc = c(1,1), rr = c(1,0), e = c(0,1))

  a ff cc rr e
1 1  1  1  1 0
2 1  0  1  1 1

I tried creating a df using for loops and the unique values in the column but it's getting to messy. I have a vector available containing the unique values of the column. The problem is how to create the ones and zeros. I tried some mutate_all() function with grep() but this did not work.

You should have `rr` twice IIRC. – David Arenburg Feb 22 '17 at 09:50 — David Arenburg, Feb 22 '17 at 09:50

Sotos · Answer 1 · 2017-03-16T14:48:04.533

9

I'd use splitstackshape and mtabulate from qdapTools packages to get this as a one liner, i.e.

library(splitstackshape)
library(qdapTools)

mtabulate(as.data.frame(t(cSplit(test, 'col', sep = ';', 'wide'))))
#   a cc ff rr e
#V1 1  1  1  1 0
#V2 1  1  0  1 1

It can also be full splitstackshape as @A5C1D2H2I1M1N2O1R2T1 mentions in comments,

cSplit_e(test, "col", ";", mode = "binary", type = "character", fill = 0)

edited Mar 16 '17 at 14:48

answered Feb 22 '17 at 10:40

Sotos

51,121
6
32
66

1

Nice one liner. – Chirayu Chamoli Feb 22 '17 at 11:52
1

The actual 1-liner would be `cSplit_e(test, "col", ";", mode = "binary", type = "character", fill = 0)` :-) – A5C1D2H2I1M1N2O1R2T1 Feb 23 '17 at 12:30

score 6 · Answer 2 · answered Feb 22 '17 at 09:51

Here's a possible data.table implementation. First we split the rows into columns, melt into a single column and the spread it wide while counting the events for each row

library(data.table)
test2 <- setDT(test)[, tstrsplit(col, "; |;")]
dcast(melt(test2, measure = names(test2)), rowid(variable) ~ value, length)
#    variable a cc e ff rr
# 1:        1 1  1 0  1  1
# 2:        2 1  1 1  0  1

talat · Answer 3 · 2017-02-22T10:31:31.760

4

Here's a base R approach:

x   <- strsplit(as.character(test$col), ";\\s?") # split the strings
lvl <- unique(unlist(x))                         # get unique elements
x   <- lapply(x, factor, levels = lvl)           # convert to factor
t(sapply(x, table))                              # count elements and transpose
#     a ff cc rr e
#[1,] 1  1  1  1 0
#[2,] 1  0  1  1 1

edited Feb 22 '17 at 10:31

answered Feb 22 '17 at 10:11

talat

68,970
21
126
157

score 3 · Answer 4 · answered Feb 22 '17 at 09:55

We can do this with tidyverse

library(tidyverse)
rownames_to_column(test, 'grp') %>%
        separate_rows(col) %>% 
        filter(col!="")  %>% 
        count( grp, col) %>%
        spread(col, n, fill = 0) %>%
        ungroup() %>% 
        select(-grp)
# A tibble: 2 × 5
#      a    cc     e    ff    rr
#* <dbl> <dbl> <dbl> <dbl> <dbl>
#1     1     1     0     1     1
#2     1     1     1     0     1

score 3 · Answer 5 · answered Feb 22 '17 at 10:10

Here is a base R solution. First remove the space. Get all the unique combination. Split the actual data frame and then check presence of it in the cols which will have all the combo. Then you get a logical matrix which can be easily converted into numeric.

test=as.data.frame(apply(test,2,function(x)gsub('\\s+', '',x)))
cols=unique(unlist(strsplit(as.character(test$col), split = ';'))) 
yy=strsplit(as.character(test$col), split = ';') 
z=as.data.frame(do.call.rbind(lapply(yy, function(x) cols %in% x)))
names(z)=cols
z=as.data.frame(lapply(z, as.integer))

Karthik Arumugham · Answer 6 · 2017-02-22T10:14:45.257

3

Another approach with tidytext and tidyverse

library(tidyverse)
library(tidytext) #for unnest_tokens()
df <- test %>%
    unnest_tokens(word, col) %>%
    rownames_to_column(var="row") %>%
    mutate(row = floor(parse_number(row)),
           val = 1) %>%
    spread(word, val, fill = 0) %>%
    select(-row)
df
#    a cc e ff rr
#1   1  1 0  1  1
#2   1  1 1  0  1

edited Feb 22 '17 at 10:14

answered Feb 22 '17 at 10:12

Karthik Arumugham

1,300
1
11
18

score 1 · Answer 7 · answered Feb 16 '18 at 01:14

1

Another simple solution without any extra packages:

x =  c('a; ff; cc; rr;', 'rr; a; cc; e;')
G = lapply(strsplit(x,';'), trimws)
dict = sort(unique(unlist(G)))
do.call(rbind, lapply(G, function(g) 1*sapply(dict, function(d) d %in% g)))

answered Feb 16 '18 at 01:14

Oleg Melnikov

3,080
3
34
65

Dummify character column and find unique values

7 Answers7

Linked

Related