2

My data is in the shape

Event Id Var1 Var2 Var3
1      a   x    w    y
2      a   z    y    w
3      b   x    y    q

and I need to create multi-hot encoded vectors for each row in the table, considering all the values appearing in Var1, Var2 and Var3. Meaning that the desired output would be:

Event Id  x  y  z  w  q
1      a  1  1  0  1  0
2      a  0  1  1  1  0
3      b  1  1  0  0  1

Meaning that I keep the same number of rows of the initial dataset, I only add for each row a number of columns equal to all the unique factors among Var 1, Var 2 and Var3.

I tried all aproaches I could think of, but nothing seems to work so far..

Any idea?

M_Stones
  • 23
  • 3

2 Answers2

1

We could gather into 'long' format and then do a spread

library(tidyverse)
gather(df1, key, val, Var1:Var3) %>%
   mutate(n = 1) %>%
   select(-key) %>% 
   spread(val, n, fill = 0)

Or in a compact way

library(qdapTools)
cbind(df1[1:2], +(mtabulate(df1[-(1:2)]) > 0))
akrun
  • 874,273
  • 37
  • 540
  • 662
1

You can use data.table -

dt <- read.table(text="Event Id Var1 Var2 Var3
1      a   x    w    y
2      a   z    y    w
3      b   x    y    q",header=T)

setDT(dt)
dcast(setDT(melt(dt,id.vars = c("Event","Id")))[,ind:=1],Event+Id~value,value.var = "ind",fill=0)

Output-

   Event Id Var1 Var2 Var3 q w x y z
1:     1  a    1    1    1 0 1 1 1 0
2:     2  a    1    1    1 0 1 0 1 1
3:     3  b    1    1    1 1 0 1 1 0
Rushabh Patel
  • 2,672
  • 13
  • 34