2

I have a dataframe with different variables that can share some factor levels:

      id     tag_1      tag_2      tag_3
1:    3      human      NA         artist
2:    5      human      NA         NA
3:    7      song       artist     human
4:    8      town       human      NA   

As you can see, "human" level is present in three different variables (tag_1, tag_2 and tag_3), and level "artist" in two variables (tag_2 and tag_3).

Knowing that each factor level is unique in a given row, I would like to group same levels into the same column, in order to have exclusive factor levels in each column:

      id     tag_1      tag_2      tag_3
1:    3      human      NA         artist
2:    5      human      NA         NA
3:    7      human      song       artist
4:    8      human      town       NA  

Is there a way to do so?

If not the alternative would be to sort in each row the levels by occurrences (occurrences within all dataframe)...

Thanks for any brilliant idea!

Tau
  • 173
  • 1
  • 8

1 Answers1

0

I'd do something like this:

library(data.table)
melt(setDT(mydf), "id", na.rm = TRUE)[
  , dcast(.SD, id ~ value, value.var = "value")]
#    id artist human song town
# 1:  3 artist human   NA   NA
# 2:  5     NA human   NA   NA
# 3:  7 artist human song   NA
# 4:  8     NA human   NA town

From there, it's easy to rename the columns if you really need them to be in a 'tag_number' format.

Or, just create a binary indicator:

melt(setDT(mydf), "id", na.rm = TRUE)[
  , dcast(.SD, id ~ value, value.var = "variable", fun = length, fill = 0)]
#    id artist human song town
# 1:  3      1     1    0    0
# 2:  5      0     1    0    0
# 3:  7      1     1    1    0
# 4:  8      0     1    0    1

The parallels in the "tidyverse" would be:

library(tidyverse)
mydf %>% 
  gather(key, val, -id, na.rm = TRUE) %>% 
  select(-key) %>% 
  spread(val, val)

mydf %>% 
  gather(key, val, -id, na.rm = TRUE) %>% 
  select(-key) %>% 
  mutate(value = 1) %>% 
  spread(val, value, fill = 0)
A5C1D2H2I1M1N2O1R2T1
  • 190,393
  • 28
  • 405
  • 485
  • Thanks A5C1D2H2I1M1N2O1R2T1, interesting way, but the problem is that I have almost 10k different levels, and I don't want to have 10k different variables. The objective, as you have guessed, is to make multilabel classification, but the one-hot encoding is to heavy for my case. I would like to restrict to 30 variables (each one with hundreds of levels), then proceed to 32bit binary encoding for each combination... (but still thinking to others solutions) – Tau Feb 05 '18 at 17:49