
Say I have a dataframe with two columns like this:

Label 1 Label 2
A B
A C
B C
C A

The values of A, B, and C in the first column are the same values of A, B, and C in the 2nd column. I want the encoding to look like this:

Label 1 Label 2 is_A is_B is_C
A B 1 1 0
A C 1 0 1
B C 0 1 1
C A 1 0 1

Basically, I just want it to check if a value shows up in either column. If so, then code a 1, if not then code a 0.

Now, I know I could write this using an if_else, like this:

df <- df %>% mutate(is_A = if_else(label1 == 'A' | label2 == 'A', 1, 0),
                    is_B = if_else(label1 == 'B' | label2 == 'B', 1, 0),
                    is_C = if_else(label1 == 'C' | label2 == 'C', 1, 0))

but I have many different categories and don't want to write out 50+ if_else statements. I've also tried this:

encoded_labels <- model.matrix(~ label1 + label2 - 1, data = df)

but this creates separate encodings for label1A vs. label2A, etc. Is there a simpler way to do this?

user276238

1 Answer


In base R, you could try:

cbind(df, unclass(table(row(df), unlist(df))))

  Label_1 Label_2 A B C
1       A       B 1 1 0
2       A       C 1 0 1
3       B       C 0 1 1
4       C       A 1 0 1
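
For a reproducible version of the table approach, here is a minimal sketch. It assumes the data frame has character columns named Label_1 and Label_2 (the names are a guess based on the printed output), and it renames the indicator columns to the is_A, is_B, is_C form asked for in the question:

```r
# Example data from the question; the column names are an assumption
df <- data.frame(Label_1 = c("A", "A", "B", "C"),
                 Label_2 = c("B", "C", "C", "A"),
                 stringsAsFactors = FALSE)

# Cross-tabulate the row index against every value found in that row
ind <- unclass(table(row(df), unlist(df)))

# Prefix the indicator columns so they read is_A, is_B, is_C
colnames(ind) <- paste0("is_", colnames(ind))

result <- cbind(df, ind)
result
```

Because the columns come from whatever values are present in the data, this should scale to 50+ categories without listing them by hand.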

Another way:

cbind(df, +sapply(unique(unlist(df)), grepl, do.call(paste, df)))

  Label_1 Label_2 A B C
1       A       B 1 1 0
2       A       C 1 0 1
3       B       C 0 1 1
4       C       A 1 0 1
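
One caveat with the grepl version: grepl does pattern matching, so a label that is a substring of another label (say "A" and "AB") will also match, and labels containing regex metacharacters are interpreted as patterns. If that could happen in your data, an exact comparison is safer. A small sketch with made-up labels (the data and the names by_pattern and by_equality are only for illustration):

```r
# Hypothetical data where one label is a substring of another
df <- data.frame(Label_1 = c("A", "AB"),
                 Label_2 = c("B", "C"),
                 stringsAsFactors = FALSE)

vals <- sort(unique(unlist(df)))

# Pattern matching: "A" is also found inside "AB"
by_pattern <- +sapply(vals, grepl, do.call(paste, df))

# Exact comparison: 1 only where a cell equals the value itself
by_equality <- +sapply(vals, function(v) rowSums(df == v) > 0)

by_pattern
by_equality
```

In row 2, by_pattern flags "A" even though neither cell is exactly "A", while by_equality does not.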

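Note that table returns counts, so if the same value can appear in both columns of a row, you should convert the counts to 0/1 indicators:

+unclass(table(row(df), unlist(df))>0)

This caps the result at 1 for rows where a value appears more than once.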
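
To see the difference, here is a small sketch with a made-up data frame in which the first row holds "A" in both columns (the data and the names counts and flags are only for illustration):

```r
# Hypothetical data: row 1 has the same value in both columns
df <- data.frame(Label_1 = c("A", "B"),
                 Label_2 = c("A", "C"),
                 stringsAsFactors = FALSE)

# Plain table: row 1 counts "A" twice
counts <- unclass(table(row(df), unlist(df)))

# Thresholded table: every count above zero becomes 1
flags <- +unclass(table(row(df), unlist(df)) > 0)

counts
flags
```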

If you want to use model.matrix:

+Reduce("|", split(data.frame(model.matrix(~values+0, stack(df))), col(df)))
  valuesA valuesB valuesC
1       1       1       0
2       1       0       1
3       0       1       1
4       1       0       1
Onyambu