How can I one-hot-encode multiple columns in R that share categories?

Question

Say I have a dataframe with two columns like this:

Label 1	Label 2
A	B
A	C
B	C
C	A

The values A, B, and C in the first column are the same categories as A, B, and C in the second column. I want the encoding to look like this:

Label 1	Label 2	is_A	is_B	is_C
A	B	1	1	0
A	C	1	0	1
B	C	0	1	1
C	A	1	0	1

Basically, I just want it to check if a value shows up in either column. If so, then code a 1, if not then code a 0.

Now, I know I could write this using an if_else, like this:

df <- df %>% mutate(is_A = if_else(label1 == 'A' | label2 == 'A', 1, 0),
                    is_B = if_else(label1 == 'B' | label2 == 'B', 1, 0),
                    is_C = if_else(label1 == 'C' | label2 == 'C', 1, 0))

but I have many different categories and don't want to write out 50+ if_else statements. I've also tried this:

encoded_labels <- model.matrix(~ label1 + label2 - 1, data = df)

but this creates separate encodings for label1A vs. label2A, etc. Is there a simpler way to do this?

Onyambu · Accepted Answer · 2023-05-25T16:09:03.670

In base R, you could try:

cbind(df, unclass(table(row(df), unlist(df))))

  Label_1 Label_2 A B C
1       A       B 1 1 0
2       A       C 1 0 1
3       B       C 0 1 1
4       C       A 1 0 1
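
The answer does not show how df was built, so here is a minimal sketch for reproducing the result, assuming df holds the question's four rows as character columns named Label_1 and Label_2. The intermediate names row_id and value are only illustrative; they spell out what row(df) and unlist(df) contribute to the one-liner.

```r
# Assumed reconstruction of the question's data (not shown in the answer).
df <- data.frame(Label_1 = c("A", "A", "B", "C"),
                 Label_2 = c("B", "C", "C", "A"))

# row(df) gives the row index of every cell; flattened column by column
# it lines up with the flattened cell values from unlist(df).
row_id <- as.vector(row(df))
value  <- unlist(df, use.names = FALSE)

# Cross-tabulating row index against value counts how often each
# category occurs in each row; unclass() leaves a plain integer matrix.
counts <- unclass(table(row_id, value))

# Bind the indicator columns onto the original data.
result <- cbind(df, counts)
print(result)
```

Because the table is keyed on the cell values themselves, no category has to be listed by hand, which is what makes this workable with 50+ categories.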

Another way:

cbind(df, +sapply(unique(unlist(df)), grepl, do.call(paste, df)))

  Label_1 Label_2 A B C
1       A       B 1 1 0
2       A       C 1 0 1
3       B       C 0 1 1
4       C       A 1 0 1
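
One thing to watch with the grepl version: it matches substrings, so it is only safe when no category name is contained in another. The sketch below uses made-up categories ("A" and "AB") to show the difference, and compares against an exact-match variant based on rowSums(df == v); that variant is my own illustration, not part of the answer.

```r
# Hypothetical data in which "A" is a substring of another category, "AB".
df <- data.frame(Label_1 = c("AB", "C"),
                 Label_2 = c("C", "A"))

vals <- unique(unlist(df))

# The grepl approach: pattern "A" also matches the pasted row "AB C".
loose <- +sapply(vals, grepl, do.call(paste, df))

# Exact-match alternative: compare every cell to the value and
# flag rows with at least one hit.
exact <- +sapply(vals, function(v) rowSums(df == v) > 0)

print(loose)
print(exact)
```

In row 1, loose flags "A" even though the row only contains "AB" and "C", while exact does not.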

Note that table() returns counts, so if the same value can appear in both columns of a row, you should use:

+unclass(table(row(df), unlist(df))>0)

This keeps the result at 0/1 even for rows in which a value occurs more than once.
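
To see why this matters, here is a small sketch with made-up data in which row 2 has "A" in both columns. The data and the names counts and flags are assumptions for illustration only.

```r
# Hypothetical data: row 2 contains "A" twice.
df <- data.frame(Label_1 = c("A", "A", "B"),
                 Label_2 = c("B", "A", "C"))

# Plain counts: row 2 gets a 2 under "A", which is not a valid 0/1 flag.
counts <- unclass(table(row(df), unlist(df)))

# Comparing with > 0 gives TRUE/FALSE; unary + turns that back into 1/0.
flags <- +unclass(table(row(df), unlist(df)) > 0)

print(counts)
print(flags)
```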

If you want to use model.matrix:

+Reduce("|", split(data.frame(model.matrix(~values+0, stack(df))), col(df)))
  valuesA valuesB valuesC
1       1       1       0
2       1       0       1
3       0       1       1
4       1       0       1
