Say I have a dataframe with two columns like this:
Label 1 | Label 2 |
---|---|
A | B |
A | C |
B | C |
C | A |
The values A, B, and C in the first column are the same categories as A, B, and C in the second column. I want the encoding to look like this:
Label 1 | Label 2 | is_A | is_B | is_C |
---|---|---|---|---|
A | B | 1 | 1 | 0 |
A | C | 1 | 0 | 1 |
B | C | 0 | 1 | 1 |
C | A | 1 | 0 | 1 |
Basically, I just want it to check whether a value shows up in either column. If so, code a 1; if not, code a 0.
Now, I know I could write this using if_else, like this:
df <- df %>% mutate(is_A = if_else(label1 == 'A' | label2 == 'A', 1, 0),
                    is_B = if_else(label1 == 'B' | label2 == 'B', 1, 0),
                    is_C = if_else(label1 == 'C' | label2 == 'C', 1, 0))
but I have many different categories and don't want to write out 50+ if_else statements. I've also tried this:
encoded_labels <- model.matrix(~ label1 + label2 - 1, data = df)
but this creates separate encodings for label1A vs. label2A, etc. Is there a simpler way to do this?
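For reference, the rule described above (code a 1 when a value shows up in either column) can be sketched in base R by looping over the distinct values instead of writing one if_else per category. This is only a minimal sketch, not a definitive solution: it assumes the columns are named label1 and label2 as in the code above, that neither column contains NA, and the helper names lvls and flags are made up for illustration.

```r
# Example data from the question (column names label1/label2 assumed)
df <- data.frame(
  label1 = c("A", "A", "B", "C"),
  label2 = c("B", "C", "C", "A"),
  stringsAsFactors = FALSE
)

# Every distinct value that appears in either column
# (as.character() so this also works if the columns are factors)
lvls <- sort(unique(c(as.character(df$label1), as.character(df$label2))))

# For each value, flag the rows where it appears in either column
flags <- lapply(lvls, function(v) as.integer(df$label1 == v | df$label2 == v))
names(flags) <- paste0("is_", lvls)

# Attach one new column per value: is_A, is_B, is_C, ...
df[names(flags)] <- flags
df
```

Because the set of values is taken from the data itself, the same code should cover 50+ categories without any change.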