dummy variable based on unique column interaction

Question

I have the following data and wish to create an $ID variable for each unique interaction between two columns

DATE <- c('V', 'V', 'W', 'W', 'X', 'X', 'Y', 'Y', 'Z', 'Z')
SEX <- rep(1:2, 5)
Blood_T1 <- c(3,4,3,3,4,3,1,6,3,4)
Blood_T2 <- c(4,3,3,3,3,4,6,1,4,3)
df1 <- data.frame(DATE, SEX, Blood_T1, Blood_T2)

When grouping by $DATE, I want to create a new dummy variable for each unique combination of $Blood_T1 and $Blood_T2 regardless of their order.

The desired out appears below:

I cant use the sum, as it does not always produce unique combinations. (See the part marked in yellow above for clarification)

I have tried the following commands but have not yet hit the nail on the head

with(df1, interaction(Blood_T1, Blood_T2))
as.numeric(as.factor(with(df1, paste(Blood_T1, Blood_T2))))
transform(df1, Cluster_ID = as.numeric(interaction(Blood_T1, Blood_T2, drop=TRUE)))

`interaction` does exactly what you say you want to do. Is it _really_ what you want to do? — Hong Ooi, Feb 16 '17 at 16:38

d.b · Answer 1 · 2017-02-16T16:43:20.823

You can actually sort the individual pairs ($Blood_T1 and $Blood_T2) and paste them together which is already a kind of ID

apply(df1, 1, function(x) paste(sort(x[3:4]), collapse = ""))
#[1] "34" "34" "33" "33" "34" "34" "16" "16" "34" "34"

If you want to further reduce it, you can treat it as a factor and obtain the numeric value

as.numeric(as.factor(apply(df1, 1, function(x) paste(sort(x[3:4]), collapse = ""))))
#[1] 3 3 2 2 3 3 1 1 3 3

You could throw in DATE too, if that is necessary

apply(df1, 1, function(x) paste(sort(x[c(1,3:4)]), collapse = ""))
#[1] "34V" "34V" "33W" "33W" "34X" "34X" "16Y" "16Y" "34Z" "34Z"

akrun · Accepted Answer · 2017-02-16T16:48:54.857

We can try with data.table. Convert the 'data.frame' to 'data.table' (setDT(df1)), get the pmin and pmax of the 'Blood_T1' and 'Blood_T2' columns, paste, it together, match the values with the unique elements to create the 'Unique_ID', then we group by 'DATE' and concatenate the sum of 'Blood_T1' and 'Blood_T2' to create the 'Sum' column

library(data.table)
setDT(df1)[, Unique_ID := {
          i1 <- paste(pmin(Blood_T1, Blood_T2), pmax(Blood_T1, Blood_T2))
           match(i1, unique(i1))}]
df1[, Sum := c(sum(Blood_T1), sum(Blood_T2)), DATE][]
#     DATE SEX Blood_T1 Blood_T2 Unique_ID Sum
#1:    V   1        3        4         1   7
#2:    V   2        4        3         1   7
#3:    W   1        3        3         2   6
#4:    W   2        3        3         2   6
#5:    X   1        4        3         1   7
#6:    X   2        3        4         1   7
#7:    Y   1        1        6         3   7
#8:    Y   2        6        1         3   7
#9:    Z   1        3        4         1   7
#10:   Z   2        4        3         1   7

The above can be also implemented in base R i.e. vectorized approach.

i1 <- with(df1, paste(pmin(Blood_T1, Blood_T2), pmax(Blood_T1, Blood_T2)))
df1$Unique_ID <- match(i1, unique(i1))

thank you, i guess I need to enroll into one of your course ASAP to reach your wizard knowledge. — lukeg, Feb 16 '17 at 16:33
@lukeg Thank you for the comment. But, I think you may be referring to Arun, the `data.table` developer (which I am not) — akrun, Feb 16 '17 at 16:36

dummy variable based on unique column interaction

2 Answers2