R: How to return unique values between columns with multiple values per cell?

Question

I have a dataframe with multiple values per cell, and I want to find and return the values that are only in one column.

ID <- c("1","1","1","2","2","2","3","3","3")
locus <- c("A","B","C","A","B","C","A","B","C")
Acceptable <- c("K44 Z33 G49","K72 QR123","B92 N12 PQ16 G99","V3","L89 I203 UPA66 QF29"," ","K44 Z33 K72","B92 PQ14","J22 M43 VC78")
Unacceptable <- c("K44 Z33 G48","K72 QR123 B22","B92 N12 PQ16 G99","V3 N9 Q7","L89 I203 UPA66 QF29","B8","K44 Z33"," ","J22 M43 VC78")

df <- data.frame(ID,locus,Acceptable,Unacceptable)

dataframe

I want to make another column, Unique_values, that returns all the unique values that are only present in Unacceptable, and that are not in Acceptable. So the output should be this.

I already have a poorly optimized method to find the duplicates between the two columns:

df$Duplicate_values <- do.call(paste, c(df[,c("Acceptable","Unacceptable")], sep=" "))
df$Duplicate_values = sapply(strsplit(df$Duplicate_values, ' '), function(i)paste(i[duplicated(i)]))

#this is for cleaning up the text field so that it looks like the other columns
df$Duplicate_values = gsub("[^0-9A-Za-z///' ]"," ",df$Duplicate_values)
df$Duplicate_values = gsub("character 0",NA,df$Duplicate_values)
df$Duplicate_values = gsub("^c ","",df$Duplicate_values)
df$Duplicate_values = gsub("    "," ",df$Duplicate_values)
df$Duplicate_values = trimws(df$Duplicate_values)

(if anyone knows a faster method to return these duplicates, please let me now!)

I cannot use this method to find the unique values however, because it would then also return the unique values of the Acceptable column, which I do not want.

Any suggestions?

score 0 · Accepted Answer · answered Dec 01 '22 at 16:51

A similar approach using setdiff:

lA <- strsplit(df$Acceptable, " ")
lU <- strsplit(df$Unacceptable, " ")
df$Unique_values <- lapply(1:nrow(df), function(i) paste0(setdiff(lU[[i]], lA[[i]]), collapse = " "))
df
#>   ID locus          Acceptable        Unacceptable Unique_values
#> 1  1     A         K44 Z33 G49         K44 Z33 G48           G48
#> 2  1     B           K72 QR123       K72 QR123 B22           B22
#> 3  1     C    B92 N12 PQ16 G99    B92 N12 PQ16 G99              
#> 4  2     A                  V3            V3 N9 Q7         N9 Q7
#> 5  2     B L89 I203 UPA66 QF29 L89 I203 UPA66 QF29              
#> 6  2     C                                      B8            B8
#> 7  3     A         K44 Z33 K72             K44 Z33              
#> 8  3     B            B92 PQ14                                  
#> 9  3     C        J22 M43 VC78        J22 M43 VC78

Thank you very much, this works great! Is there by any chance also an argument instead of setdiff that returns the duplicates? — CasualNoob, Dec 02 '22 at 09:29
Just replace `setdiff` with `intersect` to get the common values. — jblood94, Dec 02 '22 at 11:34

R: How to return unique values between columns with multiple values per cell?

1 Answers1