I have a dataframe with multiple values per cell, and I want to find and return the values that are only in one column.
ID <- c("1","1","1","2","2","2","3","3","3")
locus <- c("A","B","C","A","B","C","A","B","C")
Acceptable <- c("K44 Z33 G49","K72 QR123","B92 N12 PQ16 G99","V3","L89 I203 UPA66 QF29"," ","K44 Z33 K72","B92 PQ14","J22 M43 VC78")
Unacceptable <- c("K44 Z33 G48","K72 QR123 B22","B92 N12 PQ16 G99","V3 N9 Q7","L89 I203 UPA66 QF29","B8","K44 Z33"," ","J22 M43 VC78")
df <- data.frame(ID,locus,Acceptable,Unacceptable)
I want to make another column, Unique_values, that returns all the unique values that are only present in Unacceptable, and that are not in Acceptable. So the output should be this.
I already have a poorly optimized method to find the duplicates between the two columns:
df$Duplicate_values <- do.call(paste, c(df[,c("Acceptable","Unacceptable")], sep=" "))
df$Duplicate_values = sapply(strsplit(df$Duplicate_values, ' '), function(i)paste(i[duplicated(i)]))
#this is for cleaning up the text field so that it looks like the other columns
df$Duplicate_values = gsub("[^0-9A-Za-z///' ]"," ",df$Duplicate_values)
df$Duplicate_values = gsub("character 0",NA,df$Duplicate_values)
df$Duplicate_values = gsub("^c ","",df$Duplicate_values)
df$Duplicate_values = gsub(" "," ",df$Duplicate_values)
df$Duplicate_values = trimws(df$Duplicate_values)
(if anyone knows a faster method to return these duplicates, please let me now!)
I cannot use this method to find the unique values however, because it would then also return the unique values of the Acceptable column, which I do not want.
Any suggestions?