Is there an R function to delete duplicate values within a cell for a large dataset?

Question

I have a large dataset and my problem is the following: I have multiple entries within a cell. I want to delete the duplicates within cells, but keep the duplicate row entries for all columns. Here is an example Code that will lead so a situation similar to the one I am in right now:

          employee <- c('John_Doe|John Doe,','PeterGynn, Peter Gynn|Peter_Gynn','Jolie Hope','John Doe,','JohnDoe,','John Doe','John Doe')
id <- c(1,2,3,4,5,6,7)
salary <- c(21000, 23400, 26800, 666604, 55, 66, 22)
startdate <- as.Date(c('2010-11-1','2008-3-25',',2007-3-14','2007-3-14|','2007-3-14','2007-3-14','2007-3-14'))
employ.data <- data.frame(employee, id, salary, startdate)
View(employ.data)


employee1<- c('John Doe','Peter Gynn','Jolie Hope',"test1 test1 test1: test1", "test2","test2","test2","test2","test2")
id1 <-c(1,2,3,4,5,6,7,8,9)
salary1 <- c(21000, 55, 66,66,66,66,66,66,66)
startdate1 <- as.Date(c('2010-11-1','2008-3-22','2007-3-14',"2007-3-14","2007-3-14","2007-3-14","2007-3-14","2007-3-14","2007-3-14"))
employ.data1 <- data.frame(employee1, id1, salary1, startdate1)
names(employ.data1)[1] <- "employee"
names(employ.data1)[2] <- "id"
names(employ.data1)[3] <- "salary"
names(employ.data1)[4] <- "startdate"

Test <- merge(employ.data, employ.data1, by="id", all = TRUE) 

names(Test)
Test$employ.data.z <- paste(Test$employee.x,Test$employee.y) 
Test$salary.z <- paste(Test$salary.x,Test$salary.y) 
Test$startdate.z <- paste(Test$startdate.x,Test$startdate.y) 
Test$employee.x <- NULL
Test$employee.y <- NULL
Test$salary.x <- NULL
Test$salary.y <- NULL
Test$startdate.x <- NULL
Test$startdate.y <- NULL
names(Test)
View(Test)  #here you get an idea, what my dataset looks like right now
dput(Test) #probably unnecessary here but added it anyway just in case
Input <- Test


employee2 <- c('John Doe','Peter Gynn','Jolie Hope','John Doe','John Doe','John Doe','John Doe')
id2 <- c(1,2,3,4,5,6,7)
salary2 <- c(21000, 23400, 26800, 666604, 55, 66, 22)
startdate2 <- c('2010-11-1','2008-3-25',',2007-3-14','2007-3-14','2007-3-14','2007-3-14','2007-3-14')
employ.data2 <- data.frame(employee2, id2, salary2, startdate2)
View(employ.data2)
names(employ.data2)[1] <- "employee"
names(employ.data2)[2] <- "id"
names(employ.data2)[3] <- "salary"
names(employ.data2)[4] <- "startdate"

employee3<- c(' ',' ',' ',"test1", "test2","test2","test2","test2","test2")
id3 <-c(1,2,3,4,5,6,7,8,9)
salary3 <- c(" ", 55, 66,66,66," ",66,66,66)
startdate3 <- c(' ','2008-3-22',' '," "," "," "," "," "," ")
employ.data3 <- data.frame(employee3, id3, salary3, startdate3)
names(employ.data3)[1] <- "employee"
names(employ.data3)[2] <- "id"
names(employ.data3)[3] <- "salary"
names(employ.data3)[4] <- "startdate"

Output <- merge(employ.data2, employ.data3, by="id", all = TRUE) 
View(Output)
names(Output)
Output$employ.data.z <- paste(Output$employee.x,Output$employee.y) 
Output$salary.z <- paste(Output$salary.x,Output$salary.y) 
Output$startdate.z <- paste(Output$startdate.x,Output$startdate.y) 
Output$employee.x <- NULL
Output$employee.y <- NULL
Output$salary.x <- NULL
Output$salary.y <- NULL
Output$startdate.x <- NULL
Output$startdate.y <- NULL
View(Input)    #This is an example, similar to what I get with my dataset
View(Output)   #This is what I would want it to be like, if possible

The dataset has >1000 rows with 4 columns that to a high degree all have that issue. So manualy adressing them is no option.

I could not find a solution for duplicate issues like this within a cell. The entire column has similar entries, sometimes they are the same, sometimes different ones. So it would be important, that it's still possible for different rows to have similar entries. Also there are punctuation issues, which I want to adress afterwards, for now the duplicates are the bigger issue.

Any suggestions?

Kind Regards,

Hey, I edited the question, if anyone could help me now, I would really appreciate it. — Thisonesecret, Mar 04 '20 at 09:15
Hey, I fixed the error, it should run fine now. The Input is the first dataset (I named it Input) and what I would like to transform it to is the Output. I do not know how to get there, as I only manage to delete entire rows and couldn't find a way to modify/edit all cells (meaning deleting all kind of duplicates that are in-cell). — Thisonesecret, Mar 04 '20 at 12:45

score 0 · Answer 1 · answered Mar 04 '20 at 13:08

We can use separate_rows to split data into new rows. For employ.data.z, the data has lot of separators which can be included in the sep argument. For salary.z and startdate.z we can separate the rows on whitespace. Once we have the data in long format we can group_by id and create one comma-separated string for all the columns.

library(dplyr)
library(tidyr)

Input %>%
  separate_rows(employ.data.z, sep = '[|,:]') %>%
  separate_rows(salary.z, startdate.z, sep = "\\s+") %>%
  group_by(id) %>%
  summarise_at(vars(-group_cols()), ~toString(unique(.)))

# A tibble: 9 x 4
#     id employ.data.z                                 salary.z   startdate.z           
#  <dbl> <chr>                                         <chr>      <chr>                 
#1     1 John_Doe, John Doe,  John Doe                 21000      2010-11-01            
#2     2 PeterGynn,  Peter Gynn, Peter_Gynn Peter Gynn 23400, 55  2008-03-25, 2008-03-22
#3     3 Jolie Hope Jolie Hope                         26800, 66  NA, 2007-03-14        
#4     4 John Doe,  test1 test1 test1,  test1          666604, 66 2007-03-14            
#5     5 JohnDoe,  test2                               55, 66     2007-03-14            
#6     6 John Doe test2                                66         2007-03-14            
#7     7 John Doe test2                                22, 66     2007-03-14            
#8     8 NA test2                                      NA, 66     NA, 2007-03-14        
#9     9 NA test2                                      NA, 66     NA, 2007-03-14

Is there an R function to delete duplicate values within a cell for a large dataset?

1 Answers1