I have a large dataset and my problem is the following: I have multiple entries within a cell. I want to delete the duplicates within cells, but keep the duplicate row entries for all columns. Here is an example Code that will lead so a situation similar to the one I am in right now:
employee <- c('John_Doe|John Doe,','PeterGynn, Peter Gynn|Peter_Gynn','Jolie Hope','John Doe,','JohnDoe,','John Doe','John Doe')
id <- c(1,2,3,4,5,6,7)
salary <- c(21000, 23400, 26800, 666604, 55, 66, 22)
startdate <- as.Date(c('2010-11-1','2008-3-25',',2007-3-14','2007-3-14|','2007-3-14','2007-3-14','2007-3-14'))
employ.data <- data.frame(employee, id, salary, startdate)
View(employ.data)
employee1<- c('John Doe','Peter Gynn','Jolie Hope',"test1 test1 test1: test1", "test2","test2","test2","test2","test2")
id1 <-c(1,2,3,4,5,6,7,8,9)
salary1 <- c(21000, 55, 66,66,66,66,66,66,66)
startdate1 <- as.Date(c('2010-11-1','2008-3-22','2007-3-14',"2007-3-14","2007-3-14","2007-3-14","2007-3-14","2007-3-14","2007-3-14"))
employ.data1 <- data.frame(employee1, id1, salary1, startdate1)
names(employ.data1)[1] <- "employee"
names(employ.data1)[2] <- "id"
names(employ.data1)[3] <- "salary"
names(employ.data1)[4] <- "startdate"
Test <- merge(employ.data, employ.data1, by="id", all = TRUE)
names(Test)
Test$employ.data.z <- paste(Test$employee.x,Test$employee.y)
Test$salary.z <- paste(Test$salary.x,Test$salary.y)
Test$startdate.z <- paste(Test$startdate.x,Test$startdate.y)
Test$employee.x <- NULL
Test$employee.y <- NULL
Test$salary.x <- NULL
Test$salary.y <- NULL
Test$startdate.x <- NULL
Test$startdate.y <- NULL
names(Test)
View(Test) #here you get an idea, what my dataset looks like right now
dput(Test) #probably unnecessary here but added it anyway just in case
Input <- Test
employee2 <- c('John Doe','Peter Gynn','Jolie Hope','John Doe','John Doe','John Doe','John Doe')
id2 <- c(1,2,3,4,5,6,7)
salary2 <- c(21000, 23400, 26800, 666604, 55, 66, 22)
startdate2 <- c('2010-11-1','2008-3-25',',2007-3-14','2007-3-14','2007-3-14','2007-3-14','2007-3-14')
employ.data2 <- data.frame(employee2, id2, salary2, startdate2)
View(employ.data2)
names(employ.data2)[1] <- "employee"
names(employ.data2)[2] <- "id"
names(employ.data2)[3] <- "salary"
names(employ.data2)[4] <- "startdate"
employee3<- c(' ',' ',' ',"test1", "test2","test2","test2","test2","test2")
id3 <-c(1,2,3,4,5,6,7,8,9)
salary3 <- c(" ", 55, 66,66,66," ",66,66,66)
startdate3 <- c(' ','2008-3-22',' '," "," "," "," "," "," ")
employ.data3 <- data.frame(employee3, id3, salary3, startdate3)
names(employ.data3)[1] <- "employee"
names(employ.data3)[2] <- "id"
names(employ.data3)[3] <- "salary"
names(employ.data3)[4] <- "startdate"
Output <- merge(employ.data2, employ.data3, by="id", all = TRUE)
View(Output)
names(Output)
Output$employ.data.z <- paste(Output$employee.x,Output$employee.y)
Output$salary.z <- paste(Output$salary.x,Output$salary.y)
Output$startdate.z <- paste(Output$startdate.x,Output$startdate.y)
Output$employee.x <- NULL
Output$employee.y <- NULL
Output$salary.x <- NULL
Output$salary.y <- NULL
Output$startdate.x <- NULL
Output$startdate.y <- NULL
View(Input) #This is an example, similar to what I get with my dataset
View(Output) #This is what I would want it to be like, if possible
The dataset has >1000 rows with 4 columns that to a high degree all have that issue. So manualy adressing them is no option.
I could not find a solution for duplicate issues like this within a cell. The entire column has similar entries, sometimes they are the same, sometimes different ones. So it would be important, that it's still possible for different rows to have similar entries. Also there are punctuation issues, which I want to adress afterwards, for now the duplicates are the bigger issue.
Any suggestions?
Kind Regards,