Merge columns in data.frame after removal of duplicate strings

Question

I have a data.frame, data, of character vectors as follows.

x <- c("kal, Kon, Jor, Kara", "Bruce, Helena, Martha, Terry", "connor, oliver, Roy",  
       "Alan, Guy, Simon, Kyle")
y <- c("Mon, Cir, John, Jor", "Damian, Terry, Jason", "Mia, Roy", "John, Cary")
data <- data.frame(x,y, stringsAsFactors=FALSE)

I am trying to concatenate the strings in the two columns x and y into a new column z. Before concatenating the strings in a row, I want to remove the duplicate words (separated by ", ") and sort them. I am able to achieve this as follows.

x <- strsplit(data$x, split=", ")
y <- strsplit(data$y, split=", ")
data$z <- sapply(1:length(x), function(i) paste(sort(union(x[[i]], y[[i]])), 
                                                collapse=", "))

Is there a faster way to do this without creating the intermediate lists, maybe using data.table?

score 5 · Answer 1 · edited Apr 18 '20 at 02:11

You could try a regex solution, but it won't sort the words as you wanted. Because it uses sub, it also removes only the first duplicated word in each row.

v1 <- paste(data[,1], data[,2], sep=", ")
data$z <- sub('(\\b\\S+\\b)(?=.*\\b\\1\\b.*),', "", v1, perl=TRUE)

The regex can be viewed at regex101
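To make the regex easier to follow, here is a minimal sketch that applies it to a single combined row from the question's data. The variable names out and words are only illustrative, and the trimws step is an addition for inspecting the result, not part of the solution above.

```r
# One combined row from the question's data (x and y pasted together).
v1 <- "kal, Kon, Jor, Kara, Mon, Cir, John, Jor"

# The pattern matches a word only when the same word occurs again later in
# the string (the lookahead), together with the comma that follows it.
# sub() deletes that first match and leaves the later occurrence in place.
out <- sub('(\\b\\S+\\b)(?=.*\\b\\1\\b.*),', "", v1, perl = TRUE)

# Split the result back into words, trimming any whitespace left behind.
words <- trimws(strsplit(out, ",")[[1]])
words

# anyDuplicated() returns 0 when no word is repeated.
anyDuplicated(words)
```

Since the earlier occurrence is the one removed, the surviving words keep their original relative order rather than being sorted.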

Other options include

library(splitstackshape)
library(data.table)
cbind(data[,1:2],cSplit(setDT(data)[, indx:=1:.N],
      c('x', 'y'), sep=",", 'long')[ ,
     list(z=toString(unique(na.omit(unlist(.SD))))),
                           by=indx][,indx:=NULL])

                                 x                    y
 #1:          kal, Kon, Jor, Kara  Mon, Cir, John, Jor
 #2: Bruce, Helena, Martha, Terry Damian, Terry, Jason
 #3:          connor, oliver, Roy             Mia, Roy
 #4:       Alan, Guy, Simon, Kyle           John, Cary
  #                                       z
 #1:         kal, Kon, Jor, Kara, Mon, Cir, John
 #2: Bruce, Helena, Martha, Terry, Damian, Jason
 #3:                    connor, oliver, Roy, Mia
 #4:          Alan, Guy, Simon, Kyle, John, Cary

Or using the stringi package

 library(stringi)
 data$z <- vapply(stri_extract_all_regex(paste(data$x, data$y), '\\w+'),
                function(x) toString(sort(unique(x))), character(1))

Benchmarks

Based on a not-so-big dataset (the original 4 rows repeated 3e4 times),

 data <- data[rep(1:nrow(data), 3e4),]
 row.names(data) <- NULL

 cath <- function(){
       apply(data,1,function(vec){
                    paste(sort(unique(strsplit(paste(vec[1],
                   vec[2],sep=", "),", ")[[1]])),collapse=", ")
                  })
       }

 akrun2 <- function(){
         vapply(stri_extract_all_regex(paste(data$x, data$y), '\\w+'),
                    function(x) toString(sort(unique(x))), character(1))
      }

 akrun3 <- function(){
    v1 <- paste(data[,1], data[,2], sep=", ")
    sub('(\\b\\S+\\b)(?=.*\\b\\1\\b.*),', "", v1, perl=TRUE) 
   }

 library(microbenchmark)
 microbenchmark(cath(), akrun2(), akrun3(),unit='relative', times=10L)
 #Unit: relative
 #   expr       min        lq      mean   median       uq      max neval cld
 # cath() 11.700071 11.979908 11.700118 11.76762 11.57583 11.40806    10   c
 #akrun2()  7.175622  7.225212  7.217322  7.19431  7.09539  7.31929    10  b 
 #akrun3()  1.000000  1.000000  1.000000  1.00000  1.00000  1.00000    10  a

@CathG Thanks, I considered posting it as a new solution. But then I thought that it would set a bad precedent for others to post multiple solutions as separate posts, and it would also be a bit unfair, as I might get double votes. – akrun, Dec 19 '14 at 15:47
You've got a point. Besides, your solution really looks nice with the new edit :-) – Cath, Dec 19 '14 at 15:55

score 3 · Accepted Answer · Cath · 2014-12-19T14:00:05.287

Building on your idea, you can do this without creating the intermediate lists. Note that this version does not sort the words; wrap the unique(...) call in sort() if you need that.

data$z<-apply(data,1,function(vec){
                        paste(unique(strsplit(paste(vec[1],vec[2],sep=", "),", ")[[1]]),collapse=", ")
                      })

> data
                             x                    y                                           z
1          kal, Kon, Jor, Kara  Mon, Cir, John, Jor         kal, Kon, Jor, Kara, Mon, Cir, John
2 Bruce, Helena, Martha, Terry Damian, Terry, Jason Bruce, Helena, Martha, Terry, Damian, Jason
3          connor, oliver, Roy             Mia, Roy                    connor, oliver, Roy, Mia
4       Alan, Guy, Simon, Kyle           John, Cary          Alan, Guy, Simon, Kyle, John, Cary
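If the sorted order asked for in the question is needed, a vectorised base R variant is one option. This is a minimal sketch rather than the method benchmarked in this thread: the helper name merge_sorted and the column name z_sorted are hypothetical, and it assumes the words are always separated by ", " as in the example data.

```r
# Example data from the question.
x <- c("kal, Kon, Jor, Kara", "Bruce, Helena, Martha, Terry",
       "connor, oliver, Roy", "Alan, Guy, Simon, Kyle")
y <- c("Mon, Cir, John, Jor", "Damian, Terry, Jason", "Mia, Roy", "John, Cary")
data <- data.frame(x, y, stringsAsFactors = FALSE)

# Hypothetical helper: paste the two columns row by row, split each row once
# on the literal separator ", ", then de-duplicate, sort and re-join.
merge_sorted <- function(a, b) {
  words <- strsplit(paste(a, b, sep = ", "), ", ", fixed = TRUE)
  vapply(words,
         function(w) paste(sort(unique(w)), collapse = ", "),
         character(1))
}

data$z_sorted <- merge_sorted(data$x, data$y)
data$z_sorted
```

Passing the columns directly avoids apply, which converts each row of the data.frame to a character vector before calling the function. The sort order depends on the locale, as with any call to sort on strings.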

Although slower, base R is not that bad, based on @akrun's benchmark dataset (the original rows repeated 3e4 times):

>  microbenchmark(cath(), akrun2(), unit='relative', times=100L)
Unit: relative
     expr      min       lq     mean   median       uq      max neval cld
   cath() 1.429732 1.425991 1.427143 1.427015 1.435986 1.360235   100   b
 akrun2() 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000   100  a

edited Dec 19 '14 at 14:00

answered Dec 19 '14 at 08:41


Did you really run your benchmarks on a 4-row dataset? – A5C1D2H2I1M1N2O1R2T1 Dec 19 '14 at 10:36
@AnandaMahto It might be better to compare on a 1e6 dataset. – akrun Dec 19 '14 at 10:50
@CathG Please note that in your code, there is no `sort`, which I think might be one factor that changes the timings. – akrun Dec 19 '14 at 12:05
@CathG You mentioned running a 1e6 dataset; do you have the timings? – akrun Dec 19 '14 at 13:09
@akrun, it never ended so I stopped it because it was slowing all other "jobs" and I saw you already displayed results for a pretty big dataset... – Cath Dec 19 '14 at 13:12
@CathG It is not a very big dataset, as I don't have time to run these things. – akrun Dec 19 '14 at 13:13
