
I am looking for an efficient way to join two data.frames/data.tables on a character column using a grep/like/stri_detect condition.

I am able to use the sqldf package with a join on like, but it is pretty slow. On my two data.tables (5k rows and 20k rows) it takes about 60 seconds.

My second approach was to use CJ from data.table followed by stri_detect_fixed on the two columns. This approach is faster (16 seconds), but I am afraid it will become unusable as the data grows, since it significantly increases RAM usage.

I also tried to do it in a for loop, but that was the slowest approach of all.

Is there any way to do it faster, especially in data.table?

Below is my example:

library(stringi)
library(data.table)
library(sqldf)
data1 <- data.table(col1 = paste0(c("asdasd asdasd 768jjhknmnmnj", "78967ggh",
                                    "kl00896754", "kl008jku"), 1:10000))

data2 <- data.table(col2 = paste0(c("mnj", "12345", "kl008", "lll1"), 1:10000))

system.time(join1 <- data.table(sqldf("select *
                                       from data1 a inner join data2 b
                                       on a.col1 like '%' || b.col2 || '%'",
                                      drv = "SQLite")))



system.time(kartezjan <- CJ(col1 = data1[, c("col1"), with = FALSE][[1]],
                            col2 = data2[, c("col2"), with = FALSE][[1]],
                            unique = TRUE)[stri_detect_fixed(col1, col2,
                                             case_insensitive = FALSE)])
  • On my machine the sqldf code above took 89.02 seconds, 69.06 seconds using `on instr(a.col1, b.col2)` instead of `on ... like ...`, and 409.35 seconds with data.table. There is a `library(stringi)` call missing. – G. Grothendieck Mar 15 '16 at 13:16
  • The results of your examples are different. Maybe you should give a simple example and explain what you want to happen in corner cases (a value in col1 has no match, a value in col2 has no match, multiple match possibilities). Also, it would be helpful to know if there are any constraints on the values. – bluefish Mar 15 '16 at 14:21
  • I checked it with `instr`, but it was slower than with `on ... like ...`. I have now added the library call and changed the left join to an inner join to make the results the same. – Kacper Mar 15 '16 at 15:54
  • In my case there can be values in data1 that match multiple values in data2, and values in data2 that match multiple values in data1. – Kacper Mar 15 '16 at 16:11
  • You need to compare all pairs, so CJ is required; you can chunk by col1 to fit it into RAM if needed (see the sketch after these comments). The only speed-up I see is to use `data1[["col1"]]` to access the column as a vector, which gave me a 1-second speed-up. There is an open FR for [Fuzzy join `FJ()` for unstructured search](https://github.com/Rdatatable/data.table/issues/1431). – jangorecki Mar 15 '16 at 21:41
  • I got a slight speedup from 36 to 32 seconds on the `instr` version of the query (i.e. `on instr(a.col1, b.col2) > 0`) by setting an index on col2 with `setkey(data2, col2)`. Indexing doesn't help with the `like` version because the comparison string starts with a wildcard. – Philippe Marchand Jul 31 '16 at 14:57
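
A minimal sketch of the chunking idea from the comment above, reusing data1 and data2 from the question; the chunk size of 1000 is an arbitrary assumption to trade speed for RAM, since each CJ then holds at most nrow(data1) * 1000 rows at a time:

## Chunked version of the CJ + stri_detect_fixed approach: process data2 in
## pieces so the full cross join is never materialized at once.
library(stringi)
library(data.table)

chunk_size <- 1000  ## tune to the available RAM
chunks <- split(data2[["col2"]], ceiling(seq_len(nrow(data2)) / chunk_size))

kartezjan_chunked <- rbindlist(lapply(chunks, function(v2) {
  CJ(col1 = data1[["col1"]], col2 = v2,
     unique = TRUE)[stri_detect_fixed(col1, col2)]
}))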

1 Answer


The sqldf approach is the fastest on my machine for your example data, but here is a data.table version that is faster than your original, in case it helps.

library(data.table)
library(sqldf)

## Example data
v1 <- paste0(c("asdasd asdasd 768jjhknmnmnj", "78967ggh","kl00896754","kl008jku"),
    1:10000)
v2 <- paste0(c("mnj", "12345","kl008","lll1"), 1:10000)

data1 <- data.table(col1=v1, key="col1")
data2 <- data.table(col2=v2, key="col2")


## sqldf version
system.time(
  ans1 <- data.table(sqldf(
    "select * 
    from data1 a inner join data2 b
    on instr(a.col1, b.col2)", drv="SQLite"))
  )

##    user  system elapsed 
##  17.579   0.036  17.654 


## parallelized data.table version
suppressMessages(library(foreach)); suppressMessages(library(doParallel))
cores <- detectCores() ## I've got 4...
clust <- makeForkCluster(cores)
registerDoParallel(clust)

system.time({
  batches <- cores
  ## Split data2 into one contiguous group per worker
  data2[, group := sort(rep_len(1:batches, nrow(data2)))]
  ans2 <- foreach(
    i = 1:batches, .combine = function(...) rbindlist(list(...)),
    .multicombine = TRUE, .inorder = FALSE) %dopar% {
      ## Cross join data1 against this chunk of data2 and keep the matches.
      ## Note %like% does regex matching, so col2 values containing regex
      ## metacharacters would need escaping (or fixed matching instead).
      CJ(col1 = data1[, col1], col2 = data2[group == i, col2])[,
        alike := col1 %like% col2, by = col2][
          alike == TRUE][, alike := NULL][]
    }
})

##    user  system elapsed 
##   0.185   0.229  30.295 

stopCluster(clust)
stopImplicitCluster()

I'm running this on OS X; you may need to tweak the parallelization code for other operating systems (see the sketch below). Also, if your actual data is bigger and you're running out of memory, you can try a larger value for batches.
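
For example, a minimal sketch of a Windows-compatible variant, assuming a PSOCK cluster; unlike a fork cluster, PSOCK workers do not inherit the parent session, so data.table must be loaded on each worker (the group column and batches value from above are reused):

## Hypothetical PSOCK-cluster variant of the foreach loop for Windows.
clust <- makePSOCKcluster(cores)
registerDoParallel(clust)

ans2 <- foreach(
  i = 1:batches, .combine = function(...) rbindlist(list(...)),
  .multicombine = TRUE, .inorder = FALSE,
  .packages = "data.table") %dopar% {  ## load data.table on each worker
    CJ(col1 = data1[, col1], col2 = data2[group == i, col2])[,
      alike := col1 %like% col2, by = col2][
        alike == TRUE][, alike := NULL][]
  }

stopCluster(clust)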

dnlbrky