I am looking for efficient way to join 2 data.frames/data.tables on character column using grep/like/stri_detect condition.
I am able to use sqldf package with join on like, but is pretty slow. On my 2 data.tables (5k rows, 20k rows) it takes about 60 seconds.
My second approach was to use CJ from data.table and after that stri_detect_fixed on 2 columns. This approach is faster(16 seconds) but I am afraid that with growing data it will be impossible to use( it significantly increase ram usage).
I also tried to do it in for loop but it was the slowest one.
Is there any way to do it faster especially in data.table ?
Below I paste my example :
library(stringi)
library(data.table)
library(sqldf)
data1 <- data.table(col1 = paste0(c("asdasd asdasd 768jjhknmnmnj",
"78967ggh","kl00896754","kl008jku"),1:10000))
data2 <- data.table(col2 = paste0(c("mnj", "12345","kl008","lll1"), 1:10000))
system.time(join1 <- data.table(sqldf("select *
from data1 a inner join data2 b
on a.col1 like '%' || b.col2 || '%'", drv = "SQLite" )))
system.time(kartezjan <- CJ(col1 = data1[,c("col1"), with = F][[1]],
col2 = data2[,c("col2"), with = F][[1]],
unique = TRUE)[stri_detect_fixed(col1, col2, case_insensitive = FALSE)])