I have a data.table with two columns (3-grams and their counts), keyed on the ngrams column. Each 3-gram is a single character string of three space-separated words. A reproducible example:
library(data.table)

set.seed(20182)

# Build one random 3-gram: the first two "words" come from a small alphabet
# so that many 3-grams share the same 2-gram stem; the third word is longer.
create.ngrams <- function(){
  w1 <- paste(sample(letters[1:5], 3, T), collapse = '')
  w2 <- paste(sample(letters[1:5], 3, T), collapse = '')
  w3 <- paste(sample(letters, 5, T), collapse = '')
  ngram <- paste(c(w1, w2, w3), collapse = " ")
  return(ngram)
}

dt <- data.table(ngrams = replicate(100000, create.ngrams()),
                 N = sample.int(100, 100000, replace = T))
setkey(dt, ngrams)  # key on ngrams, as described above
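A quick sanity check that the toy table matches the description above (key set on ngrams, each entry being three space-separated words):

key(dt)                # "ngrams"
head(dt, 3)
dt[, uniqueN(ngrams)]  # distinct 3-grams in the sample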
What I need to derive is: given a 2-gram, how many unique 3-grams appear in the 3-gram table with that 2-gram as the stem? My approach so far is to filter the 3-gram table with a regular expression using data.table's %like% function and count the matching rows. Unfortunately, the documentation for like states that it doesn't make use of the table key:

Note: Current implementation does not make use of sorted keys.
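Concretely, the count I'm after for a given 2-gram looks roughly like this (a sketch of the current regex-based approach; the stem variable and the example stem "ada cab" are just for illustration with the toy data):

# Count distinct 3-grams whose first two words equal the given 2-gram.
# The regex scan via %like% cannot take advantage of the key on ngrams.
stem <- "ada cab"
dt[ngrams %like% paste0("^", stem, " "), uniqueN(ngrams)]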
This slows the filtering down considerably:
dt[ngrams %like% '^ada cab \\.*']
ngrams N
1: ada cab jsfzb 33
2: ada cab rbkqz 43
3: ada cab oyohg 10
4: ada cab dahtd 87
5: ada cab qgmfb 8
6: ada cab ylyfl 13
7: ada cab izeje 83
8: ada cab fukov 12
library(microbenchmark)
microbenchmark(dt[ngrams %like% '^ada cab \\.*'])
Unit: milliseconds
expr min lq mean median uq max neval
dt[ngrams %like% "^ada cab \\.*"] 22.4061 23.9792 25.89883 25.0981 26.88145 34.7454 100
On the actual table I'm working with (nrow = 46856038), the performance is too slow for the task at hand:
Unit: seconds
expr min lq mean median uq max neval
t[ngrams %like% "^on the \\.*"] 10.48471 10.57198 11.27199 10.77015 10.94827 17.42804 100
Is there anything I could do to improve performance? I tried working with dplyr a bit (roughly as sketched below), but the gains didn't appear to be significant.
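For reference, this is roughly the dplyr version I tried (the exact code may have differed, but it was essentially a grepl()-based filter followed by a distinct count; the completions column name is just illustrative):

library(dplyr)
# Regex filter, then count distinct 3-grams with the given stem.
dt %>%
  filter(grepl('^ada cab ', ngrams)) %>%
  summarise(completions = n_distinct(ngrams))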