
I have a data.table with two columns (3-grams and their counts) with a key set on the ngrams column. The 3-grams are stored as a single character column, each entry being three words separated by spaces.

library(data.table)

set.seed(20182)

# simulate 3-grams with a 3-3-5 letter pattern: word1 word2 word3
create.ngrams <- function(){
        w1 <- paste(sample(letters[1:5], 3, T), collapse = '')
        w2 <- paste(sample(letters[1:5], 3, T), collapse = '')
        w3 <- paste(sample(letters, 5, T), collapse = '')

        ngram <- paste(c(w1, w2, w3), collapse = " ")
        return(ngram)
}

dt <- data.table(ngrams = replicate(100000, create.ngrams()), N = sample.int(100, 100000, replace=T))
setkey(dt, ngrams)  # key on the ngrams column, as described above


What I need to derive is: given a 2-gram, how many unique 3-grams appear in the 3-gram table with that 2-gram as the stem? The approach so far is to filter the 3-gram table with a regular expression via data.table's %like% function and take a row count. Unfortunately, the documentation states that like does not make use of the table key:

Note: Current implementation does not make use of sorted keys.
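
For the count itself (rather than the matching rows), a minimal sketch along the same %like% lines might look like this; uniqueN() is used only in case duplicate ngram rows exist:

# number of distinct 3-grams whose stem is the 2-gram "ada cab"
dt[ngrams %like% '^ada cab ', uniqueN(ngrams)]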

Because %like% can't use the key, the filtering slows down considerably:

dt[ngrams %like% '^ada cab \\.*']

          ngrams  N
1: ada cab jsfzb 33
2: ada cab rbkqz 43
3: ada cab oyohg 10
4: ada cab dahtd 87
5: ada cab qgmfb  8
6: ada cab ylyfl 13
7: ada cab izeje 83
8: ada cab fukov 12

library(microbenchmark)

microbenchmark(dt[ngrams %like% '^ada cab \\.*'])

Unit: milliseconds
                                expr     min      lq     mean  median       uq     max neval
 dt[ngrams %like% "^ada cab \\\\.*"] 22.4061 23.9792 25.89883 25.0981 26.88145 34.7454   100

On the actual table I'm working with (nrow = 46856038), the performance is too slow for the task at hand:

Unit: seconds
                              expr      min       lq     mean   median       uq      max neval
 t[ngrams %like% "^on the \\\\.*"] 10.48471 10.57198 11.27199 10.77015 10.94827 17.42804   100

Anything I could do to improve performance? I tried working with dplyr a bit, but the gains didn't appear to be significant.

Conner M.
  • venturing a guess. what kind of memory do you have? is it feasible for you to split up the 3-grams into 3 columns and then key those 3 columns and search for the 2-gram in either columns 1&2 or 2&3? – chinsoon12 Apr 27 '20 at 22:51 (a rough sketch of this idea follows the comments)
  • Using your `set.seed` I get a different starting condition. Is there something else here? (R-3.5.3, win10) (Also, you have some errant right-parens in both the first and second code chunks.) – r2evans Apr 28 '20 at 02:25
  • @r2evans win10 and R version 3.6.1, but wasn't aware that these could impact seed starting condition. – Conner M. Apr 28 '20 at 03:08
  • I don't know that they do, but with R-4.0 released recently, I thought if you were using that version that this could be a difference. \*shrug\* – r2evans Apr 28 '20 at 03:14
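
A minimal sketch of the split-and-key idea from chinsoon12's comment, assuming memory allows three extra word columns (the column names w1/w2/w3 are made up for illustration):

# split each 3-gram into its three words and key on the first two, so a
# 2-gram stem can be looked up by binary search instead of a full regex scan
dt[, c("w1", "w2", "w3") := tstrsplit(ngrams, " ", fixed = TRUE)]
setkey(dt, w1, w2)

# number of distinct 3-grams whose stem is "ada cab"
dt[.("ada", "cab"), uniqueN(ngrams), nomatch = 0L]

Because the lookup goes through the key, it should scale far better than a regex scan over 46 million strings.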

1 Answer


Are you able to go with fixed = TRUE patterns? If you prepend a space to all ngrams, it gives you a virtual "word boundary", allowing a much faster fixed-string match:

dt[, ngrams1 := paste0(" ", ngrams)]
dt
#                ngrams  N        ngrams1
#      1: dcd aee vxfba 99  dcd aee vxfba
#      2: cad bec alsmv 92  cad bec alsmv
#      3: ebe edd zbogd 90  ebe edd zbogd
#      4: aac ace miexa 26  aac ace miexa
#      5: aea cda ppyii 67  aea cda ppyii
#     ---                                
#  99996: cca bbc xaezc 58  cca bbc xaezc
#  99997: ebc cae ktacb 95  ebc cae ktacb
#  99998: bed abe dpjmc 92  bed abe dpjmc
#  99999: dde cdb frkfz 79  dde cdb frkfz
# 100000: bed bce ydawa 52  bed bce ydawa

dt[ngrams %like% '^ada cab \\.*']
#           ngrams  N        ngrams1
# 1: ada cab qbbiw 22  ada cab qbbiw
# 2: ada cab kpejz 16  ada cab kpejz
# 3: ada cab lighh  4  ada cab lighh
# 4: ada cab rxpmc 64  ada cab rxpmc

dt[grepl(' ada cab ', ngrams1, fixed = TRUE),]
#           ngrams  N        ngrams1
# 1: ada cab qbbiw 22  ada cab qbbiw
# 2: ada cab kpejz 16  ada cab kpejz
# 3: ada cab lighh  4  ada cab lighh
# 4: ada cab rxpmc 64  ada cab rxpmc

In a benchmark, a fixed pattern is 3-4 times as fast:

microbenchmark::microbenchmark(
  a = dt[ngrams %like% '^ada cab \\.*'],
  b = dt[grepl('^ada cab', ngrams),],
  c = dt[ngrams1 %flike% ' ada cab ', ],
  d = dt[grepl(' ada cab ', ngrams1, fixed = TRUE),]
)
# Unit: milliseconds
#  expr       min        lq      mean    median        uq       max neval
#     a 20.299101 21.364401 22.088702 21.832000 22.444351 25.403801   100
#     b 20.605501 21.648101 22.656212 22.382001 23.384151 26.330201   100
#     c  4.337301  4.872151  5.265142  5.125251  5.500951  9.646201   100
#     d  4.301901  4.860501  5.221697  5.102000  5.465402  7.339400   100

This does not work if the word lengths deviate from the 3-3-5 pattern: for example, with additional three-letter words the fixed pattern could accidentally match a later pair of words instead of the first two.
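
For illustration, a made-up entry with different word lengths shows how the fixed pattern can match away from the stem:

# hypothetical 4-word entry (not from the simulated data): "ada cab"
# appears as words 2-3, yet the fixed pattern still matches
x <- paste0(" ", "xxx ada cab qwert")
grepl(" ada cab ", x, fixed = TRUE)
# [1] TRUE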

r2evans