
This is a rather general question, not one about a specific existing example. I have a very large data.table object with 130 million rows. Filtering rows on an AND condition (`&`) runs in no time, but filtering on an OR condition (`|`) takes forever. Code like

df[start %chin% stops$names | end %chin% stops$names]

never seems to finish. I must be missing something, since data.table is supposed to be very fast. Thank you in advance.

Example data:

library(data.table)
library(tibble)

df <- setDT(tibble(start = c("stop1", "stop2", "stop3", "stop1", "stop4"),
                   end   = c("stop13", "stop2", "stop15", "stop2", "stop3"),
                   via   = c("\\stop3\\", "-", "-", "\\stop4, stop5\\", "-"),
                   date  = c(2022-01-13, 2022-04-05, 2022-03-04, 2022-04-03, 2022-01-18)))

stops <- setDT(tibble(names = c("stop5", "stop6", "stop7", "stop10"),
                      line  = c("1", "23", "450a", "2")))
  • Please add a [MWE](https://stackoverflow.com/help/minimal-reproducible-example). – Julian Mar 22 '23 at 09:20
  • It seems like a general question, but I just had a quick try at generating a `data.table` of two columns, each containing 10 million random strings. I then filtered using two `%chin%` statements with `&` and `|`, respectively, against a vector of 10 million random strings. Both took about 3.5 seconds (a rough sketch of this kind of test follows these comments). What you are experiencing may be caused by something specific to your data or your environment, so it would be helpful to post a little more info. – SamR Mar 22 '23 at 09:33
  • Your example data has no overlaps, so anything we can suggest is unprovable. Please spend a moment on your sample data to make sure that matches _do occur_. Also, `2022-01-13` is not a string or a date, it is `2008` (2022 minus 1 minus 13). If that's at all important, perhaps you should add quotes `"2022-01-13"` in your sample data. – r2evans Mar 22 '23 at 13:31
  • Further, I'm not sure where to go with this question. I expanded your two sample frames to 5M rows (`df`) and 40K rows (`stops`), and _yes_, the performance of `&` is faster. It will almost always be faster, since by the nature of subsetting, it must copy all of the matching data to a new object. With `|`, it seems likely that you would match many more rows, which take more time to copy into a new object. I think what you're hitting here is a fundamental CS issue: copying large objects is expensive. – r2evans Mar 22 '23 at 13:41
  • Perhaps you should consider what you intend to do with that frame, and whether you really need it in a new object. If you are trying to discard rows ("permanently"), then the best approach theoretically is to delete rows by reference (https://stackoverflow.com/q/32882768/3358272), but that's [#635](https://github.com/Rdatatable/data.table/issues/635) (not yet implemented). If you are having difficulty in general dealing with data this large, perhaps a not-in-memory data solution could work, e.g. `arrow` or a database (see the sketch after these comments), assuming you will be filtering/subsetting a lot. – r2evans Mar 22 '23 at 13:45
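
A rough sketch of the kind of timing test described in SamR's comment (the string pool, sizes, and timing calls here are assumptions, not the original code):

library(data.table)

n    <- 1e7
pool <- sprintf("s%05d", 1:50000)                        # pool of candidate strings
big  <- data.table(start = sample(pool, n, replace = TRUE),
                   end   = sample(pool, n, replace = TRUE))
keys <- sample(pool, 10000)

system.time(big[start %chin% keys & end %chin% keys])    # AND filter
system.time(big[start %chin% keys | end %chin% keys])    # OR filter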
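
And a minimal sketch of the not-in-memory route r2evans mentions, assuming the `arrow` and `dplyr` packages and a hypothetical dataset directory "df_dataset" (`%in%` is used here because `%chin%` is a data.table operator and is likely not translated by arrow's dplyr backend):

library(arrow)
library(dplyr)

write_dataset(df, "df_dataset")              # write once as a Parquet dataset

open_dataset("df_dataset") |>
  filter(start %in% stops$names | end %in% stops$names) |>
  collect()                                  # pull only the matching rows into memory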

1 Answer

n     <- 10e6                                    # scale the example data up to 10 million rows
dfbig <- df[sample(.N, size = n, replace = TRUE)]

microbenchmark::microbenchmark(
  dfbig[start %chin% stops$names | end %chin% stops$names],  # single OR filter
  dfbig[start %chin% stops$names][end %chin% stops$names]    # chained subsetting (the AND condition)
)
# Unit: milliseconds
#                                                      expr        min       lq       mean     median         uq      max neval
#  dfbig[start %chin% stops$names | end %chin% stops$names] 109.113300 110.8053 119.439176 111.941651 116.210701 249.6909   100
#   dfbig[start %chin% stops$names][end %chin% stops$names]   1.614001   1.7166   2.609306   2.109502   2.239902  56.5259   100
s_baldur