
I have a problem where I need to select and save part of a table based on one column, then eliminate rows from the source table that match values in one of the columns of the saved table.

I've found that dplyr and data.table are slower than base R and am wondering if I'm doing something wrong here (an anti-pattern I'm not aware of?) or if someone knows a faster solution to this.

I need to scale it up to ~10 million rows in the search df and ~10k iterations of the y_unique search.

Here's a reasonable reproducible example...

(Edit: I realized that what I was doing could be achieved through a group filter. I'm leaving an updated reproducible example with some tweaks from the comments below, along with my updated solution. Note that the original did not include the bind_rows(y_list) step; in retrospect, I should have included that in this example.)
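For reference, here is the group-filter solution on its own, plus what I believe is the analogous data.table idiom, using the df and dt from the setup below (the data.table line is a guess on my part and is not part of the benchmark):

library(dplyr)
library(data.table)

# group filter: for each x, keep the row(s) with the smallest y
out_dplyr <- filter(group_by(df, x), y == min(y))

# assumed data.table equivalent: index the rows holding each group's minimum y
out_dt <- dt[dt[, .I[y == min(y)], by = x]$V1]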

library(dplyr)
library(data.table)
library(microbenchmark)

microbenchmark(base = {
  for(y_check in y_unique) {
    y_list[[as.character(y_check)]] <- df[df$y == y_check, ]
    df <- df[!df$x %in% y_list[[as.character(y_check)]]$x, ]
  }
  out <- bind_rows(y_list)
}, dplyr = {
  for(y_check in y_unique) {
    y_list[[as.character(y_check)]] <- filter(df, y == y_check)
    df <- df[!df$x %in% y_list[[as.character(y_check)]]$x, ]
  }
  out <- bind_rows(y_list)
}, data.table = {
  for(y_check in y_unique) {
    y_list[[as.character(y_check)]] <- dt[y == y_check]
    dt <- dt[!x %in% y_list[[as.character(y_check)]]$x]
  }
  out <- do.call(rbind, y_list)
}, alternate = {
  df <- group_by(df, x)
  out <- filter(df, y == min(y))
}, times = 10, setup = {
  set.seed(1)
  df <- data.frame(x = sample(1:1000, size = 1000, replace = TRUE),
                   y = sample(1:100, size = 1000, replace = TRUE))
  dt <- data.table(df)
  y_unique <- sort(unique(df$y))
  y_list <- setNames(rep(list(list()), length(y_unique)), y_unique)
})

I get:

Unit: milliseconds
       expr        min        lq       mean     median        uq        max neval
       base  12.939135  13.22883  13.623098  13.500897  13.95468  14.517167    10
      dplyr  41.517351  42.22595  50.041123  45.199978  61.33194  65.927611    10
 data.table 228.014360 233.98309 248.281965 240.172383 263.39943 287.706941    10
  alternate   3.310031   3.42016   3.745013   3.454537   4.17488   4.497455    10

On my real data I get more or less the same: base is 2+ times faster than dplyr, and data.table is... slow. Any ideas?

Light Yagami
    If you're worried about speed, I would strongly recommend testing at scale, or at least closer to scale. In many cases `data.table` is slower at small scale (when the difference is in milliseconds) and faster at large scale (when the difference is in seconds or minutes---i.e., where it counts). – Gregor Thomas Jul 06 '19 at 01:31
  • `data.table` can also be context specific. Much of the data.table time in this benchmark is probably tied up in the data.table conversion. But if you're doing more than this one transformation, converting to data.table will speed up *all* your operations. (And if your data is read in from a file, using `data.table::fread` will produce a data.table more quickly than any other import method will produce a data frame, so holistically it may not make sense to include `data.table()` in the benchmark.) – Gregor Thomas Jul 06 '19 at 01:36
  • Also, see `?split` and `?split.data.table`, and `?dplyr::group_split` (a sketch of this idea follows these comments). [This question is highly relevant to you](https://stackoverflow.com/q/18527051/903061). – Gregor Thomas Jul 06 '19 at 01:43
  • Hmmm.. perhaps I've misunderstood the iterative nature? Do you expect the filtered `df` to have anything in it at the end? Or is the point filtering it for the next iteration? – Gregor Thomas Jul 06 '19 at 01:50
  • Adding to what @Gregor said, coercion will be an overhead not only during `data.frame` to `data.table` conversion, but also during searching, because you're doing `as.character(y_unique)` and then using it to search integers. You should leave it as integer and use `y_list[[as.character(y_check)]]`. Moreover, `data.table` is probably not made for this kind of scenario; check [this question and its answer](https://stackoverflow.com/questions/56348397/improving-data-table-subsetting-performance). – Alexis Jul 06 '19 at 14:51
  • Thanks for all the ideas. At the end of the day I had to rethink this and use the structure of my data to allow a more efficient approach. I'll update my question to reflect that. – David Blodgett Jul 06 '19 at 14:53
  • Really sorry. This is an example of focusing on optimizing a local bit of code when I should have been thinking a bit more broadly. I failed to include a `bind_rows(y_list)` call to bring the set of `data.frame`s together. I'll edit once more and make each block of code do the same thing. – David Blodgett Jul 07 '19 at 14:04
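
A minimal sketch of the `split` idea from the comments, using the same `df` as in the setup (an illustrative rewrite, not code from the thread): splitting once by `y` avoids re-filtering `df` on every iteration, while a running set of seen `x` values reproduces the elimination step.

library(dplyr)

# split once by y; groups come back in sorted y order, like y_unique
pieces <- split(df, df$y)
seen_x <- integer(0)
y_list <- lapply(pieces, function(piece) {
  # keep only rows whose x has not appeared in an earlier (smaller-y) group
  keep <- piece[!piece$x %in% seen_x, ]
  seen_x <<- c(seen_x, keep$x)
  keep
})
out <- bind_rows(y_list)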

1 Answer


A few options using joins (around 13s with any of the join methods at the actual dimensions):

DT <- copy(dt)
setorder(DT, y, x)   # after sorting, first(y) within each x is that x's minimum y
# per unique x, pick its first (minimum) y, then join back on (x, y) to retrieve those rows
DT[DT[.(unique(x)), on=.(x), .(y=first(y)), by=.EACHI], on=.(x,y)]

or if original ordering matters:

DT2 <- copy(dt)
setorder(DT2[, rn := .I], y, x)   # record original row numbers, then sort by y, x
# per unique x, take the original row numbers where y equals its first (minimum) y,
# then subset dt with those indices sorted back into the original order
dt[sort(DT2[.(unique(x)), on=.(x), rn[y==first(y)], by=.EACHI]$V1)]

and also one using the `min` mentioned in the OP:

DT0[, rn := .I]   # row numbers on an unsorted copy
# per unique x, take the row numbers where y equals min(y), kept in original row order
dt[DT0[.(unique(x)), on=.(x), rn[y==min(y)], by=.EACHI][order(V1), V1]]
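
For comparison, a dplyr analogue of the same join idea (assumed equivalent; not benchmarked here):

library(dplyr)

# build a lookup of min(y) per x, then join back to keep only the matching rows
lookup <- summarise(group_by(df, x), y = min(y))
out <- inner_join(df, lookup, by = c("x", "y"))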

timing code:

base <- function() {
    for(y_check in y_unique) {
        y_list[[as.character(y_check)]] <- df[df$y == y_check, ]
        df <- df[!df$x %in% y_list[[as.character(y_check)]]$x, ]
    }
    do.call(rbind, y_list)
} #base

mtd0 <- function() {
    for(y_check in y_unique) {
        y_list[[as.character(y_check)]] <- dt[y == y_check]
        dt <- dt[!x %in% y_list[[as.character(y_check)]]$x]
    }
    out <- rbindlist(y_list)
} #mtd0

join_mtd <- function() {
    setorder(DT, y, x)
    dt[DT[.(unique(x)), on=.(x), .(y=first(y)), by=.EACHI], on=.(x,y)]
} #join_mtd

join_mtd2 <- function() {
    setorder(DT2[, rn := .I], y, x)
    dt[sort(DT2[.(unique(x)), on=.(x), rn[y==first(y)], by=.EACHI]$V1)]
} #join_mtd2

join_mtd3 <- function() {
    DT0[, rn := .I]
    dt[DT0[.(unique(x)), on=.(x), rn[y==min(y)], by=.EACHI][order(V1), V1]]
} #join_mtd3

# check = FALSE because the methods return the same rows in different orders/classes;
# equivalence is verified separately below
bench::mark(base(), data.table_0=mtd0(), 
    jm=join_mtd(), jm2=join_mtd2(), jm3=join_mtd3(), check=FALSE)

checks:

baseans <- setDT(base())
data.table_0 <- mtd0()
ordbase <- setorder(copy(baseans), y, x)
jm <- join_mtd()
jm2 <- join_mtd2()
jm3 <- join_mtd3()

identical(baseans, data.table_0)
#[1] TRUE
identical(ordbase, setorder(jm, y, x))
#[1] TRUE
identical(ordbase, setorder(jm2, y, x))
#[1] TRUE
identical(ordbase, setorder(jm3, y, x))
#[1] TRUE

timings:

# A tibble: 5 x 14
  expression        min     mean   median      max `itr/sec` mem_alloc  n_gc n_itr total_time result                   memory                time    gc            
  <chr>        <bch:tm> <bch:tm> <bch:tm> <bch:tm>     <dbl> <bch:byt> <dbl> <int>   <bch:tm> <list>                   <list>                <list>  <list>        
1 base()         38.59s   38.59s   38.59s   38.59s    0.0259    27.3GB   308     1     38.59s <data.frame [632,329 x ~ <Rprofmem [43,206 x ~ <bch:t~ <tibble [1 x ~
2 data.table_0   24.65s   24.65s   24.65s   24.65s    0.0406      14GB   159     1     24.65s <data.table [632,329 x ~ <Rprofmem [72,459 x ~ <bch:t~ <tibble [1 x ~
3 jm              1.28s    1.28s    1.28s    1.28s    0.779       75MB     7     1      1.28s <data.table [632,329 x ~ <Rprofmem [2,418 x 3~ <bch:t~ <tibble [1 x ~
4 jm2             1.44s    1.44s    1.44s    1.44s    0.696     62.5MB     9     1      1.44s <data.table [632,329 x ~ <Rprofmem [1,783 x 3~ <bch:t~ <tibble [1 x ~
5 jm3             1.57s    1.57s    1.57s    1.57s    0.636     62.5MB     9     1      1.57s <data.table [632,329 x ~ <Rprofmem [178 x 3]>  <bch:t~ <tibble [1 x ~

data:

library(data.table)
library(bench)

set.seed(1L)
nr <- 10e6/10   # one tenth of the ~10 million rows mentioned in the question
ni <- 10e3/10   # one tenth of the ~10k distinct y values
df <- data.frame(x = sample(nr, size = nr, replace = TRUE),
    y = sample(ni, size = nr, replace = TRUE))
dt <- data.table(df)
DT0 <- copy(dt)
DT <- copy(dt)
DT2 <- copy(dt)

y_unique <- sort(unique(df$y))
y_list <- setNames(rep(list(list()), length(y_unique)), y_unique)
chinsoon12