
I want to filter rows and store the original row number (of the source data) in a data.table column.

I know that .I has a syntax variant for that (see https://stackoverflow.com/a/23586059/4468078), but this requires me to filter twice, which I want to avoid:

DT <- mtcars
setDT(DT)
row.numbers <- DT[, .I[ gear > 4]]
# > row.numbers
# [1] 27 28 29 30 31
DT[row.numbers, .(row.numbers, gear)]
#    row.numbers gear
# 1:          27    5
# 2:          28    5
# 3:          29    5
# 4:          30    5
# 5:          31    5

If I use the normal .I syntax, it returns the row numbers of the subset, not of the original data:

DT[gear > 4, .(.I, gear)]

   I gear
1: 1    5
2: 2    5
3: 3    5
4: 4    5
5: 5    5

Any ideas for an easier/optimized solution?

  • It may be possible, but it is my understanding that the data.table is subset in the `i` argument prior to any calculations in the `j` argument. – lmo Jul 26 '17 at 16:10

4 Answers


Edit 2: Added the w2 variant of @Frank...

To supplement the accepted answer of @UweBlock, I have done some benchmarks which I want to share here:

library(data.table)
library(microbenchmark)
# size: about 800 MB
DT <- data.table(x = sample(1000, 1E8, replace = TRUE), y = sample(1000, 1E8, replace = TRUE))
LIMIT <- 500

microbenchmark(row.filter = {
  row.numbers <- DT[, .I[x > LIMIT]]
  res <- DT[row.numbers, .(row.numbers, x, y)]
},
chaining = {
  res <- DT[, row.number := .I][x > LIMIT, .(row.number, x, y)]
},
w2 = {
  w = DT[x > LIMIT, which = TRUE ]
  DT[w, c("x","y")][, w := w ]
},
times = 20)

The result is:

Unit: seconds
       expr      min       lq     mean   median       uq      max neval cld
 row.filter 2.039627 2.152165 2.290105 2.284775 2.381632 2.652203    20   b
   chaining 2.032791 2.272493 2.369759 2.359630 2.472028 2.777191    20   b
         w2 1.104414 1.194826 1.274428 1.257893 1.311050 1.557225    20  a 

Edit 1: To compare the impact of filter selectivity:

For LIMIT <- 100:

Unit: seconds
       expr      min       lq     mean   median       uq      max neval cld
 row.filter 3.254134 3.638193 4.053991 3.865599 4.432278 5.337939    20   b
   chaining 3.005504 3.874443 4.116179 4.069974 4.391666 4.994020    20   b
         w2 1.289617 1.588608 1.965523 1.962185 2.294457 2.859689    20  a 

And for LIMIT <- 900:

Unit: milliseconds
       expr      min       lq     mean   median       uq       max neval cld
 row.filter 900.9504 905.0694 914.9406 907.5211 916.2071  964.6856    20  b 
   chaining 927.1630 932.0981 965.8222 970.9336 981.5885 1030.6396    20   c
         w2 607.0091 609.8028 620.5582 612.0490 615.2337  669.9706    20 a  
  • Nice, especially to check for filter selectivity. It seems that for fewer rows (shorter `row.number` vector, filtering with `x > 900`) the `row.filter` approach is about 1.5% faster, while for longer `row.number` vectors, `chaining` is about 0.5% faster. – Uwe Jul 26 '17 at 17:04
  • Here's another benchmark, though I didn't have the patience to run it more than 3x: https://chat.stackoverflow.com/transcript/message/38334338#38334338 – Frank Jul 26 '17 at 17:17
  • And... `w2 = { w = DT[x > th, which = TRUE ]; DT[w, c("x","y")][, w := w ] }` is proving the fastest for me. (Could use `setcolorder` on the result if you want row numbers first.) – Frank Jul 26 '17 at 17:27
  • I am updating the benchmark and trying to understand why it is faster (using `which`, different chaining order?) – R Yoda Jul 26 '17 at 18:08
  • @Frank If you post an answer I will change my vote (sorry Uwe, I was too fast) – R Yoda Jul 26 '17 at 18:25
  • I'm not really sure why it's faster. Creating the intermediate object, `w`, should be slow in some cases, I guess. – Frank Jul 26 '17 at 19:26
  • I was pretty sure, @Frank would come up with an even better/faster `data.table` solution :-) So, no worries. – Uwe Jul 27 '17 at 05:09

Why another benchmark?

  1. Frank mentioned in his comment that there might be a speed-up by switching from .(rn, gear) to c("rn", "gear") but didn't benchmark that separately (see the short example after this list).
  2. In R Yoda's benchmark, the sample data are of type integer but LIMIT <- 500 is of type double. data.table occasionally warns about type conversions, so I wonder what effect the type conversion may have on performance in this case.
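For illustration, here are the two spellings from point 1, applied to the code of the accepted answer (rn being the row number column added there):

data.table(mtcars)[, rn := .I][gear > 4, .(rn, gear)]      # NSE: columns referenced as symbols inside .()
data.table(mtcars)[, rn := .I][gear > 4, c("rn", "gear")]  # SE: columns passed as a character vector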

What to benchmark?

Up to now, 3 answers have been provided which make up five code variants (the names match the benchmark code below):

  • row.filter (R Yoda's .I approach)
  • chaining_nse and chaining_se (the accepted answer's chaining approach, adding row.number := .I)
  • which_nse and which_se (Frank's approach using which = TRUE)

Unfortunately, I couldn't get row.filter to work in an SE version.

Which parameters are used?

  • Problem size (number of rows): 10^2, 10^3, ..., 10^8
  • Different values for LIMIT: 100, 500, 900
  • Type of LIMIT: integer, double to test the effect of type conversion

The number of repetitions is computed from the problem size with a minimum of 3 runs and a maximum of 100 runs.
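This is the following line in the benchmark code below (scales::squish() clamps the value to the given range):

times <- round(scales::squish(sqrt(1E8 / n_rows), c(3L, 100L)))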

Results

Type conversion does cost about 4% (median) to 9% (mean) of performance. So it does matter if you write LIMIT <- 500 or LIMIT <- 500L using L to indicate an integer constant.
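A quick base R check shows the difference (sample() as used above returns integers):

typeof(500)                              # "double"
typeof(500L)                             # "integer"
typeof(sample(1000, 3, replace = TRUE))  # "integer"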

[Chart: Performance loss due to type conversion (relative computing time dbl vs int)]

The performance penalty for using non-standard evaluation is much higher: NSE needs on average more than 50% more time than SE - for both approaches.
(Note that the chart below only shows results for type integer)

[Chart: Performance loss due to non-standard evaluation (relative computing time NSE vs SE, type integer only)]

The chart below for limit 500 and type integer indicates that the SE variants are faster than their NSE counterparts for all problem sizes. Interestingly, chaining_se seems to have a slight advantage over which_se for smaller problem sizes up to 5000 rows while for problem sizes above 5 M rows which_se is clearly faster.

[Chart: Benchmark results on a log-log scale for limit 500 and type integer]

By request, here is a table showing the timings in ms for the above chart:

dcast(bm_med[limit == 500L & type == "int"][
  , expr := forcats::fct_reorder(factor(expr), -time)],
  expr ~ n_rows, fun.aggregate = function(x) max(x/1E6), value.var = "time")
           expr       100      1000     10000    1e+05    1e+06    1e+07    1e+08
1: chaining_nse 0.8189745 0.8493695 1.0115405 2.870750 22.34469 441.1621 2671.179
2:   row.filter 0.7693225 0.7972635 0.9622665 2.677807 21.30861 247.3984 2677.495
3:    which_nse 0.8486145 0.8690035 1.0117295 2.620980 18.39406 219.0794 2341.990
4:  chaining_se 0.5299360 0.5582545 0.6454755 1.700626 12.48982 166.0164 2049.904
5:     which_se 0.5894045 0.6114935 0.7040005 1.624166 13.00125 130.0718 1289.050

Benchmark code

library(data.table)
library(microbenchmark)
run_bm <- function(n_rows, limit = 500L, type = "int") {
  set.seed(1234L)
  DT <- data.table(x = sample(1000, n_rows, replace = TRUE), 
                   y = sample(1000, n_rows, replace = TRUE))
  LIMIT <- switch(type,
                  int = as.integer(limit),
                  dbl = as.double(limit))
  times <- round(scales::squish(sqrt(1E8 / n_rows) , c(3L, 100L)))
  cat("Start run:", n_rows, limit, type, times, "\n")
  microbenchmark(row.filter = {
    row.numbers <- DT[, .I[x > LIMIT]]
    DT[row.numbers, .(row.numbers, x, y)]
  },
  chaining_nse = {
    DT[, row.number := .I][x > LIMIT, .(row.number, x, y)]
  },
  chaining_se = {
    DT[, row.number := .I][x > LIMIT, c("row.number", "x", "y")]
  },
  which_nse = {
    row.numbers <- DT[x > LIMIT, which = TRUE ]
    DT[row.numbers, .(x, y)][, row.numbers := row.numbers ][]
  },
  which_se = {
    row.numbers <- DT[x > LIMIT, which = TRUE ]
    DT[row.numbers, c("x", "y")][, row.numbers := row.numbers][]
  },
  times = times)
}
# parameter
bm_par <- CJ(n_rows = 10^seq(2L, 8L, 1L), 
             limit = seq(100L, 900L, 400L), 
             type = c("int", "dbl"))
# run the benchmarks
bm_raw <- bm_par[, run_bm(n_rows, limit, type), by = .(n_rows, limit, type)]
# aggregate results
bm_med <- bm_raw[, .(time = median(time)), by = .(n_rows, limit, type, expr)]

Graphics code

library(ggplot2)

# chart 1
ggplot(
  dcast(bm_med, n_rows + limit + expr ~ type, value.var = "time")[
    , ratio := dbl / int - 1.0] #[limit == 500L]
) + 
  aes(n_rows, ratio, colour = expr) +
  geom_point() + 
  geom_line() + 
  facet_grid(limit ~ expr) + 
  scale_x_log10(labels = function(x) scales::math_format()(log10(x))) +
  scale_y_continuous(labels = scales::percent) + 
  coord_cartesian(ylim = c(-0.1, 0.5)) +
  geom_hline(yintercept = 0) +
  theme_bw() +
  ggtitle("Performance loss due to type conversion") +
  ylab("Relative computing time dbl vs int") + 
  xlab("Number of rows (log scale)")
ggsave("p2.png")

# chart 2
ggplot(
  dcast(bm_med[, c("code", "eval") := tstrsplit(expr, "_")][!is.na(eval)], 
        n_rows + limit + type + code ~ eval, value.var = "time")[
          , ratio := nse / se - 1.0][type == "int"]
) + 
  aes(n_rows, ratio, colour = code) +
  geom_point() + 
  geom_line() + 
  facet_grid(limit  + type ~ code) + 
  scale_x_log10(labels = function(x) scales::math_format()(log10(x))) +
  scale_y_continuous(labels = scales::percent) + 
  geom_hline(yintercept = 0) +
  theme_bw() +
  ggtitle("Performance loss due to non standard evaluation") +
  ylab("Relative computing time NSE vs SE") + 
  xlab("Number of rows (log scale)")
ggsave("p3.png")

# chart 3
ggplot(bm_med[limit == 500L][type == "int"]) + 
  aes(n_rows, time/1E6, colour = expr) +
  geom_point() + 
  geom_smooth(se = FALSE) + 
  facet_grid(type ~ limit) +
  scale_x_log10(labels = function(x) scales::math_format()(log10(x))) +
  scale_y_log10(labels = function(x) scales::math_format()(log10(x))) +
  theme_bw() +
  ggtitle("Benchmark results (log-log scale)") +
  ylab("Computing time in ms (log scale)") + 
  xlab("Number of rows (log scale)")
ggsave("p1.png")
  • WOW! Regarding NSE: For details see http://adv-r.had.co.nz/Computing-on-the-language.html – R Yoda Jul 28 '17 at 05:41
  • @RYoda Thank you for supplementing the link! Another good read might be the [vignette to Hadley's `lazyeval` package](https://cran.r-project.org/web/packages/lazyeval/vignettes/lazyeval.html). – Uwe Jul 28 '17 at 05:53

You can add a column of row numbers before filtering:

library(data.table)
data.table(mtcars)[, rn := .I][gear > 4, .(rn, gear)]
   rn gear
1: 27    5
2: 28    5
3: 29    5
4: 30    5
5: 31    5

Benchmarking

Just a quick benchmark with the mtcars data set (32 rows), which is far too small, but the focus here is on overhead.

microbenchmark::microbenchmark(
  copy = DT <- data.table(mtcars),
  ryoda = {
    DT <- data.table(mtcars)
    row.numbers <- DT[, .I[ gear > 4]]
    DT[row.numbers, .(row.numbers, gear)]
  },
  uwe = {
    DT <- data.table(mtcars)
    DT[, rn := .I][gear > 4, .(rn, gear)]
  },
  times = 1000L
)
Unit: microseconds
  expr      min       lq     mean   median       uq       max neval cld
  copy  691.710  727.192  803.235  749.385  785.428 15989.293  1000 a  
 ryoda 1821.869 1883.479 2001.653 1930.213 2011.124  6650.497  1000  b 
   uwe 1860.288 1934.191 2053.004 1987.927 2077.370  5908.892  1000   c

Note that each benchmark run is started with a fresh copy of DT because one of the codes is modifying DT in place (using :=).
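To illustrate why the fresh copy matters, a minimal sketch (copy() gives a deep copy if the original must stay untouched):

DT <- data.table(mtcars)
DT[, rn := .I]         # := adds the rn column to DT by reference
"rn" %in% names(DT)    # TRUE - reusing this DT in the next run would bias the comparison
DT2 <- copy(DT)        # modifying DT2 afterwards leaves DT unchanged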

Here, it seems that there is a penalty of 50 to 60 microseconds with chaining for the tiny 32-row sample data set. R Yoda's benchmark results with a large data set of 100 M rows (about 800 MB) show differences of about 1% in both directions, depending on the number of filtered rows, i.e., the length of row.numbers.

  • Yes, chaining is an option, but may cost performance too (even though it looks minor since only a new column/vector is appended). But the filtering is done only once, good answer! – R Yoda Jul 26 '17 at 16:17
  • Oh, that's interesting. I'm aware that piping (`%>%`) may cost performance but I wasn't aware that chaining may cost performance as well (I trust that the `data.table` guys are pretty much focused on efficiency). Do you have an example of this? – Uwe Jul 26 '17 at 16:22
  • Lucky strike, I have found it again: http://r.789695.n4.nabble.com/Memory-usage-of-data-table-chaining-td4703587.html (it is related to memory usage, but memory usage may cost performance) – R Yoda Jul 26 '17 at 16:30
  • Thank you very much for the link. I wonder how this would be answered today, 2 and a half years and a lot of optimisations of `data.table` later. – Uwe Jul 26 '17 at 17:09
  • Absolutely right and would be worth a new SO question since @arun was participating in the linked discussion. I thought there was already an SO question, answer or comment that brought me to this link but I can't find it ATM. – R Yoda Jul 26 '17 at 17:10
  • Chaining doesn't cost you in this case, since you don't "materialize" a new table after the first step, instead just adding by reference to an existing table, right? I guess yours might speed up by switching from `.(rn, gear)` to `c("rn", "gear")` but didn't benchmark that separately. – Frank Jul 26 '17 at 19:31
  • @Frank I've added another [benchmark](https://stackoverflow.com/a/45347181/3817004) comparing both variants. It's more than 50% speed gain! – Uwe Jul 27 '17 at 09:52

A little bit faster for the example in @RYoda's answer:

w = DT[x > LIMIT, which = TRUE ]
DT[w, c("x","y")][, w := w ]

To change the order of the columns in the result, setcolorder should work, taking almost no time.
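For example, a minimal sketch (res is just a hypothetical name for the result of the chain above):

res <- DT[w, c("x", "y")][, w := w]
setcolorder(res, c("w", "x", "y"))   # moves the row number column to the front, by reference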
