
Perhaps this is already answered and I missed it, but it's hard to search.

A very simple question: Why is dt[,x] generally a tiny bit faster than dt$x?

Example:

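library(data.table)
library(microbenchmark)

dt<-data.table(id=1:1e7,var=rnorm(1e7))

# Label the expressions in the call: microbenchmark runs them in random
# order, so relabelling the result rows afterwards can mislabel timings.
test<-microbenchmark(times=100L,
                     "in j"=dt[sample(1e7,size=200000),var],
                     "$"=dt[sample(1e7,size=200000),]$var)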

Unit: milliseconds
 expr      min       lq     mean   median       uq      max neval
    $ 14.28863 15.88779 18.84229 17.23109 18.41577 53.63473   100
 in j 14.35916 15.97063 18.87265 17.99266 18.37939 54.19944   100

I might not have chosen the best example, so feel free to suggest something more illustrative.

Anyway, evaluating in j is faster at least 75% of the time (though there appears to be a fat upper tail, as the mean is higher). As a side note, it would be nice if microbenchmark could produce histograms.

Why is this the case?

MichaelChirico
  • I see essentially no difference there. You are grabbing a column, which takes no time...and then doing this other thing -- sampling and subsetting -- which takes a lot of time. I don't understand why. – Frank Apr 29 '15 at 23:30
  • 0.03 milliseconds mean difference? We're at the thin edge of the wedge here surely? – thelatemail Apr 29 '15 at 23:34
  • mean isn't as useful here. 25th, 50th, and 75th percentiles are all more distinguished and in the same direction. – MichaelChirico Apr 29 '15 at 23:36
  • @Frank, yes the big fixed cost is the same, but the differential is there anyway. but you're right that it's hard to compare the scale of the difference when the times are mainly driven by the subsetting. – MichaelChirico Apr 29 '15 at 23:39
  • linking [this](http://stackoverflow.com/questions/41539202/changing-factor-levels-on-a-column-with-setattr-sensitive-for-how-column-is-crea) question relevant to users wondering: should I use `[` or `$`? – MichaelChirico Jan 09 '17 at 00:24
  • @MichaelChirico Thanks for linking. Interesting Q (++). – Henrik Jan 09 '17 at 01:43
  • @Henrik thanks! yours is much more interesting IMO. in retrospect it's sort of clear -- both of my calls are running `[.data.table` (with all its overhead), but the former evaluates `j` within that call. That said mnel's answer is great. – MichaelChirico Jan 09 '17 at 01:55

1 Answer


With j, you are subsetting and selecting within a call to [.data.table.

With $ (as in your call), you are subsetting within [.data.table and then selecting with $.

In essence you are calling two functions instead of one, which accounts for the small (negligible) difference in timing.
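
A minimal sketch of the two call paths, using a small hypothetical table (the names `small`, `rows`, `in_j`, `tmp` and `dollar` are illustrative and not from the question). Both forms go through `[.data.table` for the row subset; the `$` form then makes a second call to pull the column out of the intermediate table.

```r
library(data.table)

# Hypothetical toy table, for illustration only
small <- data.table(id = 1:10, var = (1:10) / 10)
rows  <- c(2L, 5L, 7L)

# One call: subset rows and select the column inside [.data.table
in_j <- small[rows, var]

# Two calls: [.data.table builds an intermediate data.table,
# then $ extracts the column from it
tmp    <- small[rows]
dollar <- tmp$var

# Both routes should give the same vector
identical(in_j, dollar)
```

The only extra work in the second route is materialising `tmp` with all of its columns and then extracting one of them, which is small next to the cost of the row subset itself.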

In your current example you are calling `sample(1e7, size=200000)` separately in each expression, so the two versions are subsetting different rows.

For a comparison that returns identical results:

dt<-data.table(id=1:1e7,var=rnorm(1e7))
setkey(dt, id)
ii <- sample(1e7,size=200000)


microbenchmark("in j" = dt[.(ii),var],
               "$" = dt[.(ii)]$var,
               '[[' = dt[.(ii)][['var']],
               .subset2(dt[.(ii)],'var'),
               dt[.(ii)][[2]],
               dt[['var']][ii],
               dt$var[ii],
               .subset2(dt,'var')[ii])
Unit: milliseconds
                       expr       min        lq      mean    median        uq       max neval cld
                       in j 39.491156 40.358669 41.570057 40.860342 41.485622 70.202441   100   b
                          $ 39.957211 40.561965 41.587420 41.136836 41.634584 69.928363   100   b
                         [[ 40.046558 40.515480 42.388432 41.244444 41.750946 72.224827   100   b
 .subset2(dt[.(ii)], "var") 39.772781 40.564077 41.561271 41.111630 41.635489 69.252222   100   b
             dt[.(ii)][[2]] 40.004300 40.513669 41.682526 40.927503 41.492866 72.986995   100   b
            dt[["var"]][ii]  4.432346  4.546898  4.946219  4.623416  4.755777 31.761115   100  a 
                 dt$var[ii]  4.440496  4.539502  4.668361  4.597457  4.729214  5.425125   100  a 
    .subset2(dt, "var")[ii]  4.365939  4.508261  4.660435  4.598815  4.703858  6.072289   100  a 
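
The last three rows extract the whole column as a plain vector first and then index that vector, so the row subset never goes through `[.data.table`. A small sketch on a hypothetical toy table (the names `small`, `rows` and the `via_*` variables are assumptions for illustration) checks that these forms agree with the keyed join when `id` is simply `1:n`, so that key values coincide with row numbers:

```r
library(data.table)

# Hypothetical toy table, keyed on id as in the benchmark above
small <- data.table(id = 1:10, var = (1:10) / 10)
setkey(small, id)
rows <- c(7L, 2L, 5L)

# Keyed join on id, then select var inside [.data.table
via_join <- small[.(rows), var]

# Extract the column as a vector first, then use ordinary vector indexing
via_dollar   <- small$var[rows]
via_bracket  <- small[["var"]][rows]
via_subset2  <- .subset2(small, "var")[rows]

# All four routes should return the same values in the same order
all(identical(via_join, via_dollar),
    identical(via_join, via_bracket),
    identical(via_join, via_subset2))
```

This equivalence depends on `id` matching the row positions; with arbitrary key values, the vector-indexing forms would select by position rather than by key.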
mnel
  • So the difference between `$` and `j` is actually quite small compared to your last options, where we pull out the `var` vector first and then subset, which appear to be 10x faster. – MichaelChirico Apr 30 '15 at 00:29
  • @MichaelChirico It's not 10x faster; it's 50% faster in my run of the simulation. The "in j" and "$" in mnel's benchmark are not the same ones you used... note the `setkey` and `.(ii)` vs your `ii`. – Frank Apr 30 '15 at 00:52