
In R, an advantage of using integers over doubles is object size. I'm somewhat surprised not to find a similar advantage in performance. My naive expectation was that operating with less information would be more efficient.

My work involves a lot of number crunching, and I wanted to decide whether to consistently use the integer or the double type in my data.tables and functions.

I'm aware of integer overflow, but it is not an issue for my specific variables here.

I am talking about variables whose nature is integer: they never become fractions/decimals. They still need to be transformed (using R's operators), but the result should again be an integer. The example below converts HHMMSS time stamps into seconds since midnight:

# HHMMSS time stamps, stored as doubles (d) and as integers (i)
set.seed(1)
d <- sample(c(31318, 110221, 103351, 72108, 231533, 155212, 173406), 1e4, replace = TRUE)
i <- as.integer(d)

# f1: plain division and trunc(), result coerced to integer at the end
f1 <- function(x){
  hour <- trunc(x / 1e4)
  min  <- trunc((x - hour * 1e4) / 1e2)
  sec  <- x - hour * 1e4 - min * 1e2
  as.integer(hour * 3600 + min * 60 + sec)
}

# f2: %% and %/% with double literals (1e4, 1e2), coerced to integer at the end
f2 <- function(x){
  hh <- x %/% 1e4
  mm <- x %% 1e4 %/% 1e2
  ss <- x %% 1e2
  as.integer(hh * 3600 + mm * 60 + ss)
}

# f1i: division with integer literals, each component coerced via as.integer()
f1i <- function(x){
  hour <- as.integer(x / 1e4L)
  min  <- as.integer((x - hour * 1e4L) / 1e2L)
  sec  <- as.integer(x - hour * 1e4L - min * 1e2)
  hour * 3600L + min * 60L + sec
}

# f2i: %% and %/% with integer literals, no explicit coercion
f2i <- function(x){
  hh <- x %/% 1e4L
  mm <- x %% 1e4L %/% 1e2L
  ss <- x %% 1e2L
  hh * 3600L + mm * 60L + ss
}
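
Just as a sanity check (added on top of the original code), the four variants agree on the integer input:

# All four variants return identical integer results for i
stopifnot(
  identical(f1(i), f2(i)),
  identical(f1(i), f1i(i)),
  identical(f1(i), f2i(i))
)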

microbenchmark::microbenchmark(
  f1(i), f2(i), f1i(i), f2i(i), 
  f1(d), f2(d), f1i(d), f2i(d), 
  times = 1e2
)

Unit: microseconds
   expr     min       lq     mean   median       uq      max neval
  f1(i) 277.413 279.4670 316.0315 282.1055 341.3420  928.132   100
  f2(i) 705.557 707.0230 829.8002 710.6880 796.6105 5366.158   100
 f1i(i) 355.124 356.5910 451.0255 358.4965 449.4035 3242.158   100
 f2i(i) 346.620 347.7930 391.1675 349.6990 366.5605  989.714   100
  f1(d) 237.824 240.3175 350.9075 242.5170 295.3025 6946.476   100
  f2(d) 702.037 703.9435 869.6909 708.1960 874.7610 5113.378   100
 f1i(d) 341.048 342.9545 514.6488 345.0075 428.8765 4231.285   100
 f2i(d) 705.556 707.3160 777.2969 710.3955 882.5325 1855.678   100

object.size(d) # 80048 bytes
object.size(i) # 40048 bytes
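
As a side check (in addition to the benchmark above), typeof() shows where R silently falls back to double even when the input is integer:

typeof(i / 1e4)      # "double"  -- "/" always returns a double
typeof(i %/% 1e4)    # "double"  -- the double literal 1e4 promotes the result
typeof(i %/% 1e4L)   # "integer" -- integer only if both operands are integer
typeof(d %/% 1e4L)   # "double"  -- a double operand promotes it again
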
  • Why is there no performance advantage to consistently operating with integers?
  • What is the use of modulus or integer division if trunc((x - hour * 1e4) / 1e2) is more efficient than x %% 1e4L %/% 1e2L?
  • And most importantly, what would be best practice from the point of view of an experienced R / data.table user?
  • You are doing arithmetic calculations, while data.table has advantages when integers are used for grouping or joins, I guess. Not really clear if/how you need data.table. – Frank Aug 13 '18 at 14:07
  • I believe some functions will coerce `integer` to `numeric` and some will do the opposite. You'll be faster when you avoid conversions; you can get a clue about the expected type by looking at the parameter description and/or at the default values. I don't know much about `C` and `C++` but I believe R integers are translated to `C` or `C++` numeric (from what I understood there: https://stackoverflow.com/questions/51010539/how-to-avoid-that-anytimenumeric-updates-by-reference/51010657#51010657 ), so functions operating in `C` like `%%` and `%/%` might show an overhead due to this conversion. – moodymudskipper Aug 13 '18 at 14:09
  • Please show definitions of `d` and `i`. – Bhas Aug 13 '18 at 14:12
  • Btw, if you really have 10000 obs but 7 distinct values, you can do this instantly (group by unique values and do the computation...) – Frank Aug 13 '18 at 14:13
  • @Frank Yea, data.table is only loosely related to the question. It might serve some understanding of my fuzzy performance issue. That's why my fifth tag originally was not data.table but operators – rluech Aug 13 '18 at 14:15
  • @Frank absolutely, but this is really just an example here. I do the calculation by grouping even with 86400 seconds a day. – rluech Aug 13 '18 at 14:18
  • Ok cool, I guess you know what I mean, but: `system.time(z <- setDT(list(i))[, dur := V1%/%1e4L*3600L + (V1 %% 1e4L %/% 1e2L * 60L) + V1 %% 1e2L, by=V1][])` where z$dur is the desired vector. – Frank Aug 13 '18 at 14:23
  • your benchmark actually shows that any function used on `i` is faster than when it's used on `d`, and `f1i` is slower than `f1` because it goes through `as.integer` multiple times. The only strange thing I see is that `f2i(d)` is faster than `f2(d)` in your benchmark, but it's not something I can reproduce on my machine. – moodymudskipper Aug 13 '18 at 14:36
  • @Frank I guess your point is that it really doesn't matter as long as I'm using data.table? Fair enough. – rluech Aug 13 '18 at 14:41
  • @Moody_Mudskipper My benchmarks vary from run to run a bit... but the pattern stays the same I think. f1(d) is fastest, suggesting keeping everything as doubles is best. This contradicts my naive expectation, see above. Also the elegant modulo and integer division are apparently not that elegant after all ... – rluech Aug 13 '18 at 14:47
  • I was looking at the mean when I wrote my comment, and then `f1(i)` is fastest, looking at the median we see what you describe indeed – moodymudskipper Aug 13 '18 at 14:52
  • I guess there are two points to make (neither of which is deep): you need a very large vector for this to be slow enough to matter; and if you have a large vector but few distinct values in it, of course grouping can make it faster. A benchmark here: https://chat.stackoverflow.com/transcript/message/43582021#43582021 – Frank Aug 13 '18 at 16:08
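
For reference, Frank's grouped data.table approach from the comments, written out as a self-contained sketch (the column name hms is just a placeholder of mine):

library(data.table)

# Group by the few distinct HHMMSS values, so the conversion runs
# once per unique value instead of once per row
z <- setDT(list(hms = i))[
  , dur := hms %/% 1e4L * 3600L + hms %% 1e4L %/% 1e2L * 60L + hms %% 1e2L,
  by = hms
][]
head(z)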
