
This is a hellish question related to floating-point approximations and timestamps in R. Get ready :) Consider this simple example:

library(tibble)
library(lubridate)
library(dplyr)

tibble(timestamp_chr1 = c('2014-01-02 01:35:50.858'),
       timestamp_chr2 = c('2014-01-02 01:35:50.800')) %>% 
  mutate(time1 = lubridate::ymd_hms(timestamp_chr1),
         time2 = lubridate::ymd_hms(timestamp_chr2),
         timediff = as.numeric(time1 - time2))


# A tibble: 1 x 5
  timestamp_chr1          timestamp_chr2          time1                      time2                       timediff
  <chr>                   <chr>                   <dttm>                     <dttm>                         <dbl>
1 2014-01-02 01:35:50.858 2014-01-02 01:35:50.800 2014-01-02 01:35:50.858000 2014-01-02 01:35:50.799999 0.0580001

Here the time difference between the two timestamps is obviously 58 milliseconds, yet R stores that with some floating-point approximation, so it appears as 0.0580001 seconds.

What is the safest way to get exactly 58 milliseconds as an answer instead? I thought about using as.integer (instead of as.numeric), but I am worried about some loss of information. What can be done here?

Thanks!

ℕʘʘḆḽḘ
  • what's wrong with `round(as.numeric(time1 - time2)*1000)`? – jay.sf Mar 05 '20 at 15:05
  • That seems a bit crude. I would like something more sophisticated that can deal with unexpected cases. I have read about `bigint` in R but not sure it can be applied here. – ℕʘʘḆḽḘ Mar 05 '20 at 15:08
  • http://dirk.eddelbuettel.com/code/nanotime.html – rawr Mar 05 '20 at 15:13
  • Running your code in Rstudio, my results show fewer digits. Is there something special you do to show all the digits you're getting? – markhogue Mar 05 '20 at 15:16
  • @markhogue: [`options(digits.secs=6)`](https://stat.ethz.ch/R-manual/R-devel/library/base/html/options.html). – r2evans Mar 05 '20 at 15:17
  • If you're stuck with using `POSIXt`, and the real accuracy of your data does not exceed the millisecond, then jay.sf's suggestion (the use of `round`) is your best way forward. Without adding additional packages, another option is to shift away from `POSIXt` to a *milliseconds epoch*; you won't be able to use `integer` (`bigint`, perhaps), but `numeric` will cover it without loss of millisecond accuracy, differences can be `integer` if needed, and you can get precisely 58 milliseconds. – r2evans Mar 05 '20 at 15:22
  • thanks @r2evans, I actually don't mind using as many packages as needed. What is your suggestion exactly? – ℕʘʘḆḽḘ Mar 05 '20 at 15:43
  • Frankly (and I deal with millisecond-resolution all day), I'm comfortable with keeping `POSIXt` and using `round(...,3)`, given the context that the data supports millisecond accuracy. I don't fight the floating-point numbers in the raw data, I just urge milliseconds in *reporting* by using rounding and/or `sprintf`. – r2evans Mar 05 '20 at 15:56
  • If your problem is in potentially *introducing* error at data entry, then either (1) round on each field when you do any calcs on it (onerous, prone to missing one); (2) shift to milliseconds `integer`, relative to some starting point recent-enough that you don't exceed R's integer (about 24 days of milliseconds); or (3) use `bigint` and shift to milliseconds epoch. – r2evans Mar 05 '20 at 16:04
  • @r2evans thanks again. for completeness do you mind posting your solution with `bigint`? – ℕʘʘḆḽḘ Mar 06 '20 at 13:08
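
For reference, a minimal sketch of the `round(...) * 1000` suggestion from the comments, assuming the data really is millisecond-accurate (`t1`/`t2` are just illustrative names):

library(lubridate)

t1 <- ymd_hms("2014-01-02 01:35:50.858")
t2 <- ymd_hms("2014-01-02 01:35:50.800")

# scale the seconds-difference to milliseconds, then round away the
# floating-point noise
round(as.numeric(t1 - t2) * 1000)
# [1] 58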

1 Answer


Some considerations, some I think you already know:

  • floating-point will rarely give you perfectly 58 milliseconds (due to R FAQ 7.31 and IEEE-754);

  • display of the data can be managed on the console with options(digits.secs=3) (and digits=3) and in reports with sprintf, format, or round (see the sketch just after this list);

  • calculation "goodness" can be improved if you round before calculating; this is a little more onerous, but as long as we can safely assume that the data is accurate to at least the millisecond, it holds mathematically.
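
A minimal sketch of the display/reporting points above, reusing the question's timestamps (`t1`/`t2` are illustrative names; exact console output can vary slightly by platform):

library(lubridate)

# console display: show (up to) three fractional digits of seconds
options(digits.secs = 3)

t1 <- ymd_hms("2014-01-02 01:35:50.858")
t2 <- ymd_hms("2014-01-02 01:35:50.800")
t1
# [1] "2014-01-02 01:35:50.858 UTC"

# report the difference rounded to millisecond precision
round(as.numeric(t1 - t2), 3)
# [1] 0.058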

If you're concerned about introducing errors in the data, though, an alternative is to encode as milliseconds (instead of the R norm of seconds). If you can choose an arbitrary and recent (under 24 days) reference point, then you can do it with normal integer, but if that is insufficient or you prefer to use epoch milliseconds, then you need to jump to 64-bit integers, perhaps with bit64.

now <- Sys.time()
# epoch seconds still fit in a 32-bit integer ...
as.integer(now)
# [1] 1583507603
# ... but epoch milliseconds overflow it
as.integer(as.numeric(now) * 1000)
# Warning: NAs introduced by coercion to integer range
# [1] NA
# 64-bit integers hold epoch milliseconds comfortably
bit64::as.integer64(as.numeric(now) * 1000)
# integer64
# [1] 1583507603439
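
Applied to the question's two timestamps, a sketch of the same idea (`ms1`/`ms2` are illustrative names): round to whole epoch milliseconds first, which a double represents exactly up to 2^53, then store as integer64 so the difference is exactly 58.

t1 <- lubridate::ymd_hms("2014-01-02 01:35:50.858")
t2 <- lubridate::ymd_hms("2014-01-02 01:35:50.800")

# whole-millisecond epoch values, exact after rounding
ms1 <- bit64::as.integer64(round(as.numeric(t1) * 1000))
ms2 <- bit64::as.integer64(round(as.numeric(t2) * 1000))
ms1 - ms2
# integer64
# [1] 58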
r2evans