2

I have a large data table, (1 billion rows × 50 columns) similar to flights from library(nycflights13), where multiple columns can be combined to form a date.

The code I am currently using to create this date-time column is:

library(data.table)
library(nycflights13)
library(fasttime)

flights <- as.data.table(flights)

flights[,DepDateTime := fastPOSIXct(paste0(year, 
                                           "-",
                                           formatC(month, width = 2, format = "d", flag = "0"),
                                           "-",
                                           formatC(day, width = 2, format = "d", flag = "0"), 
                                           " ",
                                           # replace e.g. 903 with '09:03:00'
                                           gsub("([0-9]{2})([0-9]{2})", "\\1:\\2:00", 
                                                formatC(dep_time, width = 4, 
                                                        format = "d", flag = "0")))
                                           )]

For the flights data, this takes around 0.6 s. Is there any way to improve this performance? I am interested in timing primarily; memory usage is a secondary concern.

Here is a candidate data table:

flights.big <- 
data.table(year = sample(1980:2015, size = 1e9, replace = TRUE),
           month = sample(1:12, size = 1e9, replace = TRUE), 
           day = sample(1:28, size = 1e9, replace = TRUE),
           hour = sample(1:12, size = 1e9, replace = TRUE),
           minute = sample(0:59, size = 1e9, replace = TRUE)
           )
Hugh
  • 15,521
  • 12
  • 57
  • 100
  • What would still be acceptable improvement for 1 billion rows? – Roman Luštrik Jan 02 '16 at 08:13
  • Currently, it's taking approximately 15 minutes to complete. Is 5 minutes possible? Or should I try to obtain better hardware (currently running a 64 Gb RAM machine) – Hugh Jan 02 '16 at 08:16
  • Ah, I was under the impression that your dataset (and flights) takes 0.6 seconds. It would be nice if you could simulate a dataset that mimics yours so that timings would be comparable. – Roman Luštrik Jan 02 '16 at 08:18
  • I've added a candidate data set. Naturally, it is taking a long time to test whether the timings are comparable :-) – Hugh Jan 02 '16 at 08:26
  • This could be useful. http://stackoverflow.com/questions/20483217/improve-performance-of-data-table-datetime-pasting?rq=1 – Hugh Jan 02 '16 at 08:31
  • 4
    I think you can expect some slight performance improvements by using `sub` instead of `gsub` with `perl = TRUE` and replace `formatC` with `sprintf` using `"%02d"` -- especially when getting rid of `paste` and concatenating within the `sprintf` call, something like `sprintf("%d-%02d-%02d %s", year, month, day, sub("([0-9]{2})([0-9]{2})", "\\1:\\2:00", sprintf("%04d", dep_time), perl = TRUE))` – daroczig Jan 02 '16 at 08:54
  • 3
    `fastPOSIXct` seems to handle correctly missing 0s and meaningless separators in the soon-to-be-"Date" "character" vector; so you could avoid many calls, to `formatC/sub/etc`. E.g. `identical(fastPOSIXct("2013-05-24 05:04:21"), fastPOSIXct("2013-5-24-5-4-21"))`. BTW, since you -already- have numeric vectors for y/m/d/h/m you could search for other ways to convert to "Date" avoiding the intermediate "character" conversion. – alexis_laz Jan 02 '16 at 09:59
  • Please provide an example of your actual data. Your candidate data set does not include a `dep_time` column, but you say "the code I am currently using" does. – Joshua Ulrich Jan 02 '16 at 15:53

2 Answers2

2

I used lubridate and stringr to get around a 25% performance boost on the flights data. Unfortunately I'm not currently on a computer which can handle a data set as big as your full set so hopefully it scales.

library(data.table)
library(nycflights13)
library(fasttime)
library(microbenchmark)
library(lubridate)
library(stringr)

flights <- as.data.table(flights)

op1 <- microbenchmark(
  flights[,DepDateTime := fastPOSIXct(paste0(year, 
                                             "-",
                                             formatC(month, width = 2, format = "d", flag = "0"),
                                             "-",
                                             formatC(day, width = 2, format = "d", flag = "0"), 
                                             " ",
                                             # replace e.g. 903 with '09:03:00'
                                             gsub("([0-9]{2})([0-9]{2})", "\\1:\\2:00", 
                                                  formatC(dep_time, width = 4, 
                                                          format = "d", flag = "0")))
  )],
  times=50L)

op2 <- microbenchmark(
  flights[,DepDateTime := ymd_hm(paste(year, 
                                       month, 
                                       day, 
                                       str_pad(dep_time,
                                               width = 4,
                                               side = "left",
                                               pad = "0"), 
                                       sep = "-"))],
  times=50L)

The benchmarks on my computer are

 >op1
      min       lq     mean   median       uq      max neval
 3.385542 3.526347 3.739545 3.679273 3.855418 4.594314    50
>op2
      min       lq     mean   median       uq      max neval
 2.536882 2.589711 2.733829 2.715038 2.835111 3.194575    50
NGaffney
  • 1,542
  • 1
  • 15
  • 16
0

A considerable speed increase was achieve by using joins and sprintf in a function (create_fn). The increase is more modest for the smaller dataset:

enter image description here

library(data.table)
library(nycflights13)
library(fasttime)
library(microbenchmark)
library(ggplot2) # for autoplot

create_DepDateTime <- function(DT){
  setkey(DT, year, month, day, dep_time)
  unique_dates <- unique(DT[,list(year, month, day, dep_time)])
  unique_dates[,DepDateTime := fastPOSIXct(sprintf("%d-%02d-%02d %s", year, 
                                                   month, 
                                                   day, 
                                                   sub("([0-9]{2})([0-9]{2})", 
                                                       "\\1:\\2:00",
                                                       sprintf("%04d", dep_time), 
                                                       perl = TRUE)), 
                                           tz = "GMT")]
  DT[unique_dates]
}

flights <- as.data.table(flights)

BENCHMARK <- function(){
  flights[,DepDateTime := fastPOSIXct(paste0(year, 
                                             "-",
                                             formatC(month, width = 2, 
                                                     format = "d", flag = "0"),
                                             "-",
                                             formatC(day, width = 2, 
                                                     format = "d", flag = "0"), 
                                             " ",
                                             # replace e.g. 903 with '09:03:00'
                                             gsub("([0-9]{2})([0-9]{2})", 
                                                  "\\1:\\2:00", 
                                                  formatC(dep_time, 
                                                          width = 4, 
                                                          format = "d", 
                                                          flag = "0")))
  )]
}

NGaffney_lubridate <- function(){
  flights[,DepDateTime := lubridate::ymd_hm(paste(year, 
                                                  month, 
                                                  day, 
                                                  stringr::str_pad(dep_time,
                                                                   width = 4,
                                                                   side = "left",
                                                                   pad = "0"), 
                                                  sep = "-"))]
}
create_fn <- function(){
  flights <- create_DepDateTime(flights)
}

autoplot(
  microbenchmark(
  BENCHMARK(),
  NGaffney_lubridate(),
  create_fn(),
  times=50L
  )
)
Hugh
  • 15,521
  • 12
  • 57
  • 100