
I have a csv which includes about 2 million rows of date strings in the format:

2012/11/13 21:10:00 

Let's call that csv$Date.and.Time.

I want to convert these dates (and their accompanying data) to xts as fast as possible.

I have written a script which performs the conversion just fine (see below), but it's terribly slow and I'd like to speed this up as much as possible.

Here is my current methodology. Does anyone have any suggestions on how to make this faster?

 dt <- as.POSIXct(csv$Date.and.Time,tz="UTC")

idx <- format(dt,tz=z,usetz=TRUE)

So the script converts these date strings to POSIXct. It then does a timezone conversion using format (z is a variable holding the time zone I am converting to). I then make a regular xts call to turn this into an xts series with the rest of the data in the csv.
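For reference, a minimal base-R sketch of the same two steps that keeps the index as POSIXct instead of formatting it back into strings. The sample vector `x` and the value of `z` are assumptions made up for illustration; the format string simply mirrors the date layout shown above.

```r
# Hypothetical sample standing in for csv$Date.and.Time
x <- c("2012/11/13 21:10:00", "2012/11/13 21:15:00")
z <- "America/New_York"  # assumed target time zone

# An explicit format saves as.POSIXct from having to guess the layout
dt <- as.POSIXct(x, format = "%Y/%m/%d %H:%M:%S", tz = "UTC")

# Changing the tzone attribute only alters how the instants are displayed;
# the underlying numeric values are untouched and no strings are built
attr(dt, "tzone") <- z
```

Because `dt` is still a POSIXct vector, it could be passed to xts as the `order.by` index without another round of parsing.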

This works 100%. It's just very, very slow. I've tried running this in parallel (it doesn't do anything; if anything it makes it worse). What do I mean by 'slow'?

 user    system   elapsed 
155.246  16.430 171.650 

That's on a 3 GHz, 16 GB RAM 2012 MacBook Pro. I can get about half that time on a similar processor with 32 GB RAM on a Win7 machine.

I'm sure someone has a better idea - I'm open to suggestions via Rcpp etc. However, ideally the solution works with the csv rather than some other method, like setting up a database. Having said that, I'm up to doing this via whatever method is going to give the fastest conversion.

I'd be super appreciative of any help at all. Thanks in advance.

mbinette
n.e.w
  • Do you know which step is the one slowing things down - the `as.POSIXct` step, the `format` step or the `xts` step? – mathematical.coffee Nov 30 '12 at 03:38
  • If you search for fasttime (which you wouldn't have known to do without Dirk's answer), you'll find a couple similar Qs [LINK1](http://stackoverflow.com/questions/12898318/convert-character-to-date-quickly-in-r), [LINK2](http://stackoverflow.com/questions/12786335/why-is-as-date-slow-on-a-character-vector) – GSee Nov 30 '12 at 15:32

2 Answers


You want the small and simple fasttime package by Simon, which does this in the fastest possible way: by not calling time parsing functions but just using C-level string functions.

It does not support as many formats as strptime. In fact, it doesn't even have a format string. But well-formed ISO format variants, that is yyyy-mm-dd hh:mm:ss.fff will work, and your / separator may just work too.
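A minimal sketch of the call, assuming the fasttime package is installed. The sample vector `x` is made up for illustration. `fastPOSIXct` interprets its input as UTC/GMT, and its `tz` argument only sets the time zone used for display.

```r
library(fasttime)  # assumes the fasttime package is installed

# Hypothetical sample standing in for csv$Date.and.Time
x <- c("2012/11/13 21:10:00", "2012/11/13 21:15:00")

# Input is read as UTC/GMT; tz only controls how the result is printed
dt <- fastPOSIXct(x, tz = "UTC")
```

The result is an ordinary POSIXct vector, so it can be used wherever the output of `as.POSIXct` was used before.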

Dirk Eddelbuettel
  • My god -- I'm rarely astonished at how effective a solution can be, but I was by this! Thank you so much. Where do I send you guys a cheque?? – n.e.w Nov 30 '12 at 07:05
  • `system.time(dts <- fastPOSIXct(csv$Date.and.Time,"UTC"))` gives user 0.065, system 0.000, elapsed 0.065 – n.e.w Nov 30 '12 at 07:06
  • It generally helps to know what one is doing, and Simon really has a knack for that :) – Dirk Eddelbuettel Nov 30 '12 at 14:50

Try using lubridate - it does all date time parsing using regular expressions, so not only is it much faster, it's also much more flexible.
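A minimal sketch of what this might look like, assuming the lubridate package is installed. The sample vector `x` and the target zone `z` are made up for illustration, and no timing is claimed here.

```r
library(lubridate)  # assumes the lubridate package is installed

# Hypothetical sample standing in for csv$Date.and.Time
x <- c("2012/11/13 21:10:00", "2012/11/13 21:15:00")
z <- "America/New_York"  # assumed target time zone

# ymd_hms accepts a range of separators, including "/"
dt <- ymd_hms(x, tz = "UTC")

# with_tz keeps the same instants and changes only the display time zone
dt_local <- with_tz(dt, tzone = z)
```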

hadley