I am trying to extract a list of POSIXct login times from a large .csv (~11m rows), then use the cut function to tabulate the number of logins per 15-minute block.

Given the size of the dataset, I am using data.table's fread function. I have managed to achieve my objective; however, I have run into some problems, described below:

    # selective fread
    dt <- fread("foo.csv", colClasses = list(NULL = c(1:5, 8:14), "POSIXct" = c(5, 6)))

Issue: I tried to store the 2 relevant columns as POSIXct, but they appear to be stored as character instead:

    > class(dt$login_datetime)
    [1] "character"

I managed to run the rest of my code by using as.POSIXct as shown below:

    timeLog <- dt[, 1, with = FALSE]
    timeLog <- timeLog[, login_datetime := as.POSIXct(login_datetime)]
    tabulate <- data.frame(table(cut(timeLog$login_datetime, breaks = "15 mins")))

However, the second line takes about 12 minutes to run on my machine. I need to process more datasets in a similar fashion, and while 12 minutes is not devastatingly slow, I am curious whether I can speed up this process (short of hardware upgrades).

Specifically, I tried to get fread to store the relevant columns as POSIXct classes directly and was unable to. I was unable to find anything regarding POSIXct in the data.table vignette here.

Would anyone be able to tell me 1) whether I am doing something wrong regarding fread and colClasses="POSIXct", or 2) whether there are other ways/packages to speed up conversion of a data.table column to POSIXct?

Thanks.

ethane
  • See the lubridate package with `options(lubridate.fasttime = TRUE)` – bergant Mar 17 '15 at 16:47
  • @bergant: that requires the [fasttime](http://www.rforge.net/fasttime/) package, which isn't on CRAN. – Joshua Ulrich Mar 17 '15 at 16:53
  • 1
    Thanks, I will look into that. Is there any reason why `fasttime` isn't on CRAN? Compatibility issues etc, or is it just annoying to deal with CRAN's standards? – ethane Mar 17 '15 at 18:52

1 Answer

I would suggest two options.

I assume you use write.csv or similar, which converts POSIXct to character when writing it out. This slows down both the writing out and the reading in, as POSIXct objects are really numbers, not characters (more precisely, they are seconds since the "epoch"). So you can convert the column to numeric, write that out, and convert back to POSIXct after reading it in (which will be very fast).
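A minimal sketch of this round trip (the file name and column name are hypothetical, and `write.csv` stands in for whatever writes the original file):

```r
library(data.table)

# One-row example column; in practice this would be the ~11m-row dataset
dt <- data.table(login_datetime = as.POSIXct("2015-03-17 16:47:00", tz = "UTC"))

# Before writing out: store seconds-since-epoch instead of a formatted string
dt[, login_datetime := as.numeric(login_datetime)]
write.csv(dt, "foo_numeric.csv", row.names = FALSE)

# After reading back in: numeric -> POSIXct is a cheap attribute change,
# not string parsing, so it is effectively instant even for millions of rows
dt2 <- fread("foo_numeric.csv")
dt2[, login_datetime := as.POSIXct(login_datetime, origin = "1970-01-01", tz = "UTC")]
```

This only helps if you control how the file is written in the first place; if the file arrives with formatted timestamps, see the second option below.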

Another option, if you prefer to write out character columns, is to use fastPOSIXct from fasttime to speed up the conversion to POSIXct.
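A sketch of that conversion, assuming a small hypothetical sample column (note that `fastPOSIXct` parses the input in C and assumes the timestamps are GMT/UTC):

```r
library(data.table)
# fasttime is not on CRAN; it can be installed from rforge:
# install.packages("fasttime", repos = "http://www.rforge.net/")
library(fasttime)

# Hypothetical two-row sample of character timestamps
dt <- data.table(login_datetime = c("2015-03-17 16:47:00", "2015-03-17 16:53:00"))

# fastPOSIXct replaces the slow as.POSIXct() string parse
dt[, login_datetime := fastPOSIXct(login_datetime, tz = "UTC")]
class(dt$login_datetime)
```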

eddi
  • By "I assume you use write.csv or similar", do you mean to say that the original .csv dataset was written using the `write.csv` command? As for your solution, are you saying I should read the data in as `colClasses=numeric`, and the conversion from `numeric` to `POSIXct` is faster? – ethane Mar 17 '15 at 18:50
  • Yes to first question. To second question I suggest writing out those columns in numeric format (which will require explicit conversion if you use `write.csv`, i.e. `foo[, numCol := as.numeric(posixCol)]` and then write that out) and then reading back in numeric format and converting back to posix (i.e. after reading in - `foo[, posixCol := as.POSIXct(numCol)]`). – eddi Mar 17 '15 at 19:17
  • Hmm, well the dataset did not come from me and I do not have control over it. I guess I will have to try reading it in as `colClasses=numeric`, and failing that, look into `fasttime`. In any case, how is it different if the .csv is written as `numeric` or `character`? I thought that a .csv simply stores the data, i.e. it stores a "7" as a byte regardless of whether it was numeric or a character. – ethane Mar 17 '15 at 19:47
  • `colClasses=numeric` won't help unfortunately, since your file stores text, not a number, which brings us to second part: .csv files are simple text files, *not* binary files, so when you write the number `12345` in a csv file, it's written as text and so will use 5 bytes. – eddi Mar 17 '15 at 19:56
  • I see. Since I can't control the way the .csv is written, I guess my only option is to look at the `fasttime` package, or deal with the speed. Thanks! – ethane Mar 17 '15 at 23:31