I am doing some pre-processing on data from multiple sources (multiple large CSVs, above 500 MB), applying some transformations and ending up with a final tibble dataset which has all the data that I need in a tidy "format". At the end of that pre-processing, I save that final tibble as an .RData file that I import later for my subsequent statistical analysis.

The problem is that the tibble dataset is very big (it takes 5 GB of memory in the R workspace) and it is very slow to save and to load. I haven't timed it exactly, but it takes over 15 minutes to save that object, even with compress = FALSE.

Question: Do I have any (ideally easy) options to speed all this up? I already checked, and the data types in the tibble are all as they should be (character is character, numeric is dbl, etc.).

Thanks

Jean_N
  • Have you looked into a data.table approach? – Roman Jan 16 '19 at 14:41
  • If the problem is file saving and loading speeds and not the actual data processing, I would look at the `fst` library (or similar libraries offering different formats to save R datasets) – IceCreamToucan Jan 16 '19 at 14:44
  • @Roman, no, could you please elaborate on that? Ideally, I would like to continue working with that tibble as all my other code (the statistical analysis part) works quite nicely on a tibble. – Jean_N Jan 16 '19 at 14:46
  • @IceCreamToucan: I will take a look at the fst package, thank you. The actual data processing is also taking a lot of time but currently not the biggest PITA. If you have some ideas on how to speed up the actual data processing too, I'd love to hear them. The main problem there (in terms of time) is the loading of the different big csv files that takes a long long time. – Jean_N Jan 16 '19 at 14:48

1 Answer

read_csv and the other readr functions aren't the fastest, but they make things really easy. Per the comments on your question, data.table::fread is a great option for speeding up the import of data into data frames. In my experience it is roughly 7x faster than read_csv, though the exact gain will depend on your files. Those data frames can then easily be changed to tibbles using dplyr::as_tibble. You also may not even need to convert the data frames to tibbles prior to processing, as most tidyverse functions will accept a plain data frame as input.
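A minimal sketch of that workflow, in case it helps. The file and column names here are made up for illustration (a small temporary CSV stands in for your large files), and the last step uses the `fst` package suggested in the comments, which I am assuming you can install; it is skipped if the package is not available.

```r
library(data.table)  # fread()
library(tibble)      # as_tibble() (also re-exported by dplyr)

# Hypothetical stand-in for one of the large CSV files, so the sketch runs as-is.
csv_path <- tempfile(fileext = ".csv")
write.csv(
  data.frame(id = 1:3, name = c("a", "b", "c"), value = c(1.5, 2.5, 3.5)),
  csv_path,
  row.names = FALSE
)

# fread() returns a data.table by default; data.table = FALSE gives a plain data frame.
raw <- fread(csv_path, data.table = FALSE)

# Convert to a tibble so the downstream tidyverse code keeps working unchanged.
dat <- as_tibble(raw)

# Optional: save and reload with fst instead of .RData (only if fst is installed).
if (requireNamespace("fst", quietly = TRUE)) {
  fst_path <- tempfile(fileext = ".fst")
  fst::write_fst(dat, fst_path)
  dat_reloaded <- as_tibble(fst::read_fst(fst_path))
}
```

It is worth wrapping both the `read_csv` and `fread` versions in `system.time()` on one of your real files to see how much you actually gain before rewriting the whole import step.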

H5470