
I am on a server with 512 GB of RAM. I have an 84 GB CSV (hefty, I know). I am reading only 31 of its 79 columns; the excluded columns are all floats/decimals.

After comparing many methods, it seems the highest-performance way to do what I want is to `fread` the file. The file is 84 GB, but watching `top`, the process uses 160 GB of memory (RES), even though the eventual data.table is only about 20 GB.

I know `fread` preallocates memory, which is why it's so fast. Just wondering - is this normal, and is there a way to curb the memory consumption?


Edit: it seems that even if I ask `fread` to read only 10,000 rows (of 300 million), it will still preallocate 84 GB of memory.

grad student
  • Maybe `fread` pieces of the file at a time and combine the results in R, but if your server has more than enough RAM I don't see what the issue is. Specifying the `colClasses` might help if you aren't doing so already. – nrussell Jan 10 '16 at 18:51
  • Thanks, I'll try colClasses. The issue is just that I don't want to consume the shared resources of the server, to the extent possible. Also, the files are not guaranteed to be this pleasant. It is market data, and on certain days I imagine the data size may explode. – grad student Jan 10 '16 at 19:01

2 Answers


See R FAQ 7.42. If you want to minimize the resources you use on the server, read the CSV using `fread` once, then save the resulting object using `save` or `saveRDS`. Then read that binary file whenever you need the data.
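A minimal sketch of that workflow (file names are hypothetical; assumes the data.table package is installed, and uses `fread`'s `select` argument to read only the needed columns):

```r
library(data.table)

# Pay the fread cost once, reading only the 31 columns you need
dt <- fread("market_data.csv", select = 1:31)

# Save a compact binary copy for later sessions
saveRDS(dt, "market_data.rds")

# Subsequent sessions: cheap load of the binary file,
# with no 84 GB preallocation
dt <- readRDS("market_data.rds")
```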

Or you can use a command-line tool like `cut`, `awk`, or `sed` to select only the columns you want and write the output to another file. Then you can run `fread` on that smaller file.
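For example, with `cut` (file names and column positions here are hypothetical; note `cut` assumes a simple comma-delimited file and will not handle commas embedded inside quoted fields):

```shell
# Keep only columns 1-31, writing a much smaller file
cut -d',' -f1-31 market_data.csv > market_data_subset.csv

# Then, in R:
#   dt <- data.table::fread("market_data_subset.csv")
```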

Joshua Ulrich
  • After loading a data.table via `load` or `readRDS`, the user should call `alloc.col` on it, as it silently loses its pre-allocated columns. – jangorecki Jan 10 '16 at 22:24
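A sketch of that fix-up (file name hypothetical; `alloc.col` is the data.table function the comment refers to, spelled `setalloccol` in newer versions):

```r
library(data.table)

dt <- readRDS("market_data.rds")

# Restore the over-allocated column slots lost during serialization,
# so later := assignments modify in place instead of copying
alloc.col(dt)
```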

See http://www.r-bloggers.com/efficiency-of-importing-large-csv-files-in-r/ or the question Reading 40 GB csv file into R using bigmemory.

Maybe the bigmemory library will help you.

V. Gai
  • Thanks, but it doesn't. The issue with market data is there are a lot of string identifiers. I guess SQL imports would be the right way to go here. – grad student Jan 10 '16 at 19:42
  • @APK if you can produce a SQL insert script, then you can `fread` it using `awk`. It is documented in [data.table#878](https://github.com/Rdatatable/data.table/issues/878). – jangorecki Jan 10 '16 at 22:27