
I am on a server with 512 GB of RAM. I have an 84 GB CSV (hefty, I know). I am reading only 31 of its 79 columns; the excluded columns are all floats/decimals.

After comparing many methods, it seems the highest-performance way to do what I want is to `fread` the file. The file is 84 GB, but watching `top`, the process uses 160 GB of memory (RES), even though the eventual data.table is only about 20 GB.

I know `fread` preallocates memory, which is why it's so fast. Just wondering - is this normal, and is there a way to curb the memory consumption?


Edit: it seems that even if I ask `fread` to read only 10,000 rows (of 300 million), it will still preallocate 84 GB of memory.

grad student
  • Maybe `fread` pieces of the file at a time and combine the results in R, but if your server has more than enough RAM I don't see what the issue is. Specifying the `colClasses` might help if you aren't doing so already. – nrussell Jan 10 '16 at 18:51
  • Thanks, I'll try colClasses. The issue is just that I don't want to consume the shared resources of the server, to the extent possible. Also, the files are not guaranteed to be this pleasant. It is market data, and on certain days I imagine the data size may explode. – grad student Jan 10 '16 at 19:01

2 Answers


See R FAQ 7.42. If you want to minimize the resources you use on the server, read the CSV using `fread` once, then save the resulting object using `save` or `saveRDS`. Then read that binary file whenever you need the data.
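A minimal sketch of that workflow (file names are hypothetical; assumes the data.table package is installed, and uses `fread`'s `select` argument to read only the needed columns):

```r
library(data.table)

# Pay the fread cost once, reading only the 31 columns you need
dt <- fread("market_data.csv", select = 1:31)

# Save a compact binary copy for later sessions
saveRDS(dt, "market_data.rds")

# Subsequent sessions: cheap load of the binary file,
# with no 84 GB preallocation
dt <- readRDS("market_data.rds")
```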

Or you can use a command-line tool like `cut`, `awk`, or `sed` to select only the columns you want and write the output to another file. Then you can run `fread` on that smaller file.
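For example, with `cut` (file names and column positions here are hypothetical; note `cut` assumes a simple comma-delimited file and will not handle commas embedded inside quoted fields):

```shell
# Keep only columns 1-31, writing a much smaller file
cut -d',' -f1-31 market_data.csv > market_data_subset.csv

# Then, in R:
#   dt <- data.table::fread("market_data_subset.csv")
```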

Joshua Ulrich
  • After loading a data.table via `load` or `readRDS`, the user should call `alloc.col` on it, as it silently loses its pre-allocated columns. – jangorecki Jan 10 '16 at 22:24
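A sketch of that fix-up (file name hypothetical; `alloc.col` is the data.table function the comment refers to, spelled `setalloccol` in newer versions):

```r
library(data.table)

dt <- readRDS("market_data.rds")

# Restore the over-allocated column slots lost during serialization,
# so later := assignments modify in place instead of copying
alloc.col(dt)
```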

See http://www.r-bloggers.com/efficiency-of-importing-large-csv-files-in-r/ or the question Reading 40 GB csv file into R using bigmemory.

Maybe the bigmemory library will help you.

V. Gai
  • Thanks, but it doesn't. The issue with market data is there are a lot of string identifiers. I guess SQL imports would be the right way to go here. – grad student Jan 10 '16 at 19:42
  • @APK if you can produce a SQL insert script, then you can `fread` it using `awk`. It is documented in [data.table#878](https://github.com/Rdatatable/data.table/issues/878). – jangorecki Jan 10 '16 at 22:27