
I have a huge txt file with over 600 million rows and around 27 GB. I used fread from data.table on a server with 256 GB of RAM and 32 processors. It took around 3.5 hours to read just 10% of the data, so reading the whole table would take around 35 hours on this server. What is a faster way to read such a big dataset? 1) Should I split it into multiple smaller files first and read those in? 2) Does fread make use of multiple cores?

Any suggestions and comments are appreciated!

user1786747
    May I ask what you plan to do with this data once you have loaded it into R? Even on a server with 256GB RAM, manipulating your entire 27GB data set could be pushing the limits. – Tim Biegeleisen Dec 15 '15 at 02:30
    Have you considered using a database option, such as `RMySQL`? – Vedda Dec 15 '15 at 02:34
    Provide more details: 3.5h using fread? If you are able to split the file into a few pieces, you can try to use fread in parallel via `Rserve`; that is basically what I did in the [big.data.table](https://github.com/jangorecki/big.data.table) package, which also allows processing to be split across multiple nodes/machines in parallel. You can get the same effect without the package using just `Rserve` + `RSclient` + `lapply`. – jangorecki Dec 15 '15 at 03:12
    Are you sure that you are not running into a hardware i/o limitation? `fread` can read 2.7 GB of data almost instantaneously. – Roland Dec 15 '15 at 08:16
    Not enough information - I've read in 50 GB files in a matter of minutes before, so something is fishy about the OP. Maybe add a small sample of your data. – eddi Dec 15 '15 at 16:46
    @Roland hardware i/o? The only scenario I can imagine is network storage on a very slow network - and that can be easily tested by doing something like `wc -l filename` (see the timing sketch after these comments). – eddi Dec 15 '15 at 16:49
  • @eddi That's exactly what I'm suspecting. – Roland Dec 16 '15 at 07:56
  • 600M? What do these rows contain, Facebook's passwords? :P – Lefteris008 Mar 24 '19 at 07:31
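
As a quick way to run the raw-I/O check suggested in the comments above, here is a minimal sketch in R for timing a partial scan of the file (the file name is a placeholder and the byte count is arbitrary):

# time a raw pass over the first ~10 million lines; if even this is slow,
# the bottleneck is the disk / network storage, not fread's parsing
system.time(invisible(readLines("bigfile.txt", n = 1e7)))

# or estimate raw throughput by reading a fixed number of bytes
con <- file("bigfile.txt", "rb")
system.time(invisible(readBin(con, what = "raw", n = 1e9)))  # roughly 1 GB
close(con)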

3 Answers

We need to see a sample of the data and also the output of fread(..., verbose=TRUE), as requested in point 7 on the support page. I agree with the comments that it feels like something is wrong. It is possible for R's hash algorithm to be defeated and for data to never load (using any method), as demonstrated by Pall Melsted here. Perhaps something like that is happening if you have dense hex keys in a file that doesn't look like real-world data.
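
For reference, the verbose output being asked for comes from a call like the following minimal sketch (the file name is a placeholder):

library(data.table)
# verbose = TRUE prints a step-by-step timing breakdown of the read,
# which shows where the hours are actually going
DT <- fread("bigfile.txt", verbose = TRUE)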

To answer the question (if H2O is acceptable): my testing on a 23 GB .csv with 9 columns and 500 million rows showed h2o.importFile to take 50s compared to fread at 300s, i.e. 6 times faster. See slides 18-20 on the Presentations page, or the direct link here. H2O feels like R (or Python) but has a Java backend rather than a C backend. h2o.importFile() is parallel and distributed, and it compresses the data in memory too.
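
A minimal sketch of that workflow (the cluster settings and file path below are illustrative assumptions, not part of the benchmark above):

library(h2o)
# start a local H2O cluster; nthreads = -1 uses all available cores,
# and max_mem_size should be sized to the server (assumed here)
h2o.init(nthreads = -1, max_mem_size = "200g")
# parallel, distributed import; returns an H2OFrame held compressed in the cluster
hf <- h2o.importFile(path = "bigfile.csv")
dim(hf)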

When PayPal tested on their data they found h2o.importFile() to be 10 times faster than fread(). See 19:00-20:00 in this video.

Other parallel solutions that I know of are the SparkR package (see e.g. this question on reading files), which I haven't yet tested or timed, and @jangorecki's new big.data.table package, which looks very promising.
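
For completeness, here is a rough sketch of the split-the-file-and-read-in-parallel idea from the question and comments; it is not the API of either package above, and the chunk layout, directory name and core count are assumptions:

library(data.table)
library(parallel)

# assume the 27 GB file has already been split on row boundaries
# (e.g. with the shell's split utility) into chunks/chunk_000.txt, chunk_001.txt, ...
files <- list.files("chunks", pattern = "^chunk_.*\\.txt$", full.names = TRUE)

# read the chunks on forked workers (Linux/macOS) and bind the pieces;
# this only helps if disk throughput, not parsing, still has headroom
DT <- rbindlist(mclapply(files, fread, mc.cores = 16))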

Matt Dowle

I'm sure there is a thread-safe variant of fread, but remember: to get safety you give away speed.

Reading using multiple cores is a waste of time, as is reading the records one by one:

if (fread(buffer, reclen, 600000000, infile) != 600000000)
    /* handle a short read or error */ ;

This variant performs 600000000 logical reads (reads from internal buffers) and not quite as many physical reads (actual disk reads). If the records are all of equal length, you can do it in a single call:

if (fread(buffer, (size_t)reclen * 600000000, 1, infile) != 1)
    /* handle a short read or error */ ;

This saves an enormous amount of per-call overhead; a compromise saves most of it too:

size_t i = 0;
while (i < 600000000)
{
    /* read 1000 fixed-length records per call into the correct offset */
    if (fread(buffer + i * reclen, (size_t)reclen * 1000, 1, infile) != 1)
        break;    /* short read or error */
    i += 1000;
}

This compromise cuts the per-call overhead by 99.9%, but you will still need to loop 600000 times. Read 10000 records at a time? 60000? Experiment!

Olof Forshell

It's worth considering the newer vroom package (this answer was written in 2021). Its web page claims the following benchmarks:

| package    | version | time (sec) | speedup | throughput    |
|------------|---------|------------|---------|---------------|
| vroom      | 1.3.0   | 1.11       | 67.13   | 1.48 GB/sec   |
| data.table | 1.13.0  | 13.12      | 5.67    | 125.19 MB/sec |
| readr      | 1.3.1   | 32.57      | 2.28    | 50.41 MB/sec  |
| read.delim | 4.0.2   | 74.37      | 1.00    | 22.08 MB/sec  |

(Although the comments on the original post suggest that the R package machinery may not be the bottleneck, since reading 2.7 GB with data.table::fread() should have been fast even in 2015.)
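
A minimal sketch of reading a large delimited file with vroom (the file name and delimiter are assumptions; vroom indexes the file quickly and materialises columns lazily on first use):

library(vroom)
# multi-threaded, lazy read; adjust delim to match the file
df <- vroom("bigfile.txt", delim = "\t")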

Ben Bolker