I wanted to know whether there is a limit to the number of rows that can be read by data.table's fread function. I am working with a table of 4 billion rows and 4 columns, about 40 GB. It appears that fread reads only the first ~840 million rows. It gives no error but returns to the R prompt as if it had read all the data!
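For reference, the call is essentially the following (a minimal sketch; the file name is a placeholder, and I am assuming a headerless comma-separated file):

    library(data.table)
    dt <- fread("file.csv", sep = ",", header = FALSE)
    nrow(dt)  # reports ~840 million rows, not 4 billion, with no error or warning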
I understand that fread is not intended for "prod use" at the moment, and I wanted to find out whether there is a timeframe for a production release.
The reason I am using data.table is that, for files of this size, it processes the data far more efficiently than loading the file into a data.frame, etc.
At the moment, I am trying two other alternatives:
1) Using scan and passing the result to a data.table. (Note that what = integer() supplies an integer prototype; what = "integer" would read everything as character. byrow = TRUE preserves the row-major layout of the CSV.)

    library(data.table)
    data.table(matrix(scan("file.csv", what = integer(), sep = ","), ncol = 4, byrow = TRUE))
This resulted in:

    Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, :
      too many items
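The error appears to come from trying to hold all 16 billion values (4 billion rows x 4 columns) in a single vector. A chunked variant that avoids this might look like the following untested sketch; successive scan calls on an open connection resume where the previous one stopped, and the file name and chunk size are placeholders:

    library(data.table)
    con <- file("file.csv", open = "r")
    repeat {
      # each scan picks up where the previous one left off on the open connection
      chunk <- scan(con, what = integer(), sep = ",", nlines = 1e8, quiet = TRUE)
      if (length(chunk) == 0) break
      dt <- data.table(matrix(chunk, ncol = 4, byrow = TRUE))
      # ... process dt here before reading the next chunk ...
    }
    close(con)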
2) Breaking the file up into multiple segments of approximately 500 million rows each with Unix split, then looping over the pieces with fread (sketched below) - a bit cumbersome, but it appears to be the only workable solution.
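The loop is roughly as follows, assuming the pieces produced by split are named chunk_aa, chunk_ab, and so on (the names and chunk size are placeholders):

    library(data.table)
    # beforehand, on the shell:  split -l 500000000 file.csv chunk_
    files <- sort(list.files(pattern = "^chunk_"))
    for (f in files) {
      dt <- fread(f, sep = ",", header = FALSE)
      # ... process each ~500-million-row piece before moving on ...
    }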
I suspect there may be an Rcpp-based way to do this even faster, but I am not sure how that is generally implemented.
Thanks in advance.