
I wanted to know if there is a limit to the number of rows that can be read using the data.table fread function. I am working with a table of 4 billion rows and 4 columns, about 40 GB in size. It appears that fread reads only the first ~840 million rows. It does not give any errors but returns to the R prompt as if it had read all the data!

I understand that fread is not for "prod use" at the moment, and wanted to find out if there is any timeframe for a production release.

The reason I am using data.table is that, for files of this size, it is extremely efficient at processing the data compared to loading the file into a data.frame, etc.

At the moment, I am trying two other alternatives:

1) Using scan and passing on to a data.table

data.table(matrix(scan("file.csv",what="integer",sep=","),ncol=4))

Resulted in --
Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings,  :
  too many items

2) Breaking the file up into multiple segments of approx. 500 million rows each using Unix split, then looping over the resulting files with fread - a bit cumbersome, but it appears to be the only workable solution (a rough sketch follows below).
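For reference, a rough sketch of what alternative 2) might look like; the split command and the chunk_ file prefix are purely illustrative, not my actual file names:

# Rough sketch only -- assumes the file was pre-split at the shell with
# something like:  split -l 500000000 file.csv chunk_
library(data.table)

chunk_files <- sort(list.files(pattern = "^chunk_"))   # illustrative prefix
pieces <- vector("list", length(chunk_files))

for (i in seq_along(chunk_files)) {
  pieces[[i]] <- fread(chunk_files[i], header = FALSE)
  message(sprintf("Read %s: %d rows", chunk_files[i], nrow(pieces[[i]])))
}

# Each piece would typically be processed/reduced before (or instead of)
# combining, since a single 4-billion-row table may hit R's vector limits:
# DT <- rbindlist(pieces)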

I think there may be an Rcpp way to do this even faster, but I am not sure how it is generally implemented.

Thanks in advance.

xbsd
  • Make sure there is nothing unusual in your file in the last line that was read or the line after it, and then [submit](https://r-forge.r-project.org/tracker/?group_id=240) a bug report or contact the package maintainer. – Roland Jul 11 '13 at 14:49
  • Are you sure you have enough RAM? And are you working with 64bit R? – eddi Jul 11 '13 at 17:11
  • No, there is not necessarily a faster way with Rcpp as Matt already uses mmap. Check your OS documentation for limits on the mmap call. Billions may be pushing it... – Dirk Eddelbuettel Jul 11 '13 at 22:59
  • 64-bit R with several hundred GB of RAM. I was able to finally complete the task using a combination of foreach dopar and mclapply, essentially splitting the file into smaller files of 500 M rows each with Unix split (very fast), then fread-ing the individual files into a collector-type list, and thereafter processing each chunk using standard data.table operations. Total time to read the full 40 GB / all 4 billion rows was 10 minutes. Will post more details shortly ... – xbsd Jul 12 '13 at 00:14
  • Re: faster ways to read the file ... In my experience such tasks could be completed in just a few minutes with native KDB+ (kx.com). Given it's written in C (as far as I know), I have wondered if we could achieve the same speeds in R ... KDB also mmaps the files, but uses some super-optimized code (the entire db binary is about 1 MB!). A bit off-topic, but interesting nonetheless. – xbsd Jul 12 '13 at 00:26
  • FWIW, R 3.0 limits dimensions of matrices / arrays to a max 2^31 elements in each dimension; 4 billion rows is beyond that 2^31 limit and hence solution 1) isn't viable. That said, I wonder if `scan` and friends have been modified to accommodate new inputs that go beyond old R vector limits. – Kevin Ushey Jul 12 '13 at 05:43
  • @CauchyDistributedRV: R _before_ 3.0.0 was limited to 2^31 - 1; R 3.0.0 later moved that limit by switching to indexing via doubles. See the NEWS file for more, but as I recall it is now 2^35 - 1. That said, your point is still valid. – Dirk Eddelbuettel Jul 12 '13 at 06:02
  • Thanks everyone for the feedback! – xbsd Jul 12 '13 at 14:26
  • @DirkEddelbuettel: Shouldn't the 53 bits of double mantissa give a 2^53 - 1 "reach"? – IRTFM Sep 15 '13 at 06:38

1 Answer


I was able to accomplish this using feedback from another post on Stack Overflow. The process was very fast: all 40 GB of data were read in about 10 minutes by calling fread iteratively. Foreach %dopar% failed when used on its own to read the files into new data.tables sequentially, due to limitations that are also mentioned in the post linked below.

Note: The file list (file_map) was prepared by simply running --

file_map <- list.files(pattern="test.$")  # Replace pattern to suit your requirement

mclapply with big objects - "serialization is too large to store in a raw vector"

Quoting --

collector <- vector("list", length(file_map))  # more complex than normal for speed

for (index in 1:length(file_map)) {
  reduced_set <- mclapply(file_map[[index]], function(x) {
    on.exit(message(sprintf("Completed: %s", x)))
    message(sprintf("Started: '%s'", x))
    fread(x)             # <----- CHANGED THIS LINE to fread
  }, mc.cores=10)
  collector[[index]] <- reduced_set
}

# Additional lines (in place of rbind as in the URL above)

finalList <- data.table()  # initialise the collector for the combined result
for (i in 1:length(collector)) {
  finalList <- rbindlist(list(finalList, yourFunction(collector[[i]][[1]])))
}
# Replace yourFunction as needed; in my case it was an operation performed on each segment, with the results joined via rbindlist at the end.

My function included a loop using foreach %dopar% that executed across several cores per file, as specified in file_map. This allowed me to use dopar without encountering the "serialization is too large" error that occurs when running on the combined file. A hypothetical sketch of such a per-chunk function follows below.
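For illustration only, here is a rough sketch of the kind of per-chunk function described above; the grouping column, the aggregation and the core count are placeholders (assumptions), not the actual logic I used:

library(data.table)
library(foreach)
library(doParallel)

registerDoParallel(cores = 10)   # assumes a Unix, fork-capable backend

yourFunction <- function(DT) {
  # fread names headerless columns V1..V4 by default (placeholder assumption)
  keys <- unique(DT$V1)
  # Parallelising *within* one chunk keeps each result returned by a worker
  # small, which is what avoids the "serialization is too large" error.
  foreach(k = keys, .combine = rbind, .packages = "data.table") %dopar% {
    DT[V1 == k, .(total = sum(V4))]   # placeholder per-group aggregation
  }
}

With a definition along these lines, the rbindlist loop above collects the per-chunk results into finalList.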

Another helpful post is at -- loading files in parallel not working with foreach + data.table
