
I'm trying to import a very large dataset (101 GB) from a text file using read.table.ffdf in the ff package. The dataset has more than 285 million records, but I am only able to read in the first 169,457,332 rows. The dataset is tab-separated with 44 variable-width columns. I've searched Stack Overflow and other message boards and have tried many fixes, but every attempt stops at exactly the same number of records.

Here's my code:

relFeb2016.test <- read.table.ffdf(
  x = NULL, file = "D:/eBird/ebd_relFeb-2016.txt", fileEncoding = "",
  nrows = -1, first.rows = NULL, next.rows = NULL,
  header = TRUE, sep = "\t", skipNul = TRUE, fill = TRUE,
  quote = "", comment.char = "", na.strings = "",
  levels = NULL, appendLevels = TRUE,
  strip.white = TRUE, blank.lines.skip = FALSE,
  FUN = "read.table", transFUN = NULL, asffdf_args = list(),
  BATCHBYTES = getOption("ffbatchbytes"), VERBOSE = FALSE,
  colClasses = c("factor","numeric","factor","factor","factor","factor","factor",
                 "factor","factor","factor","factor","factor","factor","factor",
                 "factor","factor","factor","factor","factor","factor","factor",
                 "factor","numeric","numeric","Date","factor","factor","factor",
                 "factor","factor","factor","factor","factor","factor","factor",
                 "numeric","numeric","numeric","factor","factor","numeric",
                 "factor","factor"))

Here's what I've tried:

  1. Added skipNul=TRUE to bypass null characters that I know exist in the data (one way to scan for them is sketched after this list).

  2. Added quote="" and comment.char="" to bypass quote marks, pound signs, and other characters that I know exist in the data.

  3. Added na.strings="" and fill=TRUE because many fields are left blank.

  4. Tried reading it in with UTF-16 encoding (encoding="UTF-16LE") in case the special characters were still a problem, though EmEditor reports it as UTF-8 unsigned.

  5. More than tripled my memory limit from ~130,000 MB using memory.limit(size=500000).
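One way to check for the null characters mentioned in item 1 (and for stray DOS end-of-file bytes, 0x1A, which can also cut a text-mode read short on Windows) is to scan the raw file in binary chunks. This is only a sketch, not one of my original attempts; it assumes the same file path as above and an arbitrary 100 MB chunk size:

path <- "D:/eBird/ebd_relFeb-2016.txt"
con <- file(path, "rb")
chunk <- 100 * 1024^2              # read 100 MB of raw bytes at a time
offset <- 0
repeat {
  bytes <- readBin(con, what = "raw", n = chunk)
  if (length(bytes) == 0) break
  bad <- which(bytes == as.raw(0x00) | bytes == as.raw(0x1a))
  if (length(bad) > 0)
    cat(sprintf("%d suspicious byte(s), first near offset %.0f\n",
                length(bad), offset + bad[1]))
  offset <- offset + length(bytes)
}
close(con)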

Here's what I've ruled out:

  1. My data is not fixed-width, so I can't use laf_open_fwf in the LaF package, which solved a similar problem described here: http://r.789695.n4.nabble.com/read-table-ffdf-and-fixed-width-files-td4673220.html

  2. I can't use bigmemory because my data includes a variety of data types (factor, date, integer, numeric)

  3. There's nothing special about that last imported record that should cause the import to abort

  4. Because it consistently reads in the same number of records each time, and it's always a block of the first 169+ million records, I don't think the problem can be traced to special characters, which occur earlier in the file.

Is there an upper limit on the number of records that can be imported using read.table.ffdf? Can anyone recommend an alternative solution? Thanks!

ETA: No error messages are returned. I'm working on a server running Windows Server 2012 R2, with 128GB RAM and >1TB available on the drive.

Michel
  • So no error message? Your post does not include OS information. – IRTFM May 01 '16 at 23:43
  • Thanks 42. I've edited my question. No, there are no error messages, and I'm working on a Windows server. – Michel May 02 '16 at 01:31
  • The theoretical upper limit on the number of records is probably `.Machine$integer.max` (2147483647), but if you are getting no errors I would be more suspicious of an extraneous end-of-file mark. I don't think it would be wise to set a memory limit higher than your addressable RAM, but I'm not a Windows user. I'd also be tempted to try setting colClasses to "character". – IRTFM May 02 '16 at 04:33
  • What `read.table.ffdf` does is repeatedly call `read.table` and append the results to the `ffdf`. You could simulate the reading of the file, without creating the `ffdf`, to see where and whether an error occurs. Something like: `con <- file("yourfile.csv", "rt"); while (TRUE) { d <- read.table(con, colClasses = ..., nrows = 1E6); if (nrow(d) == 0) break; # check d; e.g. count rows, check number of columns, column types, etc. }; close(con)` – Jan van der Laan May 02 '16 at 08:21
  • Hope you can read the code above :-). `LaF` also has a `laf_open_csv` that you might want to try. However, like @42-, I suspect that there is some error such as an EOF byte somewhere in your file, in which case `laf_open_csv` will also fail at the same line. – Jan van der Laan May 02 '16 at 08:24
  • Thanks everyone for your help! There were additional nul characters that ff was interpreting as eof bytes. I removed them and was able to import the complete file. – Michel May 11 '16 at 16:19
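The comments don't show how the stray bytes were removed; below is a minimal sketch of one way to do it from R, streaming the file through a raw-byte filter that drops NUL (0x00) and DOS end-of-file (0x1A) bytes. The output file name is hypothetical, and the same cleanup could equally be done in EmEditor or a similar tool before re-running read.table.ffdf on the cleaned file.

infile  <- "D:/eBird/ebd_relFeb-2016.txt"
outfile <- "D:/eBird/ebd_relFeb-2016_clean.txt"   # hypothetical output name
nul <- as.raw(0x00)                               # embedded null character
eof <- as.raw(0x1a)                               # DOS end-of-file byte

in_con  <- file(infile, "rb")
out_con <- file(outfile, "wb")
repeat {
  bytes <- readBin(in_con, what = "raw", n = 100 * 1024^2)   # 100 MB chunks
  if (length(bytes) == 0) break
  writeBin(bytes[bytes != nul & bytes != eof], out_con)      # drop bad bytes
}
close(in_con)
close(out_con)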

0 Answers