
With a large file (1GB) created by saving a large data.frame (or data.table), is it possible to very quickly load a small subset of rows from that file?

(Extra for clarity: I mean something as fast as mmap, i.e. the runtime should be approximately proportional to the amount of memory extracted, but constant in the size of the total dataset. "Skipping data" should have essentially zero cost. This can be very easy, or impossible, or something in between, depending on the serialization format.)

I hope that the R serialization format makes it easy to skip forward through the file to the relevant portions of the file.

Am I right in assuming that this would be impossible with a compressed file, simply because gzip requires uncompressing everything from the beginning?

 saveRDS(object, file = "", ascii = FALSE, version = NULL,
         compress = TRUE, refhook = NULL)

But I'm hoping binary (ascii=F) uncompressed (compress=F) might allow something like this. Use mmap on the file, then quickly skip to the rows and columns of interest?
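
For instance (a minimal sketch; the object and file names here are placeholders):

# write an uncompressed, binary serialization of a hypothetical big_df
saveRDS(big_df, file = "big.rds", ascii = FALSE, compress = FALSE)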

I'm hoping it has already been done, or there is another format (reasonably space efficient) that allows this and is well-supported in R.

I've used things like gdbm (from Python) and even implemented a custom system in Rcpp for a specific data structure, but I'm not satisfied with any of this.

After posting this, I worked a bit with the package ff (CRAN) and am very impressed with it (not much support for character vectors though).

Aaron McDaid

1 Answer


Am I right in assuming that this would be impossible with a compressed file, simply because gzip requires uncompressing everything from the beginning?

Indeed. For a short explanation, let's take a dummy compression method as a starting point:

Given the string AAAAVVBABBBC, gzip would do something like 4A2VBA3BC.

Obviously you can't extract all the As from the file without reading it all, as you can't know whether there is an A at the end or not.
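
A toy run-length encoding in R makes the same point (this is only an illustration, not what gzip actually does; R's rle() keeps the 1s that the hand-written example above drops):

x <- strsplit("AAAAVVBABBBC", "")[[1]]
r <- rle(x)                                   # runs: 4xA, 2xV, 1xB, 1xA, 3xB, 1xC
paste0(r$lengths, r$values, collapse = "")    # "4A2V1B1A3B1C"

To recover only the As, you still have to decode every run before them, because each run's position depends on all the preceding lengths.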

For the other question, "loading part of a saved file", I can't see a solution off the top of my head. Using write.csv and read.csv (or fwrite and fread from the data.table package) with the skip and nrows parameters could be an alternative, as sketched below.
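
A rough sketch of that alternative (big_dt is a hypothetical data.table; note that skip counts raw lines, so the header line goes with them, and fread still scans past the skipped lines rather than seeking over them):

library(data.table)
fwrite(big_dt, "big.csv")

# read only data rows 1,000,000 to 1,000,999
sub <- fread("big.csv", skip = 1e6, nrows = 1000, header = FALSE)
setnames(sub, names(big_dt))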

In any case, using any such function on a file that has already been read means loading the whole file into memory before filtering, which is no faster than reading the file and then subsetting from memory.

You could craft something in Rcpp, taking advantage of streams to read data without loading it all into memory, but reading and parsing each entry before deciding whether to keep it won't give you much better throughput.

saveRDS will save a serialized version of the data; for example:

> myvector <- c("1","2","3")
> serialize(myvector,NULL)
 [1] 58 0a 00 00 00 02 00 03 02 03 00 02 03 00 00 00 00 10 00 00 00 03 00 04 00 09 00 00 00 01 31 00 04 00 09 00 00 00 01 32 00 04 00 09 00 00
[47] 00 01 33

It is of course parsable, but it means reading byte by byte according to the format.
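
For instance, picking values out of the raw bytes by hand (the offsets follow the dump above; serialize uses big-endian XDR encoding by default, hence endian = "big"):

raw <- serialize(myvector, NULL)
# bytes 15-18: the SEXP type/flags word (0x10 = 16 = STRSXP, a character vector)
readBin(raw[15:18], "integer", size = 4, endian = "big")
# bytes 19-22: the vector length
readBin(raw[19:22], "integer", size = 4, endian = "big")   # 3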

On the other hand, you could write the data as csv (or with write.table for more complex data) and use an external tool before reading it back, something along these lines:

z <- tempfile()
write.table(df, z, row.names = FALSE)
shortdf <- read.table(text = system(paste("awk 'NR > 5 && NR < 10 { print }'", z), intern = TRUE))

You'll need a linux system with awk, which is able to parse millions of lines in a few milliseconds, or a windows-compiled version of awk, obviously.

The main advantage is that awk is able to filter each line of data on a regex or some other condition.
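
For example, reusing the z file from above (the pattern is just a placeholder, and this assumes awk is on the PATH; note write.table quotes character fields, so anchored patterns would need to account for the quotes):

# keep only lines whose first column matches a regex, before R ever parses them
keep <- system(paste("awk '$1 ~ /Whatever/ { print }'", z), intern = TRUE)
shortdf <- read.table(text = keep)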

As a complement, for the data.frame case: a data.frame is more or less a list of vectors (in the simple case), and this list will be saved sequentially. So if we have a data.frame like:

> str(ex)
'data.frame':   3 obs. of  2 variables:
 $ a: chr  "one" "five" "Whatever"
 $ b: num  1 2 3

Its serialization is:

> serialize(ex,NULL)
  [1] 58 0a 00 00 00 02 00 03 02 03 00 02 03 00 00 00 03 13 00 00 00 02 00 00 00 10 00 00 00 03 00 04 00 09 00 00 00 03 6f 6e 65 00 04 00 09 00
 [47] 00 00 04 66 69 76 65 00 04 00 09 00 00 00 08 57 68 61 74 65 76 65 72 00 00 00 0e 00 00 00 03 3f f0 00 00 00 00 00 00 40 00 00 00 00 00 00
 [93] 00 40 08 00 00 00 00 00 00 00 00 04 02 00 00 00 01 00 04 00 09 00 00 00 05 6e 61 6d 65 73 00 00 00 10 00 00 00 02 00 04 00 09 00 00 00 01
[139] 61 00 04 00 09 00 00 00 01 62 00 00 04 02 00 00 00 01 00 04 00 09 00 00 00 09 72 6f 77 2e 6e 61 6d 65 73 00 00 00 0d 00 00 00 02 80 00 00
[185] 00 ff ff ff fd 00 00 04 02 00 00 00 01 00 04 00 09 00 00 00 05 63 6c 61 73 73 00 00 00 10 00 00 00 01 00 04 00 09 00 00 00 0a 64 61 74 61
[231] 2e 66 72 61 6d 65 00 00 00 fe

Translated to ascii for an idea:

X
    one five    Whatever?ð@@    names   a   b       row.names
ÿÿÿý    class   
data.frameþ

We have the header of the file, then the header of the list, then each vector composing the list. As we have no clue how much space the character vector will take, we can't skip to arbitrary data; we have to parse each header (the bytes just before the text data give its length). Even worse, to then get the corresponding integers, we have to reach the integer vector header, whose position can't be determined without parsing each character header and summing their lengths.
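
A back-of-the-envelope illustration of why that offset cannot be precomputed (the per-string header size here is an assumed constant for illustration, not the exact RDS layout):

# where column `b` starts depends on the total size of column `a`,
# which you only know after reading every string header in `a`
strings <- c("one", "five", "Whatever")
per_string_header <- 8                         # assumption, for illustration only
bytes_to_skip <- sum(nchar(strings) + per_string_header)
bytes_to_skip                                  # data-dependent, so no constant-time seek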

So in my opinion, crafting something is possible, but it will probably not be much quicker than reading the whole object, and it will be brittle with respect to the save format (R already has 3 formats for saving objects).

Some reference here

The same serialize output viewed in ASCII format (more readable, to see how it is organized):

> write(rawToChar(serialize(ex,NULL,ascii=TRUE)),"")
A
2
197123
131840
787
2
16
3
262153
3
one
262153
4
five
262153
8
Whatever
14
3
1
2
3
1026
1
262153
5
names
16
2
262153
1
a
262153
1
b
1026
1
262153
9
row.names
13
2
NA
-3
1026
1
262153
5
class
16
1
262153
10
data.frame
254
Tensibai
  • *"but reading and parsing each entry before deciding if it should be kept or not won't give you a real better throughput."* It is not necessary to read everything. `fseek` can skip over arbitrarily large pieces of data in constant time. The real question is whether the format allows us to know the exact size (on disk) of a sub-data-structure that we wish to ignore. – Aaron McDaid Jul 16 '16 at 07:08
  • @AaronMcDaid Not really, the format is sequential. Reading a data.frame is more or less reading a list, the code is [here](https://github.com/wch/r-source/blob/73e11b7c40d3630604855e8eee3d1f309e2c9a57/src/main/serialize.c#L1611-L1645) if you want an idea of what I mean. In brief what I mean is that you can't really 'skip' N rows, because you'll have to do multiple fseeks for each row. I'll add some details on this in the answer. – Tensibai Jul 18 '16 at 08:10
  • 1
    Thanks for the clarification. I'll read your answer again. Actually, I've just finished writing my own code to solve this, storing the columns of a data.frame in a series of `bigmemory` objects [(bigmemory on CRAN)](https://cran.r-project.org/web/packages/bigmemory/index.html). This allows arbitrary seeking to any row. I had to take care to store `character` vectors in a special way, but it's working now. – Aaron McDaid Jul 18 '16 at 11:50
  • @AaronMcDaid You should try benchmarking it with `microbenchmark` to see if it really brings improvement. The best options in term of speed I know of are `fread` and `fwrite` from package `data.table` which does some kind of parallel processing. I don't see how using `bigmemory` could improve the loading speed as the whole object would be loaded in memory from disk, not saving any IO at all. – Tensibai Jul 18 '16 at 12:16
  • 1
    `filebacked.big.matrix` is stored on disk, and is therefore loaded up with `mmap` in essentially no time. I can access arbitrary rows within seconds from a fresh start of R in a 10GB file. The whole object is not "loaded" as such, merely mapped into virtual memory – Aaron McDaid Jul 18 '16 at 16:19
  • Interesting point, I'll have a look at this, would you mind sharing a simplified version of your approach to benchmark it against others ? – Tensibai Jul 18 '16 at 21:17
  • @Tensibai, yes. I've written [a short blog post about it](http://www.aaronmcdaid.com/2016/07/load-huge-data-files-into-r-instantly.html), with a link to my code. – Aaron McDaid Jul 20 '16 at 09:46
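
For reference, a minimal sketch of the filebacked.big.matrix approach discussed in the comments above (numeric columns only; file names are placeholders, and the actual code behind the blog post may differ):

library(bigmemory)

# create a file-backed matrix: the data lives on disk, not in RAM
m <- filebacked.big.matrix(nrow = 1e7, ncol = 5, type = "double",
                           backingfile = "cols.bin",
                           descriptorfile = "cols.desc")
m[, 1] <- rnorm(1e7)            # writes go through to the backing file

# later, in a fresh R session: attaching maps the file instead of reading it
m2 <- attach.big.matrix("cols.desc")
m2[1234567, ]                   # arbitrary row access, roughly constant time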