
Let me explain the question:

I know the functions `table` and `xtabs` compute contingency tables, but they expect a data.frame, which is always stored in RAM. That's really painful when trying to do this on a big file (say 20 GB, the largest I have to tackle).

On the other hand, SAS is perfectly able to do this, because it reads the file line by line and updates the result as it goes. Hence there is only ever one line in RAM, which is much more acceptable.

I have done the same thing with ad hoc Python programs on occasion, when I had to do more complicated things that I either didn't know how to do in SAS or found too cumbersome there. Python's syntax and built-in features (dictionaries, regular expressions...) compensate for its weaknesses (mainly speed, but when reading 20 GB, speed is limited by the hard drive anyway).

My question, then: I would like to know if there are packages to do this in R. I know it's possible to read a file line by line, like I do in Python, but computing simple statistics (contingency tables for instance) on a big file is such a basic task that I feel there should be some more or less "integrated" feature to do it in a statistical package.
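
For what it's worth, the chunk-by-chunk approach can be written in base R too, it just isn't packaged as a one-liner. Here is a rough sketch of what I mean, assuming a comma-separated file called big_file.csv with two categorical columns Type and Subj (the file name and column names are made up for illustration):

#read the file in chunks and accumulate the cell counts, so only one chunk is ever in RAM
con <- file("big_file.csv", open = "r")
header <- strsplit(readLines(con, n = 1), ",")[[1]]   #read the header once

counts <- integer(0)                                  #running cell counts, keyed by "Type|Subj"
repeat {
  chunk <- tryCatch(
    read.table(con, sep = ",", nrows = 1e5,
               col.names = header, colClasses = "character"),
    error = function(e) NULL)                         #read.table errors once the file is exhausted
  if (is.null(chunk) || nrow(chunk) == 0) break
  t_chunk <- table(paste(chunk$Type, chunk$Subj, sep = "|"))  #assumes "|" never occurs in the values
  for (k in names(t_chunk))                           #add this chunk's counts to the running totals
    counts[k] <- sum(counts[k], t_chunk[[k]], na.rm = TRUE)
}
close(con)

#reshape the flat counts back into a Type x Subj contingency table
parts <- do.call(rbind, strsplit(names(counts), "|", fixed = TRUE))
xtabs(counts ~ Type + Subj,
      data = data.frame(Type = parts[, 1], Subj = parts[, 2], counts = counts))

But that is exactly the kind of boilerplate I was hoping a package would take care of.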

Please tell me if this question should be asked on "Cross Validated" instead. I wasn't sure, since it's more about software than statistics.

  • There are packages designed to store objects on the hard drive, like `R.huge` or `bigmemory`. Matthew Keller describes various solutions neatly on [his website](http://www.matthewckeller.com/html/memory.html). – Konrad Apr 26 '15 at 10:06
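
To give an idea of what the `bigmemory` route mentioned in this comment looks like, here is a minimal sketch (the file name is a placeholder, and note that a big.matrix holds only numeric data, so the categorical variables would already have to be coded as integers):

#file-backed big.matrix: the data stay on disk, only pages are mapped into RAM
library(bigmemory)
library(bigtabulate)

x <- read.big.matrix("big_file.csv", sep = ",", header = TRUE,
                     type = "integer",    #assumes the categorical columns are integer-coded
                     backingfile = "big_file.bin",
                     descriptorfile = "big_file.desc")

#cross-tabulate the first two columns without pulling the whole file into RAM
bigtable(x, ccols = c(1, 2))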

1 Answer


You can use the ff package for this. It uses the hard disk drive instead of RAM, but it is implemented in a way that it isn't (significantly) slower than the normal way R works in RAM.

This is from the package description:

The ff package provides data structures that are stored on disk but behave (almost) as if they were in RAM by transparently mapping only a section (pagesize) in main memory.

I think this will solve your problem of loading a 20 GB file into RAM. I have used it myself for this purpose and it worked great.

Here is a small example as well, taken from the example in the `xtabs` documentation:

Base R

#example from ?xtabs
d.ergo <- data.frame(Type = paste0("T", rep(1:4, 9*4)),
                     Subj = gl(9, 4, 36*4))
> print(xtabs(~ Type + Subj, data = d.ergo)) # 4 replicates each
    Subj
Type 1 2 3 4 5 6 7 8 9
  T1 4 4 4 4 4 4 4 4 4
  T2 4 4 4 4 4 4 4 4 4
  T3 4 4 4 4 4 4 4 4 4
  T4 4 4 4 4 4 4 4 4 4

ff package

#load ff and convert the data.frame to an on-disk ffdf
library(ff)
d.ergoff <- as.ffdf(d.ergo)

> print(xtabs(~ Type + Subj, data = d.ergoff)) # 4 replicates each
    Subj
Type 1 2 3 4 5 6 7 8 9
  T1 4 4 4 4 4 4 4 4 4
  T2 4 4 4 4 4 4 4 4 4
  T3 4 4 4 4 4 4 4 4 4
  T4 4 4 4 4 4 4 4 4 4
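
For a file that doesn't fit in RAM you wouldn't build the data.frame first, of course; ff can read the file chunk-wise into an ffdf directly, e.g. with read.csv.ffdf. A minimal sketch, assuming a CSV called big_file.csv with factor columns Type and Subj (both names are made up):

#read the CSV in chunks of 100,000 rows straight into an on-disk ffdf,
#so the full 20 GB never has to sit in RAM at once
library(ff)
big <- read.csv.ffdf(file = "big_file.csv", header = TRUE,
                     next.rows = 100000,
                     colClasses = c(Type = "factor", Subj = "factor"))

#xtabs then works on the ffdf just like in the example above
xtabs(~ Type + Subj, data = big)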

You can check here for more information on memory manipulation.

LyzandeR