
The title is fairly self-explanatory, but I will elaborate. Some of my current techniques for attacking this problem are based on the solutions presented in this question. However, I am facing several challenges and constraints, so I was wondering if someone might take a stab at this problem. I am trying to tackle it using the bigmemory package, but I have been running into difficulties.

Present Constraints:

  • Using a Linux server with 16 GB of RAM
  • A 40 GB CSV file
  • Number of rows: 67,194,126,114

Challenges:

  • Need to be able to randomly sample smaller datasets (5-10 million rows) from a big.matrix or equivalent data structure.
  • Need to be able to remove any row containing a single instance of NULL while parsing into a big.matrix or equivalent data structure.

So far, results are not good. Evidently I am failing at something, or maybe I just don't understand the bigmemory documentation well enough. So I thought I would ask here to see if anyone has used bigmemory for something of this scale.
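For concreteness, this is roughly the kind of call I have been attempting; the file names and column type below are illustrative rather than my exact code:

library(bigmemory)

## Parse the CSV into a file-backed big.matrix so it need not fit in RAM
x <- read.big.matrix("data.csv", sep = ",", header = TRUE,
                     type = "double",
                     backingfile = "data.bin",
                     descriptorfile = "data.desc")

## Pull a random 5 million rows into an ordinary in-memory matrix
idx <- sample(nrow(x), 5e6)
sub <- x[idx, ]

Note this does nothing about the rows containing NULL, which is part of what I cannot get working.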

Any tips or advice on this line of attack? Or should I switch to something else? I apologize if this question is very similar to the previous one, but I thought my scale of data was about 20 times bigger than in the previous questions. Thanks!

– Shion
    How about a sample of the file contents? – Joshua Ulrich Mar 20 '13 at 19:23
  • Where exactly are you failing? What kind of data are in the .csv file -- is it all `double`s, `int`s or otherwise? How are `NULL` entries represented in the file? Are there row/column names? And, what have you tried? Given a .csv of appropriate structure, `read.big.matrix` should get you there. – Kevin Ushey Mar 20 '13 at 19:29
  • More info would be good, but why not import it into SQL, do some preparation there and then load it into R? – Manoel Galdino Mar 20 '13 at 20:08
  • thanks for the suggestions. Let me look at my data and again and get back to you guys on my issue. – Shion Mar 20 '13 at 21:48
  • I would suggest looking at the ff package. You would be writing the data to disk instead of memory. – larrydag Mar 28 '13 at 17:01

2 Answers


I don't know about bigmemory, but to satisfy your challenges you don't need to read the file into R at all. Simply pipe the file through some bash/awk/sed/python/whatever processing that does the steps you want, i.e. throws out NULL lines and randomly selects N lines, and then read that in.

Here's an example using awk (assuming you want 100 random lines from a file that has 1 million lines). Note that the line-count variable can't be called length, since that is a built-in awk function:

df <- read.csv(pipe("awk -F, '
    BEGIN { srand(); m = 100; N = 1000000 }   # m lines wanted, N lines total
    !/NULL/ {
        # keep each line with prob (lines still needed)/(lines remaining)
        if (rand() < m / (N - NR + 1)) {
            print; m--
            if (m == 0) exit
        }
    }' filename"))

It wasn't obvious to me what you meant by NULL, so I interpreted it literally (lines containing the string NULL), but it should be easy to modify to fit your needs.
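If GNU coreutils are available, a similar sketch avoids hand-rolling the sampling logic entirely: shuf picks random lines for you when given -n. The filename and the NULL pattern below are again placeholders:

## Drop lines containing NULL, then keep 100 random lines
df <- read.csv(pipe("grep -v NULL filename | shuf -n 100"),
               header = FALSE)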

– eddi
    This is actually a very good answer and I had solved my problem sometime back by implementing a very similar solution. Thank you for this answer. I will accept this. – Shion Apr 04 '13 at 01:15

This is a pure R solution to the challenge of sampling from a large text file; it has the additional merit of drawing a random sample of exactly n. It is not too inefficient, though lines are parsed into character vectors, which is relatively slow.

We start with a function signature where we provide a file name, the size of the sample we want to draw, a seed for the random number generator (so that we can reproduce our random sample!), an indication of whether there's a header line, and then a "reader" function that we'll use to parse the sample into the object seen by R, including additional arguments ... that the reader function might need:

fsample <-
    function(fname, n, seed, header=FALSE, ..., reader=read.csv)
{

The function seeds the random number generator, opens a connection, and reads in the (optional) header line:

    set.seed(seed)
    con <- file(fname, open="r")
    hdr <- if (header) {
        readLines(con, 1L)
    } else character()

The next step is to read in a chunk of n lines, initializing a counter of the total number of lines seen:

    buf <- readLines(con, n)
    n_tot <- length(buf)

Continue to read in chunks of n lines, stopping when there is no further input:

    repeat {
        txt <- readLines(con, n)
        if ((n_txt <- length(txt)) == 0L)
            break

For each chunk, draw a sample of n_keep lines, with the number of lines proportional to the fraction of total lines in the current chunk. This ensures that lines are sampled uniformly over the file. If there are no lines to keep, move on to the next chunk:

        n_tot <- n_tot + n_txt
        n_keep <- rbinom(1, n_txt, n_txt / n_tot)
        if (n_keep == 0L)
            next

Choose the lines to keep and the lines to replace, and update the buffer:

        keep <- sample(n_txt, n_keep)
        drop <- sample(n, n_keep)
        buf[drop] <- txt[keep]
    }

When data input is done, we parse the sample using the reader and return the result:

    reader(textConnection(c(hdr, buf)), header=header, ...)
}

The solution could be made more efficient, though a bit more complicated, by using readBin and searching for line breaks, as suggested by Simon Urbanek on the R-devel mailing list.
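That idea in miniature, as a sketch only (the chunk size and file name are illustrative, and this is just the newline scan, not the full sampler):

## Read a raw chunk and locate the newline bytes, so lines can be
## counted and sliced without parsing them into character vectors
con <- file("data.csv", open = "rb")
chunk <- readBin(con, what = "raw", n = 1e6)   # one ~1 MB chunk
newlines <- which(chunk == as.raw(10L))        # 0x0a is "\n"
close(con)

Putting it all together, here's the full readLines-based solution: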

fsample <-
    function(fname, n, seed, header=FALSE, ..., reader = read.csv)
{
    set.seed(seed)
    con <- file(fname, open="r")
    hdr <- if (header) {
        readLines(con, 1L)
    } else character()

    buf <- readLines(con, n)
    n_tot <- length(buf)

    repeat {
        txt <- readLines(con, n)
        if ((n_txt <- length(txt)) == 0L)
            break

        n_tot <- n_tot + n_txt
        n_keep <- rbinom(1, n_txt, n_txt / n_tot)
        if (n_keep == 0L)
            next

        keep <- sample(n_txt, n_keep)
        drop <- sample(n, n_keep)
        buf[drop] <- txt[keep]
    }

    reader(textConnection(c(hdr, buf)), header=header, ...)
}
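Usage would look something like this (the file name is hypothetical):

## A reproducible sample of 100 lines from a large CSV with a header
df <- fsample("data.csv", n = 100, seed = 42, header = TRUE)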
– Martin Morgan
  • Thank you for posting your code, and thank you for the excellent documentation. Would you happen to be able to point me towards an example using `readBin`? Thanks! – Zach May 25 '14 at 17:04