This is a pure R solution to the challenge of sampling from a large text file; it has the additional merit of drawing a random sample of exactly n lines. It is reasonably efficient, although every line is read into a character vector, which is relatively slow.
We start with a function signature, where we provide a file name, the size of the sample we want to draw, a seed for the random number generator (so that we can reproduce our random sample!), an indication of whether there's a header line, and a "reader" function that we'll use to parse the sample into the object seen by R. The ... argument collects any additional arguments that the reader function might need.
fsample <-
    function(fname, n, seed, header=FALSE, ..., reader=read.csv)
{
The function seeds the random number generator, opens a connection (closed again when the function exits), and reads in the (optional) header line.
    set.seed(seed)
    con <- file(fname, open="r")
    on.exit(close(con))
    hdr <- if (header) {
        readLines(con, 1L)
    } else character()
The next step is to read in a chunk of n lines, initializing a counter of the total number of lines seen.
    buf <- readLines(con, n)
    n_tot <- length(buf)
Continue to read in chunks of n lines, stopping when there is no further input.
    repeat {
        txt <- readLines(con, n)
        if ((n_txt <- length(txt)) == 0L)
            break
For each chunk, draw the number of lines to keep, n_keep, from a binomial distribution whose success probability is the fraction of the lines seen so far that fall in the current chunk. This is what spreads the sample over the whole file rather than favoring its start or end. If there are no lines to keep, move to the next chunk.
        n_tot <- n_tot + n_txt
        n_keep <- rbinom(1, n_txt, n_txt / n_tot)
        if (n_keep == 0L)
            next
Choose the lines to keep and the buffer entries to replace, then update the buffer.
        keep <- sample(n_txt, n_keep)
        drop <- sample(n, n_keep)
        buf[drop] <- txt[keep]
    }
When data input is done, we parse the buffer (with the header line, if any, prepended) using the reader and return the result.
    reader(textConnection(c(hdr, buf)), header=header, ...)
}
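The final parsing step can be tried on its own. The following is a minimal sketch with made-up data; the values of hdr and buf are illustrative stand-ins for what the function would have collected, and the text connection is closed explicitly here, which the function above does not do.

```r
# Illustrative stand-ins for the header line and the sampled lines
hdr <- "id,value"
buf <- c("3,0.5", "7,1.2", "1,9.9")

# Present the character vector to the reader as if it were a file
tc <- textConnection(c(hdr, buf))
df <- read.csv(tc, header = TRUE)
close(tc)   # read.csv() leaves an already-open connection open

df
```

Any reader that accepts a connection as its first argument and understands a header argument should fit the same pattern.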
The solution could be made more efficient, but a bit more complicated, by using readBin and searching for line breaks, as suggested by Simon Urbanek on the R-devel mailing list.
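To give a rough idea of what working with readBin might look like, here is a small sketch. It is only an illustration of the general idea, not the code from the mailing-list thread, and it assumes "\n" line endings, no empty lines, and a file small enough to fit in a single block; the block size and the toy file are arbitrary choices.

```r
# Write a small toy file in binary mode so the line endings are exactly "\n"
fname <- tempfile()
out <- file(fname, open = "wb")
writeLines(c("a,b", "1,2", "3,4", "5,6"), out, sep = "\n")
close(out)

# Read raw bytes instead of parsing every line into a character vector
con <- file(fname, open = "rb")
bytes <- readBin(con, what = "raw", n = 65536L)   # block size is arbitrary
close(con)

# Locate line boundaries by searching for the newline byte
nl <- which(bytes == as.raw(10L))        # positions of "\n"
starts <- c(1L, head(nl, -1L) + 1L)      # first byte of each line
ends <- nl - 1L                          # last byte of each line

# Convert only the lines we actually want (here lines 2 and 4) to character
pick <- c(2L, 4L)
picked <- vapply(pick, function(i) rawToChar(bytes[starts[i]:ends[i]]),
                 character(1))
picked
```

The potential saving is that only the sampled lines are ever turned into R character strings; a real implementation would also need to handle lines that straddle two blocks.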
Here's the full solution:
fsample <-
    function(fname, n, seed, header=FALSE, ..., reader = read.csv)
{
    set.seed(seed)
    con <- file(fname, open="r")
    on.exit(close(con))    # close the connection when the function exits
    hdr <- if (header) {
        readLines(con, 1L)
    } else character()

    buf <- readLines(con, n)
    n_tot <- length(buf)

    repeat {
        txt <- readLines(con, n)
        if ((n_txt <- length(txt)) == 0L)
            break

        n_tot <- n_tot + n_txt
        n_keep <- rbinom(1, n_txt, n_txt / n_tot)
        if (n_keep == 0L)
            next

        keep <- sample(n_txt, n_keep)
        drop <- sample(n, n_keep)
        buf[drop] <- txt[keep]
    }

    reader(textConnection(c(hdr, buf)), header=header, ...)
}
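One possible refinement, offered as a sketch and not as part of the solution above: when a uniform sample of n lines is drawn from n_tot lines, the number that fall in the latest chunk follows a hypergeometric distribution, so rhyper() could stand in for the binomial draw of n_keep. The difference should matter mostly for a final chunk that is shorter than n. The helper names update_buffer and sample_chunked are hypothetical, and the loop runs over an in-memory vector instead of a file so that the check is easy to run.

```r
# Hypothetical helper: one buffer update using a hypergeometric draw.
# n_tot is the number of lines seen so far, INCLUDING the lines in txt.
update_buffer <- function(buf, txt, n_tot) {
    n_txt <- length(txt)
    # How many of length(buf) draws from n_tot lines land in this chunk
    n_keep <- rhyper(1, n_txt, n_tot - n_txt, length(buf))
    if (n_keep > 0L) {
        keep <- sample(n_txt, n_keep)         # chunk lines to keep
        drop <- sample(length(buf), n_keep)   # buffer slots to overwrite
        buf[drop] <- txt[keep]
    }
    buf
}

# In-memory stand-in for the file loop: walk over 'lines' in chunks of n
sample_chunked <- function(lines, n) {
    buf <- head(lines, n)
    n_tot <- length(buf)
    while (n_tot < length(lines)) {
        txt <- lines[seq(n_tot + 1L, min(n_tot + n, length(lines)))]
        n_tot <- n_tot + length(txt)
        buf <- update_buffer(buf, txt, n_tot)
    }
    buf
}

# Empirical check: 25 toy lines, samples of 10, so the last chunk has 5 lines
set.seed(1)
reps <- 5000
draws <- replicate(reps, sample_chunked(1:25, 10), simplify = FALSE)
freq <- tabulate(unlist(draws), nbins = 25) / reps
round(freq, 2)
```

If the reasoning holds, each of the 25 toy lines should turn up in roughly 10/25 = 40% of the samples, including the five lines in the short final chunk.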