5

I have a huge file of coordinates, about 125 million lines. I want to sample these lines to obtain, say, 1% of all the lines so that I can plot them. Is there a way to do this in R? The file is very simple; it has only 3 columns, and I am only interested in the first two. A sample of the file would be as follows:

1211 2234
1233 2348
.
.
.

Any help / pointer is highly appreciated.

Sam
  • 7,922
  • 16
  • 47
  • 62
  • 1
    I think you want this - http://stackoverflow.com/a/15798275/817778 – eddi Sep 09 '13 at 19:45
  • 1
    or the [other answer to the same question](http://stackoverflow.com/questions/15532810/reading-40-gb-csv-file-into-r-using-bigmemory/18282037#18282037), which is a pure R solution – Martin Morgan Sep 09 '13 at 20:04

4 Answers

4

If you have a fixed sample size that you want to select and you do not know ahead of time how many rows the file has, here is some sample code that will produce a simple random sample of the data without storing the whole dataset in memory:

n <- 1000                                   # desired sample size
con <- file("jan08.csv", open = "r")
head <- readLines(con, 1)                   # keep the header line
sampdat <- readLines(con, n)                # fill the reservoir with the first n lines
k <- n                                      # number of lines seen so far
while (length(curline <- readLines(con, 1))) {
    k <- k + 1
    if (runif(1) < n/k) {                   # keep line k with probability n/k ...
        sampdat[sample(n, 1)] <- curline    # ... overwriting a random reservoir entry
    }
}
close(con)
delaysamp <- read.csv(textConnection(c(head, sampdat)))
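
For the question's space-separated, headerless coordinate file, the same reservoir loop applies; a sketch of the adaptation (the file name coords.txt is a placeholder, and read.table() replaces read.csv() since there is no header) might be:

n <- 1250000                                 # roughly 1% of 125 million lines
con <- file("coords.txt", open = "r")        # placeholder name for the coordinate file
sampdat <- readLines(con, n)                 # no header line, so fill the reservoir directly
k <- n
while (length(curline <- readLines(con, 1))) {
    k <- k + 1
    if (runif(1) < n/k) {
        sampdat[sample(n, 1)] <- curline
    }
}
close(con)
samp <- read.table(text = sampdat)           # parse the whitespace-separated columns
plot(samp[, 1], samp[, 2])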

If you are working with the large dataset more than once, it may be better to read the data into a database and then sample from there.
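
A minimal sketch of the database route, assuming the sqldf package and a whitespace-separated file named coords.txt (both assumptions, not something from the question): read.csv.sql() imports the file into a temporary on-disk SQLite database, runs the query there, and returns only the sampled rows to R.

library(sqldf)

# the query samples inside SQLite, so only ~1% of the rows ever reach R
samp <- read.csv.sql("coords.txt",
                     sql = "select * from file order by random() limit 1250000",
                     header = FALSE, sep = " ")
plot(samp[, 1], samp[, 2])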

The ff package is another option: it stores a large data object in a file while still letting you grab parts of it within R in a simple manner.
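
A minimal sketch with ff, again assuming a whitespace-separated file named coords.txt:

library(ff)

# read.table.ffdf() keeps the data in on-disk ff files, holding only chunks in RAM
coords <- read.table.ffdf(file = "coords.txt", header = FALSE, sep = " ")

# draw 1% of the row indices and pull just those rows back as an ordinary data.frame
idx <- sample(nrow(coords), round(nrow(coords)/100))
samp <- coords[idx, ]
plot(samp[, 1], samp[, 2])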

Greg Snow
  • 48,497
  • 6
  • 83
  • 110
2

The LaF package and its sample_lines function are one option for reading a sample from the file:

datafile <- "file.txt" # file from working directory
sample_line(datafile, length(datafile)/100) # this give 1 % of lines 

More about sample_lines: https://rdrr.io/cran/LaF/man/sample_lines.html
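
Note that sample_lines() returns the sampled lines as a character vector, so to plot them you still need to parse them into columns; one way, assuming whitespace-separated values, is:

smp <- sample_lines(datafile, round(nlines/100))
coords <- read.table(text = smp)   # parse the sampled lines into numeric columns
plot(coords[, 1], coords[, 2])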

vtenhunen
  • 49
  • 4
1

As far as I understood your question, this could be helpful:

> set.seed(1)
> big.file <- matrix(rnorm(1e3, 100, 3), ncol=2) # simulating your big data
> 
> 
> # choosing 1% randomly
> one.percent <- big.file[sample(1:nrow(big.file), 0.01*nrow(big.file)), ]
> one.percent
          [,1]      [,2]
[1,]  99.40541 106.50735
[2,]  98.44774  98.53949
[3,] 101.50289 102.74602
[4,]  96.24013 104.97964
[5,] 101.67546 102.30483

Then you can plot it

>  plot(one.percent)
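
If the real file does fit in memory, the same idea applied to it might look like this (the file name and the read.table() call are assumptions about the format):

big.file <- read.table("coords.txt")    # 3 whitespace-separated columns, no header
idx <- sample(nrow(big.file), round(0.01 * nrow(big.file)))
one.percent <- big.file[idx, 1:2]       # keep only the first two columns
plot(one.percent)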
Jilber Urbina
  • 58,147
  • 10
  • 114
  • 138
0

If you don't want to read the whole file into R, something like this?

mydata <- matrix(nrow = 1250000, ncol = 2)  # assuming 2 columns in your source file
for (j in 1:1250000) mydata[j, ] <- scan('myfile', skip = j*100 - 1, nlines = 1)

plus whatever arguments you may need for the data type in your file, no header, etc. And if you don't want evenly spaced samples, you'll need to generate (for 1% of 125 million) 1.25 million integer values randomly selected over 1:1.25e8, as in the sketch below.
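
A sketch of that random variant (the seed, the file name, and keeping only the first two columns are assumptions; as noted in the comments below, repeated scan() calls like this can be slow):

set.seed(1)
wanted <- sort(sample(125e6, 1.25e6))            # 1.25 million random line numbers
mydata <- matrix(nrow = length(wanted), ncol = 2)
for (j in seq_along(wanted)) {
    fields <- scan('myfile', skip = wanted[j] - 1, nlines = 1, quiet = TRUE)
    mydata[j, ] <- fields[1:2]                   # keep only the first two columns
}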

EDIT: my apologies - I neglected to put the nlines=1 argument in there.

Carl Witthoft
  • 20,573
  • 9
  • 43
  • 73
  • do note that you'll keep scanning (ever growing bits of) the file over and over by doing this, and I wouldn't be surprised if this took longer than reading the entire file in for some sizes – eddi Sep 09 '13 at 20:06
  • Correct me if I'm wrong, but you cannot specify non-contiguous lines to be read in `scan()`. – Ferdinand.kraft Sep 09 '13 at 20:10
  • @Ferdinand.kraft my bad - see edits to add `nlines=1`. Granted this may be slow - probably should `open` the file and keep it open until done. – Carl Witthoft Sep 09 '13 at 21:01