Load a small random sample from a large csv file into R data frame

Question

The csv file to be processed does not fit into memory. How can one read ~20K random lines from it into a data frame to compute basic statistics on the sample?

One [previous answer](http://stackoverflow.com/questions/15532810/reading-40-gb-csv-file-into-r-using-bigmemory/18282037#18282037) — Martin Morgan, Mar 07 '14 at 23:03

score 25 · Answer 1 · answered Mar 07 '14 at 21:55

You can also just do it in the terminal with perl.

perl -ne 'print if (rand() < .01)' biglist.txt > subset.txt

This won't necessarily get you exactly 20,000 lines: each line is kept independently with probability .01, so you get roughly 1% of the total lines. It is, however, very fast, and you end up with both the full file and the sample in your directory. You can then load the smaller file into R however you want.
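If an exact sample size matters, one possible variant is sketched below. It assumes GNU coreutils `shuf` is installed (it is not part of the answer above), and it builds a small demo `biglist.txt` with made-up columns so the commands can be tried safely; with a real file you would skip that step and set `n=20000`. It also assumes every CSV record sits on a single line, so quoted fields containing line breaks would not survive intact.

```shell
# Demo input (hypothetical): a header line plus 1000 data rows.
# In practice biglist.txt would be your large CSV.
{ echo "id,value"; seq 1 1000 | awk '{ print $1 "," $1 * 2 }'; } > biglist.txt

n=20   # sample size; use 20000 for the case in the question

# Copy the header, then append exactly n distinct data lines chosen at random.
head -n 1 biglist.txt > subset.txt
tail -n +2 biglist.txt | shuf -n "$n" >> subset.txt

wc -l < subset.txt   # n + 1 lines, counting the header
```

The resulting `subset.txt` keeps its header, so it can be read in R with a plain `read.csv("subset.txt")`.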

answered Mar 07 '14 at 21:55

Jed

nice, any way to keep the csv header? – pomber Nov 21 '14 at 00:09
@pomber you could first copy the header line (e.g. `head -1 file.txt > sample.txt`) and then run the perl operation with `>>` instead to append – geotheory Mar 09 '15 at 14:45
Is there a way to do this with Python? – Doon_Bogan Jun 02 '15 at 09:55
For Windows you'd need to change the `'` to `"` – Hack-R Apr 21 '16 at 15:36
Tried this using a csv as the bigFile, but it copied the whole file. – Conner M. Aug 05 '18 at 01:32
I tried with windows and csv and it worked fine. Thanks! – Maxwell Chandler Jan 30 '19 at 02:42
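
On the header question raised in the comments, a single-command alternative to the `head` plus `>>` approach might look like the sketch below. It relies on Perl's `$.` variable (the current input line number) to always print line 1. The demo file and its columns are made up for illustration, and the single quotes assume a Unix-like shell.

```shell
# Demo input (hypothetical): a header line plus 1000 data rows.
{ echo "id,value"; seq 1 1000 | awk '{ print $1 "," $1 * 2 }'; } > biglist.txt

# $. is Perl's current input line number, so line 1 (the header) is always
# printed; every other line is kept with probability 0.01.
perl -ne 'print if $. == 1 || rand() < .01' biglist.txt > subset.txt

head -n 1 subset.txt   # the header line
```

As with the original one-liner, the number of sampled data lines varies from run to run.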

score 8 · Answer 2 · answered Mar 07 '14 at 23:29

Try this, based on examples 6e and 6f on the sqldf GitHub home page:

library(sqldf)
DF <- read.csv.sql("x.csv", sql = "select * from file order by random() limit 20000")

See ?read.csv.sql and pass other arguments as needed, depending on the particulars of your file.

edited Nov 05 '17 at 11:42

answered Mar 07 '14 at 23:29

G. Grothendieck

score 5 · Answer 3 · answered Mar 07 '14 at 21:49

This should work:

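RowsInCSV <- 10000000  # number of data rows, not counting the header line; adjust to your file

# Each call reads one randomly chosen row: skip = k jumps past the header and the first k - 1 data rows.
# Rows are drawn independently, so the same row can occasionally be picked more than once.
List <- lapply(1:20000, function(x) read.csv("YourFile.csv", nrows = 1, skip = sample(RowsInCSV, 1), header = FALSE))
DF <- do.call(rbind, List)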

answered Mar 07 '14 at 21:49

Señor O

is it as fast as via Perl? – P.Escondido Mar 07 '14 at 22:00
Doubt it. Takes about 6 seconds on my machine, so it doesn't really make a difference unless you have to do it all the time. – Señor O Mar 07 '14 at 22:50
could it be that the arguments in the sample function are inverted? sample(RowsInCSV, 1)? Furthermore, I think a bracket in the end of the lapply command is missing. – pascal Jan 20 '21 at 16:20

score -4 · Answer 4 · answered Mar 28 '15 at 16:44

The following can be used in case you have an ID or something similar in your data. Take a sample of IDs, then take the subset of the data using the sampled ids.

sampleids <- sample(data$id, 1000)
newdata <- subset(data, id %in% sampleids)

answered Mar 28 '15 at 16:44

Philip John

Not at all helpful if, as OP says, "The csv file to be processed does not fit into the memory". – Gregor Thomas Mar 28 '15 at 17:36
