
I'm working with several large CSV files, large enough that I can't efficiently load them into memory.

Instead I would like to read a sample of data from each file. There have been other posts about this topic (such as Load a small random sample from a large csv file into R data frame), but my requirements are a little different: I would like to read the same rows from each file.

Using `read.csv()` with `skip` and `nrows = 1` would be very slow and tedious, as sketched below.
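
For concreteness, the per-row approach I want to avoid would look something like this (the file name and row indices are just placeholders):

```r
# One read.csv() call per sampled row: correct but painfully slow,
# since each call rescans the file from the top just to skip lines.
rows <- c(10, 5000, 123456)
one_file <- do.call(rbind, lapply(rows, function(i)
  read.csv("data1.csv", skip = i, nrows = 1, header = FALSE)))
```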

Does anyone have a suggestion for how to efficiently load the same N rows from several CSVs without reading them all into memory?

Ellis Valentiner
  • See http://stackoverflow.com/a/18282037/489448 for reading the file in chunks. In that answer the random sample is chosen as the chunks are read; if you instead do the random sampling once, outside the reading loop, you can use the same sample for all your files (see the first sketch after these comments). – kasterma Jul 13 '15 at 16:15
  • There is a `colClasses` option in `read.csv()`. If it is not provided, the whole column is read and the class of each column is determined afterwards, so if all the classes are known in advance, `read.csv()` is comparatively quick. You can also pass any integer to `nrows` and it will read that many rows from the top (see the second sketch below). – TrigonaMinima Jul 13 '15 at 16:24
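
A minimal sketch of kasterma's suggestion, assuming every file has the same number of data rows (known in advance) plus a header line with no quoted commas; the file names, row counts, and chunk size below are all placeholders:

```r
# Draw the sample indices once, then scan each file in chunks and keep
# only the sampled rows, so every file yields the same rows.
files      <- c("data1.csv", "data2.csv", "data3.csv")  # placeholder names
n_total    <- 1000000   # data rows per file (assumed known)
n_sample   <- 1000      # rows to sample
chunk_size <- 10000     # rows read per chunk

set.seed(42)
keep <- sort(sample(n_total, n_sample))   # drawn once, reused for every file

read_sampled <- function(path) {
  con <- file(path, open = "r")
  on.exit(close(con))
  col_names <- strsplit(readLines(con, n = 1), ",")[[1]]  # consume header
  out  <- list()
  done <- 0
  repeat {
    chunk <- tryCatch(
      read.csv(con, nrows = chunk_size, header = FALSE,
               stringsAsFactors = FALSE),
      error = function(e) NULL)            # read.csv errors at end of file
    if (is.null(chunk) || nrow(chunk) == 0) break
    # global row indices done+1 .. done+nrow(chunk) fall in this chunk
    idx <- keep[keep > done & keep <= done + nrow(chunk)] - done
    if (length(idx)) out[[length(out) + 1]] <- chunk[idx, , drop = FALSE]
    done <- done + nrow(chunk)
  }
  res <- do.call(rbind, out)
  names(res) <- col_names
  res
}

samples <- lapply(files, read_sampled)  # same N rows from each file
```

Because each file is read sequentially from one open connection, only one chunk is ever held in memory at a time.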

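A small sketch of TrigonaMinima's two points; the file name and column classes are placeholders for whatever your files actually contain:

```r
# Supplying colClasses lets read.csv() skip per-column type detection,
# and nrows limits the read to the top N rows; both reduce read time.
top_rows <- read.csv("data1.csv", nrows = 1000,
                     colClasses = c("integer", "character", "numeric"))
```

Note this reads the first 1000 rows rather than a random sample, but the same `nrows` value naturally returns the same rows from each file.
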
0 Answers