
I'm working with several large CSV files, large enough that I can't efficiently load them into memory.

Instead I would like to read a sample of data from each file. There have been other posts about this topic (such as Load a small random sample from a large csv file into R data frame), but my requirements are a little different: I would like to read the same rows from each file.

Using `read.csv()` with `skip` and `nrows = 1` would be very slow and tedious, as sketched below.
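
For concreteness, the per-row approach I want to avoid would look something like this (the file name and row indices are just placeholders):

```r
# One read.csv() call per sampled row: correct but painfully slow,
# since each call rescans the file from the top just to skip lines.
rows <- c(10, 5000, 123456)
one_file <- do.call(rbind, lapply(rows, function(i)
  read.csv("data1.csv", skip = i, nrows = 1, header = FALSE)))
```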

Does anyone have a suggestion for how to efficiently load the same N rows from several CSVs without reading them all into memory?

Ellis Valentiner
  • See http://stackoverflow.com/a/18282037/489448 for reading the file in chunks. In that answer the random sample is chosen as the chunks are read; if you instead do the random sampling once, outside the reading loop, you can use the same sample for all your files (see the first sketch after these comments). – kasterma Jul 13 '15 at 16:15
  • There is a `colClasses` option in `read.csv()`. If it is not provided, the whole column is read and the class of each column is determined afterwards, so if all the classes are known in advance, `read.csv()` is comparatively quick. You can also pass any integer to `nrows` and it will read that many rows from the top (see the second sketch below). – TrigonaMinima Jul 13 '15 at 16:24
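
A minimal sketch of kasterma's suggestion, assuming every file has the same number of data rows (known in advance) plus a header line with no quoted commas; the file names, row counts, and chunk size below are all placeholders:

```r
# Draw the sample indices once, then scan each file in chunks and keep
# only the sampled rows, so every file yields the same rows.
files      <- c("data1.csv", "data2.csv", "data3.csv")  # placeholder names
n_total    <- 1000000   # data rows per file (assumed known)
n_sample   <- 1000      # rows to sample
chunk_size <- 10000     # rows read per chunk

set.seed(42)
keep <- sort(sample(n_total, n_sample))   # drawn once, reused for every file

read_sampled <- function(path) {
  con <- file(path, open = "r")
  on.exit(close(con))
  col_names <- strsplit(readLines(con, n = 1), ",")[[1]]  # consume header
  out  <- list()
  done <- 0
  repeat {
    chunk <- tryCatch(
      read.csv(con, nrows = chunk_size, header = FALSE,
               stringsAsFactors = FALSE),
      error = function(e) NULL)            # read.csv errors at end of file
    if (is.null(chunk) || nrow(chunk) == 0) break
    # global row indices done+1 .. done+nrow(chunk) fall in this chunk
    idx <- keep[keep > done & keep <= done + nrow(chunk)] - done
    if (length(idx)) out[[length(out) + 1]] <- chunk[idx, , drop = FALSE]
    done <- done + nrow(chunk)
  }
  res <- do.call(rbind, out)
  names(res) <- col_names
  res
}

samples <- lapply(files, read_sampled)  # same N rows from each file
```

Because each file is read sequentially from one open connection, only one chunk is ever held in memory at a time.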

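A small sketch of TrigonaMinima's two points; the file name and column classes are placeholders for whatever your files actually contain:

```r
# Supplying colClasses lets read.csv() skip per-column type detection,
# and nrows limits the read to the top N rows; both reduce read time.
top_rows <- read.csv("data1.csv", nrows = 1000,
                     colClasses = c("integer", "character", "numeric"))
```

Note this reads the first 1000 rows rather than a random sample, but the same `nrows` value naturally returns the same rows from each file.
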
0 Answers