The CSV file to be processed does not fit into memory. How can one read ~20K random lines from it to compute basic statistics on the resulting data frame?
4 Answers
25
You can also just do it in the terminal with perl.
perl -ne 'print if (rand() < .01)' biglist.txt > subset.txt
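This won't necessarily get you exactly 20,000 lines: each line is kept independently with probability 0.01, so you get roughly 1% of the total lines. (To aim for about 20,000, use 20000 divided by the total line count as the threshold.) It is, however, very fast, since it makes a single pass over the file, and you end up with both files in your directory. You can then load the smaller file into R however you want.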
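If an exact sample size matters, one option is to shuffle the data lines and keep N of them. The following is a minimal sketch, not part of the original answer. It assumes GNU `shuf` is installed, that the file has a single header line, and that every record sits on one line (no quoted fields with embedded newlines). The function name `sample_exact` is made up for illustration.

```shell
# sample_exact SRC DST N: write the header of SRC plus N randomly chosen
# data lines to DST. Hypothetical helper; assumes GNU shuf is available.
sample_exact() {
  src=$1; dst=$2; n=$3
  head -n 1 "$src" > "$dst"                    # keep the header line
  tail -n +2 "$src" | shuf -n "$n" >> "$dst"   # N random lines from the rest
}

# Example call (file names as in the answer above):
# sample_exact biglist.txt subset.txt 20000
```

Unlike the `rand() < .01` filter, this returns exactly N data lines whenever the file has at least that many, at the cost of depending on `shuf`.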

Jed
@pomber You could first copy the header line (e.g. `head -1 file.txt > sample.txt`) and then run the perl operation with `>>` instead of `>` to append. – geotheory Mar 09 '15 at 14:45
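The header-then-append idea from this comment can be sketched as a small shell function. This is an illustration under assumptions, not the commenter's code: the name `sample_about` and its arguments are hypothetical, the file is assumed to have exactly one header line and at least one data line, and the sampling rate is derived from a target count rather than hard-coded as .01.

```shell
# sample_about SRC DST TARGET: copy the header of SRC to DST, then append
# each data line with probability TARGET / (number of data lines).
# Hypothetical helper; the result has roughly, not exactly, TARGET lines.
sample_about() {
  src=$1; dst=$2; target=$3
  total=$(( $(wc -l < "$src") - 1 ))   # data lines, excluding the header
  head -n 1 "$src" > "$dst"            # ">" creates the file with the header
  # The first two arguments are removed from @ARGV before the file is read.
  # $. is the current line number, so line 1 (the header) is never re-printed.
  perl -ne 'BEGIN { ($n, $t) = splice(@ARGV, 0, 2); $p = $n / $t }
            print if $. > 1 && rand() < $p' "$target" "$total" "$src" >> "$dst"
}

# Example call (file names are placeholders):
# sample_about file.txt sample.txt 20000
```

Skipping line 1 inside the perl filter keeps the header from being sampled a second time into the appended output.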
Tried this using a csv as the bigFile, but it copied the whole file. – Conner M. Aug 05 '18 at 01:32
8
Try this based on examples 6e and 6f on the sqldf github home page:
library(sqldf)
DF <- read.csv.sql("x.csv", sql = "select * from file order by random() limit 20000")
See ?read.csv.sql, using other arguments as needed based on the particulars of your file.

G. Grothendieck
5
This should work:
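RowsInCSV <- 10000000  # Number of data rows, excluding the header; adjust to your file
# Each call reads one random row. Assuming the file has a header line, skipping k
# lines (the header plus k - 1 data rows) returns data row k. Rows are drawn with replacement.
List <- lapply(1:20000, function(x) read.csv("YourFile.csv", nrows = 1, skip = sample(RowsInCSV, 1), header = FALSE))
DF <- do.call(rbind, List)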

Señor O
Doubt it. Takes about 6 seconds on my machine, so it doesn't really make a difference unless you have to do it all the time. – Señor O Mar 07 '14 at 22:50
Could it be that the arguments in the sample function are inverted, i.e. it should be sample(RowsInCSV, 1)? Furthermore, I think a closing bracket is missing at the end of the lapply command. – pascal Jan 20 '21 at 16:20
-4
The following can be used in case you have an ID or something similar in your data. Take a sample of IDs, then take the subset of the data using the sampled ids.
sampleids <- sample(data$id,1000)
newdata <- subset(data, data$id %in% sampleids)

Philip John
Not at all helpful if, as OP says, "The csv file to be processed does not fit into the memory". – Gregor Thomas Mar 28 '15 at 17:36