I need to sample rows from a file (the file is too big to load into memory). I have this snippet using BufferedReader:

    BufferedReader br = new BufferedReader(new FileReader(filename));
    String line;
    long DocCounter = 0;
    while ((line = br.readLine()) != null && DocCounter < 50000) {
        DocCounter++; // process the line here
    }

How can I adjust the code to randomly sample 50000 rows from the file? Thanks.

user3628777

2 Answers

Try this for any random value, then change it to 50000 in your case:

    String line = reader.readLine();
    for (int i = 0; i < randomInt + 1; i++) {
        line = reader.readLine();
    }
kiaGh
  • Isn't this exactly what I do in my code? I have a counter named "DocCounter", and when it reaches 50000 it exits the loop. However, there is nothing random here; it just takes the first 50000 rows. – user3628777 Sep 02 '14 at 11:25
  • You might want to try the indexed file reader at https://github.com/jramoyo/indexed-file-reader; there is a method for that, readLines. – kiaGh Sep 02 '14 at 11:34

To randomly sample 50000 lines you have to know the total number of rows in the file so you can distribute the samples across the entire file (and ensure you don't run out of lines too early).
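Getting the total takes one extra pass over the file. A minimal sketch of that pass, assuming a UTF-8 text file; the class and method names (`LineCounter`, `countLines`) are made up for illustration:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class LineCounter {
    // Streams the file once and counts lines; only one line is held in memory at a time.
    // Assumption: the file is UTF-8 encoded text.
    static long countLines(Path file) throws IOException {
        long n = 0;
        try (BufferedReader br = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            while (br.readLine() != null) {
                n++;
            }
        }
        return n;
    }
}
```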

The basic approach is to define an initial skip value

k = n/50000

where n is the total number of lines. Then loop through the file generating random numbers in the range

s = k ± e

where e is some fraction of k. At each iteration skip s lines, sample one line, then recalculate k based on the number of lines remaining after the skip. I.e. with t counting the lines consumed so far (initially 0), after the first iteration

t += s+1
k = (n-t)/49999

etc, updating the denominator each time. Beware of integer division boundary conditions as you get near the end of the file.
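The skip-and-sample loop above could look roughly like the following in Java. This is a sketch under assumptions, not a definitive implementation: `SkipSampler`, `sample`, and the `jitter` parameter (the fraction of k used for e) are illustrative names, the file is assumed to be UTF-8, and n is assumed to have been counted in an earlier pass. The clamp on s is one way to handle the boundary condition near the end of the file: it always leaves at least one line for each sample still to be taken.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

class SkipSampler {
    // n = total lines in the file, m = number of samples wanted,
    // jitter = e as a fraction of k (assumed to lie in [0, 1]).
    static List<String> sample(Path file, long n, int m, double jitter, Random rng)
            throws IOException {
        if (m < 0 || m > n) {
            throw new IllegalArgumentException("need 0 <= m <= n");
        }
        List<String> out = new ArrayList<>(m);
        long t = 0; // lines consumed so far (skipped or sampled)
        try (BufferedReader br = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            for (int r = m; r > 0; r--) {          // r = samples still to take
                long k = (n - t) / r;              // recalculated skip value
                long e = (long) (k * jitter);
                // s is drawn uniformly from the integers in [k - e, k + e]
                long s = k - e + (long) (rng.nextDouble() * (2 * e + 1));
                // Clamp so that one line remains for each outstanding sample.
                long maxSkip = n - t - r;
                s = Math.max(0, Math.min(s, maxSkip));
                for (long i = 0; i < s; i++) {
                    if (br.readLine() == null) {
                        return out;                // file was shorter than n
                    }
                }
                String line = br.readLine();
                if (line == null) {
                    return out;                    // file was shorter than n
                }
                out.add(line);
                t += s + 1;
            }
        }
        return out;
    }
}
```

For the question's case this would be called with m = 50000 and, say, jitter = 0.5. The samples come back in file order and are spread across the whole file; each line is not equally likely to be picked, so this is not a uniformly random subset.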

Jim Garrison