I have a very large (11GB) .json file (yeah, whoever thought that was a great idea?) that I need to sample (read k random lines).
I'm not very savvy in Java file IO but I have, of course, found this post: How to get a random line of a text file in Java?
I'm dropping the accepted answer because it's clearly way too slow to read every single line of an 11GB file just to select one (or rather k) out of the roughly 100k lines.
Fortunately, there is a second suggestion posted there that I think might be of better use to me:
1. Use RandomAccessFile to seek to a random byte position in the file.
2. Seek left and right to the next line terminator. Let L be the line between them.
3. With probability (MIN_LINE_LENGTH / L.length) return L. Otherwise, start over at step 1.
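For reference, this is roughly how I read the three quoted steps, as a minimal and untuned sketch. The names `RandomLineSampler`, `lineStart`, `lineEnd`, `sampleLine` and the `minLineLength` parameter are my own and not from the linked post. It assumes `'\n'`-terminated lines and that `minLineLength` really is a lower bound on every line's length in bytes (terminator included); otherwise the sampling is no longer uniform.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.Random;

class RandomLineSampler {

    /** Returns the offset of the first byte of the line containing byte offset pos. */
    static long lineStart(RandomAccessFile raf, long pos) throws IOException {
        while (pos > 0) {
            raf.seek(pos - 1);
            if (raf.read() == '\n') {
                break; // the byte before pos ends the previous line
            }
            pos--;
        }
        return pos;
    }

    /** Returns the offset just past the line containing pos (past its '\n', or at EOF). */
    static long lineEnd(RandomAccessFile raf, long pos) throws IOException {
        raf.seek(pos);
        int b;
        do {
            b = raf.read(); // scan right until the terminator or end of file
        } while (b != -1 && b != '\n');
        return raf.getFilePointer();
    }

    /** Picks one line uniformly at random by rejection sampling. */
    static String sampleLine(RandomAccessFile raf, long minLineLength, Random rnd)
            throws IOException {
        long fileLength = raf.length();
        while (true) {
            // Step 1: random byte position in [0, fileLength)
            long pos = (long) (rnd.nextDouble() * fileLength);
            // Step 2: find both ends of the line L around that position
            long start = lineStart(raf, pos);
            long end = lineEnd(raf, pos);
            long length = end - start;
            // Step 3: accept L with probability minLineLength / length
            if (rnd.nextDouble() * length < minLineLength) {
                raf.seek(start);
                return raf.readLine();
            }
        }
    }
}
```

Long lines are hit more often by the random seek, and the rejection step is what cancels that bias. The byte-by-byte `read()` calls are unbuffered, so this is only meant to show the logic, not to be fast.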
So far so good, but I was wondering about that "let L be the line between them".
I would have done something like this (untested):
RandomAccessFile raf = ...
long pos = ...
String line = getLine(raf,pos);
...
where
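private String getLine(RandomAccessFile raf, long start) throws IOException {
    // Scan backwards one byte at a time: readLine() works on single bytes,
    // whereas readChar() would consume two bytes as one UTF-16 code unit.
    long pos = start;
    while (pos > 0) {
        raf.seek(pos - 1);
        if (raf.read() == '\n') {
            break; // the byte before pos ends the previous line
        }
        pos--;
    }
    raf.seek(pos); // pos is now the first byte of the line containing start
    return raf.readLine();
}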
and then operated with line.length(), which forgoes the need to explicitly seek the right end of the line.
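One caveat I'm aware of, as far as I understand the Javadoc: `RandomAccessFile.readLine()` turns every byte into one char, so `line.length()` is the line's length in bytes without the terminator, and any non-ASCII UTF-8 text comes back garbled. A small sketch of how the text could be recovered, assuming the file is UTF-8 (the class and method names here are hypothetical):

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;

class ReadLineLength {

    /** Reads the line at the current file pointer and re-decodes its raw bytes as UTF-8. */
    static String readUtf8Line(RandomAccessFile raf) throws IOException {
        String raw = raf.readLine(); // one char per byte, high eight bits set to zero
        if (raw == null) {
            return null; // end of file
        }
        // ISO-8859-1 maps chars 0..255 back to the identical byte values
        byte[] bytes = raw.getBytes(StandardCharsets.ISO_8859_1);
        return new String(bytes, StandardCharsets.UTF_8);
    }
}
```

For the acceptance probability the byte count is what I'd want anyway, since the random seek picks a byte position, not a character position.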
So why "seek left and right to the next line terminator"? Is there a more convenient way to get the line from these two offsets?