
I have a large text file with N lines. I have to read these lines in i iterations, which means reading n = Math.floor(N/i) lines in a single iteration. In each iteration I have to fill a string array of length n. So the basic question is: how should I read n lines in optimum time? The simplest way is to use a BufferedReader and read one line at a time with BufferedReader.readLine(), but it will significantly decrease performance if n is too large. Is there a way to read exactly n lines at a time?
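For reference, a minimal sketch of the readLine() approach described above (the file name big.txt and the concrete values of N and i are placeholders, not from the question):

import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

public class ChunkedLineReader {
    public static void main(String[] args) throws IOException {
        int N = 1_000_000;   // total line count, assumed known
        int i = 10;          // number of iterations
        int n = N / i;       // integer division == Math.floor(N/i) for positive ints
        try (BufferedReader reader = Files.newBufferedReader(Paths.get("big.txt"))) {
            for (int iter = 0; iter < i; iter++) {
                String[] chunk = new String[n];
                for (int k = 0; k < n; k++) {
                    chunk[k] = reader.readLine();  // one line per call
                }
                // process chunk here
            }
        }
    }
}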

Tariq
  • Which language? Java? Add that tag. – Daksh Shah Feb 16 '14 at 08:06
  • Please add the pseudo code that you think has a problem. It looks like you want to read N lines in i steps. If you read one line at a time, it should not decrease performance as long as you are using streaming file I/O. – Shital Shah Feb 16 '14 at 08:21
  • You mean *`Math.ceil(N/i)`*, right? – herohuyongtao Feb 16 '14 at 08:22
  • 1
    "it will significantly decrease performance": can you explain what you have in mind? –  Feb 16 '14 at 10:30
  • possible duplicate of [Faster way to read file](http://stackoverflow.com/questions/5854859/faster-way-to-read-file). See also [Java tip: How to read files quickly](http://nadeausoftware.com/articles/2008/02/java_tip_how_read_files_quickly) and [Java, the fastest class to read from a txt file](http://stackoverflow.com/q/13480183/341970) – Ali Feb 16 '14 at 14:28
  • @DakshShah: I have added the tag. @YvesDaoust: I suspect that repeated function calls will increase I/O time compared to reading bulk data in a single function call, but this call must return exactly `n` lines. – Tariq Feb 16 '14 at 14:57

2 Answers


To read n lines from a text file, from a system point of view there is no way around reading as many characters as necessary until you have seen n end-of-line delimiters (unless the file has been preprocessed to index these, but I doubt that is allowed here).

As far as I know, no file I/O system supports a function to read "until the nth occurrence of some character", nor "the next n lines" (but I am probably wrong).

If you really want to minimize the number of I/O function calls, your last resort is block I/O, with which you can read a "page" at a time (say, of length n times the expected or maximum line length) and detect the end-of-lines yourself.
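A minimal sketch of that block-I/O idea, assuming \n line endings and a placeholder file name; a real implementation would also slice the line strings out of the buffer and carry the tail after the last \n over to the next block, since a line may straddle two blocks:

import java.io.IOException;
import java.io.RandomAccessFile;

public class BlockLineScanner {
    public static void main(String[] args) throws IOException {
        try (RandomAccessFile file = new RandomAccessFile("big.txt", "r")) {
            byte[] page = new byte[64 * 1024];  // one "page" per I/O call (size is an assumption)
            long lines = 0;
            int read;
            while ((read = file.read(page)) != -1) {
                for (int pos = 0; pos < read; pos++) {
                    if (page[pos] == '\n') {    // detect the end-of-lines ourselves
                        lines++;
                    }
                }
            }
            System.out.println("lines seen: " + lines);
        }
    }
}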


I agree with Yves Daoust's answer, except for the paragraph recommending

If you really want to minimize the number of I/O function calls, your last resort is block I/O, with which you can read a "page" at a time (say, of length n times the expected or maximum line length) and detect the end-of-lines yourself.

There's no need to "detect the end-of-lines yourself". Something like

new BufferedReader(new InputStreamReader(is, charset), 8192);

creates a reader with a buffer of 8192 chars. The question is how useful this is for reading data in blocks: for that, a byte[] buffer is needed, and there's a sun.nio.cs.StreamDecoder in between, which I haven't looked into.

To be sure, use

new BufferedReader(new InputStreamReader(new BufferedInputStream(is, 8192), charset));

so you get a byte[] buffer.

Note that 8192 is the default buffer size for both BufferedReader and BufferedInputStream, so leaving it out would change nothing in the above examples. Also note that using much larger buffers makes no sense and can even be detrimental to performance.

Update

So far you get all the buffering needed and this should suffice. In case it doesn't, you can try:

  • to avoid the decoder overhead. When your lines are terminated by \n, you can look for (byte) '\n' in the file content without decoding it (unless you're using some exotic Charset); see the sketch after this list.
  • to prefetch the data yourself. Normally the OS takes care of this, so when your buffer becomes empty and Java calls into the OS, the data is already in memory.
  • to use a memory mapped file, so that no OS calls are needed for fetching more data (as all data "are" there when the mapping gets created).
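A minimal sketch combining the first and last points: scanning a memory-mapped file for (byte) '\n' without any decoding. This assumes \n-terminated lines, an ASCII-compatible charset, a file small enough for a single mapping (under 2 GB), and a placeholder file name:

import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class MappedLineCounter {
    public static void main(String[] args) throws IOException {
        try (FileChannel channel = FileChannel.open(Paths.get("big.txt"), StandardOpenOption.READ)) {
            MappedByteBuffer map = channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
            long lines = 0;
            while (map.hasRemaining()) {
                if (map.get() == (byte) '\n') {  // no decoder, no read() calls
                    lines++;
                }
            }
            System.out.println("lines seen: " + lines);
        }
    }
}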
maaartinus
  • But how do you delimit the next `n` lines then? –  Feb 18 '14 at 16:40
  • @YvesDaoust: By simply invoking `bufferedReader.readLine` `n` times in a row? The first invocation fills the buffer and the others work from it until more data needs to be fetched. – maaartinus Feb 19 '14 at 00:48
  • Yep, that's what I call "detect the end-of-lines yourself". You need to write a loop and execute it `n` times; in the body of the loop you invoke some in-buffer `readLine` function, or some character search function such as `indexOf`, or as you say, "you can look for `(byte) \n`". And you need to handle the cases where the buffer contains less than `n` lines. –  Feb 19 '14 at 07:20
  • @YvesDaoust: That's fine, I just wanted to stress that there's some buffering anyway and that you can minimize the IO simply by using `new BufferedInputStream(is, reallyALot)`. The other optimizations go further than just minimizing IO. – maaartinus Feb 19 '14 at 07:31
  • I wanted to stress that this is exactly what I suggested. –  Feb 19 '14 at 07:41