Java - Process bytes as they are being read from a file

Question

Is there a way to have one thread in Java make a read call to some FileInputStream or similar and have a second thread process the bytes as they are being loaded? I've tried a number of things - my current attempt has one thread running this:

FileChannel inStream;
try {
    inStream = new FileInputStream(inFile).getChannel();
} catch (FileNotFoundException e) {
    e.printStackTrace();
}
int result;
try {
    result = inStream.read(inBuffer);
} ...

A second thread wants to access the bytes as they are being loaded. Clearly the read call in the first thread blocks until the buffer is full, but I want to be able to access the bytes loaded into the buffer before that point. Currently, in everything I try, the buffer and its backing array remain unchanged until the read completes - this not only defeats the point of the threading but also suggests the data is being loaded into some intermediate buffer somewhere and then copied into my buffer later, which seems daft.

One option would be to do a bunch of smaller reads into the array with offsets on subsequent reads, but that adds extra overhead.

Any ideas?

Another option I considered was to use PipedInput/Output streams. This seems like it will probably work, but there's added overhead from doing that - why can't I just make my FileChannel or FileInputStream "flow" into my ByteBuffer or some byte array? — Chris Kitching, Aug 02 '12 at 20:27
"Clearly the read call in the first thread blocks until the buffer is full" It shouldn't. A call to `read()` should block until data is *available*. (The IO subsystem of your OS is responsible for delivering data to stream-specific buffers.) Also, why use NIO for this use case? You don't seem to be using its features anywhere. — millimoose, Aug 02 '12 at 21:01
It doesn't 'block until the buffer is full'. It reads at least one byte from the stream, blocking until at least one byte is available. In any case as it is a file there is basically no real blocking at all. You don't need two threads to solve this problem. The second one would have to block on the first and the first has to block on the input. There is nothing to be gained. — user207421, Aug 03 '12 at 08:13

score 3 · Accepted Answer · answered Aug 02 '12 at 20:30

When you read data sequentially, the OS reads ahead, fetching the data before you ask for it. As the system is already doing this for you, a second thread may not give you the benefit you expect.

why can't I just make my FileChannel or FileInputStream "flow" into my ByteBuffer or some byte array?

That is sort of what it does already.

If you want more seamless loading of the data, you can use a memory-mapped file: it "appears" in the memory of the program immediately and is loaded in the background as you use it.
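As a rough illustration of the memory-mapped approach, here is a minimal sketch using FileChannel.map. The class name MappedFileSum, the method sumBytes, and the byte-summing "processing" are all hypothetical placeholders, and the sketch assumes the file fits in a single mapping (at most Integer.MAX_VALUE bytes).

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

class MappedFileSum {
    // Hypothetical "processing": sums every byte of the file as an unsigned value.
    static long sumBytes(Path file) throws IOException {
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            // Assumption: the file is small enough for one mapping (<= Integer.MAX_VALUE bytes).
            MappedByteBuffer mapped =
                    channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
            long sum = 0;
            // Pages are faulted in by the OS as the loop touches them.
            while (mapped.hasRemaining()) {
                sum += mapped.get() & 0xFF;
            }
            return sum;
        }
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("mapped-demo", ".bin");
        tmp.toFile().deleteOnExit();
        Files.write(tmp, new byte[] {1, 2, 3, (byte) 200});
        System.out.println(sumBytes(tmp));
    }
}
```

The mapping stays valid after the channel is closed, so a processing thread could keep working on the buffer; whether this beats plain sequential reads depends on the OS and access pattern.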

score 1 · Answer 2 · answered Aug 02 '12 at 20:45

I would recommend using a SynchronousQueue. The writer thread "publishes" each piece of data it reads from your file to the queue, and the reader thread retrieves it from the queue for processing.
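A minimal sketch of such a hand-off, under some assumptions: the class name HandoffDemo, the method process, the chunk size, and the empty-array END sentinel are all made up for illustration, and the "processing" just counts bytes. Each chunk is copied before being handed over so the file-reading thread can reuse its buffer.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;
import java.util.concurrent.SynchronousQueue;

class HandoffDemo {
    // Hypothetical end-of-data marker, compared by identity rather than by content.
    private static final byte[] END = new byte[0];

    // Reads `in` on a separate thread and processes each chunk on the calling thread.
    static long process(InputStream in, int chunkSize) throws InterruptedException {
        SynchronousQueue<byte[]> queue = new SynchronousQueue<>();

        Thread fileReader = new Thread(() -> {
            try {
                try {
                    byte[] buf = new byte[chunkSize];
                    int n;
                    while ((n = in.read(buf)) != -1) {
                        // put() blocks until the processing thread takes the chunk.
                        if (n > 0) queue.put(Arrays.copyOf(buf, n));
                    }
                } catch (IOException e) {
                    e.printStackTrace(); // a real program would report this to the consumer
                } finally {
                    queue.put(END); // always signal end of data
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        fileReader.start();

        long total = 0;
        byte[] chunk;
        while ((chunk = queue.take()) != END) {
            total += chunk.length; // placeholder for real processing
        }
        fileReader.join();
        return total;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(process(new ByteArrayInputStream(new byte[10_000]), 4096));
    }
}
```

Because a SynchronousQueue has no capacity, the file-reading thread can never run more than one chunk ahead of the processing thread; a bounded BlockingQueue would allow some slack.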

answered by yegor256

score 1 · Answer 3 · answered Aug 02 '12 at 21:26

What I usually do with requirements like this is to use multiple buffer class instances, preferably sized to allow efficient loading - a multiple of the cluster size, say. As soon as the first buffer is loaded, queue it off (i.e. push its pointer/instance onto a producer-consumer queue) to the thread that will process it, then immediately create (or depool) another buffer instance and start loading that one. To control overall data flow, you can create a suitable number of buffer objects at startup and store them in a 'pool queue' (another producer-consumer queue). The objects then circulate from the pool to the file-read thread, then, full of data, to the buffer-processing thread, then back to the pool.

This keeps the file->processing queue 'topped up' with buffer-objects full of data, no bulk copying required, no unavoidable delays, no inefficient inter-thread comms of single bytes, no messy locking of buffer-indexes, no chance that the file-read thread and data-processing thread can ever operate on the same buffer object.

If you want/need to use a threadPool to perform the processing, you can easily do so but you may need a sequence-number in the buffer objects if you need any resulting output from this subsystem to be in the same order as it was read from the file.

The buffer-objects may also contain result data members, exception/errorMessage fields, anything that you might want. The file and/or result data could easily be forwarded on to other thread/s from the data-processing (e.g. a logger or GUI display of progress) before getting repooled. Since it's all just pointer/instance queueing, the huge amount of data will flow around your system quickly and efficiently.
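The circulation described above can be sketched roughly as follows. This is only one possible reading of the design: the names BufferPoolDemo, Chunk, and checksum are hypothetical, ArrayBlockingQueue stands in for the producer-consumer queues, a length of -1 is used as an assumed end-of-input marker, and read errors are simply treated as end of input.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

class BufferPoolDemo {
    // Hypothetical buffer object: backing array plus count of valid bytes (-1 = end of input).
    static final class Chunk {
        final byte[] data;
        int length;
        Chunk(int size) { data = new byte[size]; }
    }

    // Assumes bufferCount > 0 and bufferSize > 0.
    static long checksum(InputStream in, int bufferCount, int bufferSize)
            throws InterruptedException {
        BlockingQueue<Chunk> pool = new ArrayBlockingQueue<>(bufferCount);
        BlockingQueue<Chunk> filled = new ArrayBlockingQueue<>(bufferCount);
        for (int i = 0; i < bufferCount; i++) pool.add(new Chunk(bufferSize));

        Thread fileReader = new Thread(() -> {
            try {
                while (true) {
                    Chunk c = pool.take();          // wait for a free buffer
                    try {
                        c.length = in.read(c.data); // fill it directly, no copying
                    } catch (IOException e) {
                        c.length = -1;              // sketch only: treat errors as end of input
                    }
                    filled.put(c);                  // hand the instance to the processor
                    if (c.length < 0) return;
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        fileReader.start();

        long sum = 0;
        while (true) {
            Chunk c = filled.take();
            if (c.length < 0) break;
            for (int i = 0; i < c.length; i++) sum += c.data[i] & 0xFF; // placeholder work
            pool.put(c);                            // recycle the buffer
        }
        fileReader.join();
        return sum;
    }

    public static void main(String[] args) throws InterruptedException {
        byte[] data = new byte[10_000];
        Arrays.fill(data, (byte) 1);
        System.out.println(checksum(new ByteArrayInputStream(data), 4, 1024));
    }
}
```

The fixed number of Chunk objects is what throttles the file-read thread: once every buffer is in flight, pool.take() blocks until the processing side returns one.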

Thanks for the suggestion - this seems to be working well so far. Seems there is simply no way to have the bytes from an ongoing read call be accessible while the call is running without using memory mapping. — Chris Kitching, Aug 03 '12 at 16:33

score 0 · Answer 4 · answered Aug 02 '12 at 20:56

Use a PipedInputStream/PipedOutputStream pair to create a familiar-looking pipe with a buffer.

Alternatively, read from a FileInputStream byte by byte if necessary. Note that fis.read() blocks until a byte is available and returns -1 only at end of stream; you can call available() for an estimate of how many bytes can be read without blocking.
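A minimal sketch of the piped-stream idea, with assumptions stated: the class name PipeDemo, the method countBytes, the 64 KiB pipe size, and the 8 KiB copy buffers are arbitrary choices for illustration, and the "processing" merely counts bytes.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.PipedInputStream;
import java.io.PipedOutputStream;

class PipeDemo {
    // Copies `source` into a pipe on one thread and consumes the pipe on the calling thread.
    static long countBytes(InputStream source) throws IOException, InterruptedException {
        PipedInputStream pipeIn = new PipedInputStream(64 * 1024); // assumed pipe buffer size
        PipedOutputStream pipeOut = new PipedOutputStream(pipeIn);

        Thread writer = new Thread(() -> {
            // Closing the output end is what lets the consumer see end of stream.
            try (PipedOutputStream out = pipeOut) {
                byte[] buf = new byte[8192];
                int n;
                while ((n = source.read(buf)) != -1) {
                    out.write(buf, 0, n);
                }
            } catch (IOException e) {
                e.printStackTrace();
            }
        });
        writer.start();

        long count = 0;
        try (PipedInputStream in = pipeIn) {
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                count += n; // placeholder for real processing
            }
        }
        writer.join();
        return count;
    }

    public static void main(String[] args) throws IOException, InterruptedException {
        System.out.println(countBytes(new ByteArrayInputStream(new byte[100_000])));
    }
}
```

Every byte is copied into and out of the pipe's internal buffer, which is the overhead the question's author was worried about.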
