
I'm working on a Java program that reads a file in blocks of varying, initially unknown size. That is, the blocks are not all the same size, and each block's size is only determined as the data is being read. For I/O I'm using a MappedByteBuffer (the input files are on the order of MB).

My goal:

  • Find an efficient way to store each complete block during the input phase so that I can process it.

My constraints:

  • I am reading one byte at a time from the buffer
  • My processing method takes a primitive byte array as input
  • Each block gets processed before the next block is read

What I've tried:

  • Dynamic structures like Lists, but they don't expose a primitive backing array, and the conversion time to a primitive array concerns me
  • Storing each block in a String and then calling getBytes() to get the byte[], but that is very slow
  • Reading the file multiple times: once to find each block's size, and then again to grab the relevant bytes

I am trying to find an approach that doesn't defeat the purpose of fast I/O. Any advice would be greatly appreciated.

Additional Info:

  • I'm using a rolling hash to decide where blocks should end

Here's a bit of pseudo-code:

circular_buffer[] = read first 128 bytes
rolling_hash = hash(circular_buffer[])
block_storage = ??? // this is the data structure I'd like to use
while file has more bytes
    b = next byte
    add b to block_storage
    add b to next index in circular_buffer (if reached end, start adding/overwriting front)
    shift rolling_hash one byte to the right
    if hash has a certain characteristic
        process block_storage as a byte[] // should contain entire block of data
        clear block_storage // start collecting the next block

As you can see, I'm reading one byte at a time, and storing/overwriting that one byte repeatedly. However, once I get to the processing stage, I want to be able to access all of the info in the block. There is no predetermined max size of a block either, so I can't pre-allocate.

marcman
  • You are reading blocks as MappedByteBuffer. Each block gets processed before the next block is read. You want to store the blocks so that they can be processed. OK. But aren't they already "stored" when you have them as a MappedByteBuffer? The intention is unclear to me; maybe some (pseudo-)code that shows how you would like to use this data structure would be helpful... – Marco13 Mar 17 '14 at 10:56

1 Answer


It seems to me that you require a dynamically growing buffer. You can use the built-in ByteArrayOutputStream to achieve that. It grows automatically to hold all data written to it. You can use write(int b) to realize "add b to block_storage" and toByteArray() to realize "process block_storage as a byte[]".
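As a minimal sketch of that idea (not the asker's actual code): the class name BlockCollector, the method collect, and the isBoundary check are hypothetical stand-ins, with isBoundary replacing the rolling-hash test from the question. Since MappedByteBuffer is a subclass of ByteBuffer, the method accepts any ByteBuffer.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.util.function.Consumer;

class BlockCollector {
    // Hypothetical boundary test; a real program would check the rolling hash here.
    static boolean isBoundary(byte b) {
        return b == 0;
    }

    // Reads one byte at a time and hands each complete block to 'process' as a byte[].
    static void collect(ByteBuffer in, Consumer<byte[]> process) {
        ByteArrayOutputStream block = new ByteArrayOutputStream();
        while (in.hasRemaining()) {
            byte b = in.get();
            block.write(b);                          // "add b to block_storage"
            if (isBoundary(b)) {
                process.accept(block.toByteArray()); // "process block_storage as a byte[]"
                block = new ByteArrayOutputStream(); // start a fresh, empty block
            }
        }
        if (block.size() > 0) {
            process.accept(block.toByteArray());     // trailing bytes after the last boundary
        }
    }
}
```

Note that toByteArray() returns a copy of the bytes written so far, so the processing method can keep or modify the array without affecting the stream.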

But take care - this stream will grow unbounded. You should implement some sanity checks around it to avoid using up all memory (e.g. count the bytes written to it and throw an exception when the count exceeds a reasonable amount). Also make sure to close and throw away the reference to a stream after consuming the block, to allow the GC to free up memory.

edit: As @marcman pointed out, the buffer can be reset().
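A hedged sketch of how the size check and reset() might fit together: BoundedBlockReader, readBlocks, the maxBlockBytes parameter, and the newline-based isBoundary are all illustrative assumptions, not part of the original question. The stream's own size() is used as the byte counter.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.util.function.Consumer;

class BoundedBlockReader {
    // Hypothetical boundary test standing in for the rolling-hash check.
    static boolean isBoundary(byte b) {
        return b == '\n';
    }

    // Collects blocks byte by byte, failing fast if a block exceeds maxBlockBytes.
    static void readBlocks(ByteBuffer in, int maxBlockBytes, Consumer<byte[]> process) {
        ByteArrayOutputStream block = new ByteArrayOutputStream();
        while (in.hasRemaining()) {
            byte b = in.get();
            if (block.size() >= maxBlockBytes) {
                // Sanity check: refuse to grow past the assumed cap.
                throw new IllegalStateException("block exceeds " + maxBlockBytes + " bytes");
            }
            block.write(b);
            if (isBoundary(b)) {
                process.accept(block.toByteArray());
                block.reset(); // discard the contents but reuse the same stream
            }
        }
        if (block.size() > 0) {
            process.accept(block.toByteArray()); // trailing bytes after the last boundary
        }
    }
}
```

Reusing one stream via reset() avoids allocating a new stream for every block; the cap itself is a judgment call that depends on how large a legitimate block can be.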

Pyranja
  • I will be clearing each block after processing. I imagine there's a way to clear the buffer? – marcman Mar 17 '14 at 13:58
  • I'm gonna try this. Seems reasonable. I didn't realize that streams were also buffers. I initially tried a straight-up ByteBuffer before realizing it's an abstract class. I'll update my post after testing. Thanks! – marcman Mar 17 '14 at 14:00
  • ByteArrayOutputStream is just one implementation of a stream, which writes to a byte array instead of a file/socket/whatever. For clearing, there is afaik no way to clear the baos, but I would just create a new (therefore empty) one and throw the old one away. The GC will free the used memory. – Pyranja Mar 17 '14 at 14:39
  • Works great! Thanks for the help! (Also there is baos.reset() which is suitable for clearing the stream) – marcman Mar 17 '14 at 19:04
  • Ah, good catch! Certainly more efficient than recreating it all the time. – Pyranja Mar 17 '14 at 21:02