I'm working on a Java program where I'm reading from a file in dynamic, unknown blocks. That is, each block of data will not always be the same size and the size is determined as data is being read. For I/O I'm using a MappedByteBuffer (the file inputs are on the order of MB).
My goal:
- Find an efficient way to store each complete block during the input phase so that I can process it.
My constraints:
- I am reading one byte at a time from the buffer
- My processing method takes a primitive byte array as input
- Each block gets processed before the next block is read
What I've tried:
- I played around with dynamic structures like Lists but they don't have backing arrays and the conversion time to a primitive array concerns me
- I also thought about using a String to store each block and then getBytes() to get the byte[], but it's so slow
- Reading the file multiple times in order to find the block size first, and then grab the relevant bytes
I am trying to find an approach that doesn't defeat the purpose of fast I/O. Any advice would be greatly appreciated.
Additional Info:
- I'm using a rolling hash to decide where blocks should end
Here's a bit of pseudo-code:
circular_buffer[] = read first 128 bytes
rolling_hash = hash(buffer[])
block_storage = ??? // this is the data structure I'd like to use
while file has more text
b = next byte
add b to block_storage
add b to next index in circular_buffer (if reached end, start adding/overwriting front)
shift rolling_hash one byte to the right
if hash has a certain characteristic
process block_storage as a byte[] //should contain entire block of data
As you can see, I'm reading one byte at a time, and storing/overwriting that one byte repeatedly. However, once I get to the processing stage, I want to be able to access all of the info in the block. There is no predetermined max size of a block either, so I can't pre-allocate.