I want to write a program that does multithreaded compression / decompression using .NET 3.5 and the GZipStream class.
The input files are very large (let's say hundreds of gigabytes).
I would like to achieve this without any intermediate files. Using intermediate files was my initial approach, but the requirements have changed.
I was thinking about the following approach and would like to verify whether it looks good on paper:
Read from the source file and split it into constant-sized chunks in memory.
Keep track of the number of worker threads, since memory is limited.
Each chunk is compressed in memory by a separate thread.
The compressed chunks are pushed into a queue in the proper order.
One thread reads from the queue and concatenates the chunks into the output file.
Also store metadata about the compressed chunks (for example, their compressed lengths) that will later be put into a header; I would like to use this for decompression. A rough sketch of this pipeline follows below.
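In case it helps to make the idea concrete, here is a minimal sketch of how such a pipeline could look on .NET 3.5 (no TPL, so plain threads, a Semaphore to bound the number of in-flight chunks, and a Monitor-guarded dictionary standing in for the ordered queue). The chunk size, worker count, class name, and the sidecar `.meta` file format are all my own assumptions, not a definitive design:

```csharp
// Sketch only: bounded producer/consumer pipeline for parallel GZip
// compression on .NET 3.5. ChunkSize and MaxWorkers are assumed values;
// tune them to your memory budget.
using System;
using System.Collections.Generic;
using System.IO;
using System.IO.Compression;
using System.Threading;

class ParallelGzipCompressor
{
    const int ChunkSize = 4 * 1024 * 1024; // assumed 4 MB chunks
    const int MaxWorkers = 4;              // bound on in-flight chunks

    public static void Compress(string inputPath, string outputPath)
    {
        var results = new Dictionary<int, byte[]>(); // finished chunks by index
        var resultLock = new object();
        var slots = new Semaphore(MaxWorkers, MaxWorkers); // caps memory use
        var lengths = new List<int>(); // compressed length of each chunk
        int chunkCount = 0;
        bool readingDone = false;

        // Writer thread: emits chunks strictly in order and records their
        // compressed lengths for the metadata file.
        var writer = new Thread(() =>
        {
            using (var output = File.Create(outputPath))
            {
                int next = 0;
                while (true)
                {
                    byte[] chunk = null;
                    lock (resultLock)
                    {
                        while (!results.TryGetValue(next, out chunk))
                        {
                            if (readingDone && next >= chunkCount) return;
                            Monitor.Wait(resultLock);
                        }
                        results.Remove(next);
                    }
                    lengths.Add(chunk.Length);
                    output.Write(chunk, 0, chunk.Length);
                    next++;
                    slots.Release(); // free a slot for the reader
                }
            }
        });
        writer.Start();

        using (var input = File.OpenRead(inputPath))
        {
            var buffer = new byte[ChunkSize];
            int read, index = 0;
            while ((read = input.Read(buffer, 0, ChunkSize)) > 0)
            {
                slots.WaitOne(); // block if too many chunks are in flight
                var data = new byte[read];
                Array.Copy(buffer, data, read);
                int myIndex = index++;
                ThreadPool.QueueUserWorkItem(_ =>
                {
                    // Compress one chunk as a complete, standalone gzip stream.
                    var ms = new MemoryStream();
                    using (var gz = new GZipStream(ms, CompressionMode.Compress))
                        gz.Write(data, 0, data.Length);
                    lock (resultLock)
                    {
                        results[myIndex] = ms.ToArray();
                        Monitor.PulseAll(resultLock); // wake the writer
                    }
                });
            }
            lock (resultLock)
            {
                chunkCount = index;
                readingDone = true;
                Monitor.PulseAll(resultLock);
            }
        }
        writer.Join();

        // Metadata: one compressed-chunk length per line in a sidecar file
        // (just one possible format; a header inside the output works too).
        using (var meta = new StreamWriter(outputPath + ".meta"))
            foreach (var len in lengths) meta.WriteLine(len);
    }
}
```

The semaphore is what enforces the "limited memory" requirement: at most MaxWorkers chunks (raw plus compressed) exist at any moment, and the writer returns a slot only after a chunk has been flushed to disk.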
Having done the above, my idea for multithreaded decompression would then be:
Read the metadata about the concatenated chunks.
Read the data from the compressed file in chunks whose boundaries are defined by the metadata.
Each chunk is decompressed in memory by a separate thread.
The decompressed chunks are added to a queue in the proper order.
One thread concatenates the decompressed chunks into a unified output file (see the sketch after this list).
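Again purely as a sketch, and assuming the same chunk-length sidecar file as in the compression example above: because each chunk was written as a complete gzip stream, GZipStream can decompress every chunk independently, so the decompression side can mirror the compression pipeline.

```csharp
// Sketch only: parallel decompression driven by the chunk-length metadata
// written during compression. File names and format are assumptions that
// match the compression sketch above.
using System;
using System.Collections.Generic;
using System.IO;
using System.IO.Compression;
using System.Threading;

class ParallelGzipDecompressor
{
    const int MaxWorkers = 4; // assumed bound on in-flight chunks

    public static void Decompress(string compressedPath, string outputPath)
    {
        // Read the per-chunk compressed lengths from the sidecar file.
        var lengths = new List<int>();
        foreach (var line in File.ReadAllLines(compressedPath + ".meta"))
            lengths.Add(int.Parse(line));

        var results = new Dictionary<int, byte[]>();
        var resultLock = new object();
        var slots = new Semaphore(MaxWorkers, MaxWorkers);

        // Writer thread: concatenates decompressed chunks strictly in order.
        var writer = new Thread(() =>
        {
            using (var output = File.Create(outputPath))
            {
                for (int next = 0; next < lengths.Count; next++)
                {
                    byte[] chunk;
                    lock (resultLock)
                    {
                        while (!results.TryGetValue(next, out chunk))
                            Monitor.Wait(resultLock);
                        results.Remove(next);
                    }
                    output.Write(chunk, 0, chunk.Length);
                    slots.Release();
                }
            }
        });
        writer.Start();

        using (var input = File.OpenRead(compressedPath))
        {
            for (int i = 0; i < lengths.Count; i++)
            {
                slots.WaitOne(); // respect the memory bound
                var compressed = new byte[lengths[i]];
                ReadFully(input, compressed);
                int myIndex = i;
                ThreadPool.QueueUserWorkItem(_ =>
                {
                    // Each chunk is a standalone gzip stream, so it can be
                    // decompressed without seeing any other chunk.
                    var ms = new MemoryStream();
                    using (var gz = new GZipStream(new MemoryStream(compressed),
                                                   CompressionMode.Decompress))
                    {
                        var buf = new byte[64 * 1024];
                        int n;
                        while ((n = gz.Read(buf, 0, buf.Length)) > 0)
                            ms.Write(buf, 0, n);
                    }
                    lock (resultLock)
                    {
                        results[myIndex] = ms.ToArray();
                        Monitor.PulseAll(resultLock);
                    }
                });
            }
        }
        writer.Join();
    }

    // Stream.Read may return fewer bytes than requested; loop until full.
    static void ReadFully(Stream s, byte[] buffer)
    {
        int offset = 0;
        while (offset < buffer.Length)
        {
            int n = s.Read(buffer, offset, buffer.Length - offset);
            if (n == 0) throw new EndOfStreamException();
            offset += n;
        }
    }
}
```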
Does the above seem plausible?