  • I want to write a program that does multithreaded compression / decompression using .NET 3.5 and the GZipStream class.

  • The input files are very large (let's say hundreds of gigabytes)

  • I would like to achieve this without any intermediate files. Using intermediate files was my initial approach, but the requirements have changed.

I was thinking about the following approach and would like to verify that it looks good on paper:

  1. Read from the source file and split it into constant-sized chunks in memory.

  2. Keep track of the number of threads, since memory is limited.

  3. Each chunk is compressed in memory by a separate thread.

  4. The compressed chunks are pushed into a queue in the proper order.

  5. One thread reads from the queue and concatenates the chunks into the output file.

  6. Also store some metadata about the compressed chunks (to be put into a header later); I would like to use this for decompression (a sketch of the whole pipeline follows this list).
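
To make the plan concrete, here is a minimal sketch of the compression pipeline. It assumes a fixed 16 MB chunk size, four worker threads, plain `Thread`s (for .NET 3.5 compatibility, no TPL), and a separate metadata file with one `offset,compressedLength` line per chunk. The chunk size, thread count, batch-at-a-time scheduling, and the `ChunkedGzipCompressor` name are all illustrative choices, not requirements:

```csharp
// Sketch only: chunked, multithreaded gzip compression (.NET 3.5-friendly).
using System;
using System.IO;
using System.IO.Compression;
using System.Threading;

static class ChunkedGzipCompressor
{
    const int ChunkSize = 16 * 1024 * 1024; // uncompressed bytes per chunk (assumption)
    const int WorkerCount = 4;              // bounded to keep memory use in check

    public static void Compress(string inputPath, string outputPath, string metadataPath)
    {
        using (FileStream input = File.OpenRead(inputPath))
        using (FileStream output = File.Create(outputPath))
        using (StreamWriter metadata = new StreamWriter(metadataPath))
        {
            byte[][] raw = new byte[WorkerCount][];
            byte[][] compressed = new byte[WorkerCount][];

            while (true)
            {
                // Step 1: read up to WorkerCount constant-sized chunks into memory.
                int chunksInBatch = 0;
                for (int i = 0; i < WorkerCount; i++)
                {
                    byte[] buffer = new byte[ChunkSize];
                    int read = 0;
                    while (read < ChunkSize)
                    {
                        int got = input.Read(buffer, read, ChunkSize - read);
                        if (got == 0) break;
                        read += got;
                    }
                    if (read == 0) break;
                    if (read < ChunkSize) Array.Resize(ref buffer, read);
                    raw[i] = buffer;
                    chunksInBatch++;
                }
                if (chunksInBatch == 0) break;

                // Steps 2-3: compress each chunk of the batch on its own thread,
                // each into its own MemoryStream-backed gzip member.
                Thread[] workers = new Thread[chunksInBatch];
                for (int i = 0; i < chunksInBatch; i++)
                {
                    int index = i; // per-thread copy of the loop variable
                    workers[i] = new Thread(delegate()
                    {
                        using (MemoryStream ms = new MemoryStream())
                        {
                            using (GZipStream gz = new GZipStream(ms, CompressionMode.Compress, true))
                                gz.Write(raw[index], 0, raw[index].Length);
                            compressed[index] = ms.ToArray();
                        }
                    });
                    workers[i].Start();
                }
                foreach (Thread t in workers) t.Join();

                // Steps 4-6: write the batch in order and record per-chunk metadata
                // (file offset + compressed length) for the decompression side.
                for (int i = 0; i < chunksInBatch; i++)
                {
                    metadata.WriteLine(output.Position + "," + compressed[i].Length);
                    output.Write(compressed[i], 0, compressed[i].Length);
                    raw[i] = null;
                    compressed[i] = null;
                }
            }
        }
    }
}
```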

Having done the above, my idea for multithreaded decompression would then be:

  1. Read metadata file about the concatenated chunks.

  2. Read the data from the compressed file in chunks, as defined by the metadata.

  3. Each chunk is decompressed in memory by a separate thread.

  4. The decompressed chunks are added to a queue in the proper order.

  5. There is a thread that concatenates the decompressed chunks into a unified output file (sketched below).
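
A matching sketch for the decompression side, assuming the same hypothetical one-line-per-chunk `offset,compressedLength` metadata format as in the compression sketch above; again, the names and the batch-at-a-time scheduling are illustrative:

```csharp
// Sketch only: multithreaded decompression driven by the per-chunk metadata.
using System;
using System.Collections.Generic;
using System.IO;
using System.IO.Compression;
using System.Threading;

static class ChunkedGzipDecompressor
{
    const int WorkerCount = 4;

    public static void Decompress(string compressedPath, string metadataPath, string outputPath)
    {
        // Step 1: read the metadata describing every chunk (offset, compressed length).
        List<long[]> chunks = new List<long[]>();
        foreach (string line in File.ReadAllLines(metadataPath))
        {
            if (line.Length == 0) continue;
            string[] parts = line.Split(',');
            chunks.Add(new long[] { long.Parse(parts[0]), long.Parse(parts[1]) });
        }

        using (FileStream input = File.OpenRead(compressedPath))
        using (FileStream output = File.Create(outputPath))
        {
            for (int start = 0; start < chunks.Count; start += WorkerCount)
            {
                int batch = Math.Min(WorkerCount, chunks.Count - start);
                byte[][] decompressed = new byte[batch][];
                Thread[] workers = new Thread[batch];

                // Steps 2-3: read each compressed chunk, then inflate it on its own thread.
                for (int i = 0; i < batch; i++)
                {
                    long[] meta = chunks[start + i];
                    byte[] member = new byte[meta[1]];
                    input.Seek(meta[0], SeekOrigin.Begin);
                    int read = 0;
                    while (read < member.Length)
                    {
                        int got = input.Read(member, read, member.Length - read);
                        if (got == 0) throw new EndOfStreamException("Truncated chunk");
                        read += got;
                    }

                    int index = i; // per-thread copies for the closure below
                    workers[i] = new Thread(delegate()
                    {
                        using (MemoryStream src = new MemoryStream(member))
                        using (GZipStream gz = new GZipStream(src, CompressionMode.Decompress))
                        using (MemoryStream dst = new MemoryStream())
                        {
                            byte[] buffer = new byte[64 * 1024];
                            int n;
                            while ((n = gz.Read(buffer, 0, buffer.Length)) > 0)
                                dst.Write(buffer, 0, n);
                            decompressed[index] = dst.ToArray();
                        }
                    });
                    workers[i].Start();
                }

                // Steps 4-5: wait for the batch, then append the chunks in their original order.
                foreach (Thread t in workers) t.Join();
                for (int i = 0; i < batch; i++)
                    output.Write(decompressed[i], 0, decompressed[i].Length);
            }
        }
    }
}
```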

Does the above seem plausible?

3 Answers


I don't think that GZip can be broken up this way. The whole stream depends on some token dictionary (a Huffman tree or a variation) at the start. As a hint, GZipStream.CanSeek always returns false.

So your point 3 would fail: the chunks are not independent.

What might work is to process 2 or even 3 files in parallel, depending on your I/O hardware. This is more suited to a fast SSD than to an older HDD; network I/O usually behaves like a slow HDD.

bommelding
  • Even if I open a separate MemoryStream for each of those threads and then apply GZipStream on top of it? – Radoslaw Jurewicz May 08 '18 at 13:29
  • Oh yes, I did miss points 5 and 6. That could work, but with less compression. You'd have to experiment a lot with the chunk size and the number of threads. – bommelding May 08 '18 at 13:33

Yes, when you treat every chunk as an independent item (it gets its own GZip stream), this should work. But it adds some overhead, so your overall compression will be a bit lower.

For each chunk you would need the size and the sequence number to deserialize and resequence.
The receiver would have to resequence anyway so you could skip that on the sender.

But it's hard to estimate how much you would gain by this: compression is somewhat CPU-intensive, but still much faster than most I/O devices.
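
If you would rather not keep a separate metadata file, one possible layout (my assumption, not something this answer prescribes) is to prefix each compressed chunk with exactly those two fields, the sequence number and the compressed size, so the reader can deserialize and resequence:

```csharp
// Sketch of inline framing: [sequence number][length][gzip member], repeated.
// The 4-byte fields are an illustrative choice.
using System.IO;

static class ChunkFraming
{
    // Writer side: length-prefix one already-compressed chunk.
    public static void WriteChunk(BinaryWriter writer, int sequenceNumber, byte[] compressedChunk)
    {
        writer.Write(sequenceNumber);          // 4 bytes
        writer.Write(compressedChunk.Length);  // 4 bytes
        writer.Write(compressedChunk);
    }

    // Reader side: recover the sequence number and the compressed bytes of one chunk.
    public static byte[] ReadChunk(BinaryReader reader, out int sequenceNumber)
    {
        sequenceNumber = reader.ReadInt32();
        int length = reader.ReadInt32();
        return reader.ReadBytes(length);
    }
}
```

The trade-off is that with framing bytes between the members, the output is no longer a plain concatenation of gzip members; the offset-only metadata approach described in the answer below preserves that property.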

bommelding
  • I would be fine with the overall compression being a bit lower. I am already operating on huge files and I care mostly about multithreading rather than saving a couple of percent of overall size here or there. – Radoslaw Jurewicz May 08 '18 at 14:06
  • Yes, but don't expect too much speedup from multithreading. The I/O is your bottleneck. – bommelding May 08 '18 at 14:08
  • I am afraid that I will be I/O bound, as you say. Would you perhaps have any better idea that does not involve intermediate files? I know that when I used the intermediate-file approach (i.e. I was creating a large number of chunks on the hard drive that I was later concatenating) I had a pretty hefty boost over single-threaded compression. – Radoslaw Jurewicz May 08 '18 at 14:10
  • "pretty hefty boost" - That could have had multiple causes - I can't quite figure it out. The scheme above here is still worth a try - do a simple implementation and measure. – bommelding May 08 '18 at 14:32
  • I measured the following 2 approaches: **Approach 1**: take the file, split it into chunks on the hard drive, compress each chunk on a separate thread, concatenate the chunks. **Approach 2**: compress the file in a single thread. Approach 1 took on average 50-60% of the time that Approach 2 needed. – Radoslaw Jurewicz May 09 '18 at 07:03
  • OK, good to know but can you mention the disk used? And you could post a self-answer here to document this for people with the same problem. – bommelding May 09 '18 at 07:27
  • It was a normal HDD, if I recall correctly a 500 GB SATA HDD. If I manage to implement what I planned I will definitely update the post. Thank you for the suggestion. – Radoslaw Jurewicz May 09 '18 at 08:23

Sure, that will work fine. As it happens, a concatenation of valid gzip files is also a valid gzip file. Each distinct decompressible stream is called a gzip member. Your metadata just needs the offset in the file for the start of each stream.
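
As an illustration, here is a minimal sketch of pulling a single member out of the concatenated file, given its offset from your metadata; the `ReadMember` name and buffer size are mine, and it relies on GZipStream (at least on .NET Framework) stopping after the first gzip member it decodes:

```csharp
// Sketch only: decompress one gzip member starting at a known byte offset
// inside a file that is a concatenation of gzip members.
using System.IO;
using System.IO.Compression;

static class GzipMemberReader
{
    public static byte[] ReadMember(string concatenatedGzipPath, long memberOffset)
    {
        using (FileStream file = File.OpenRead(concatenatedGzipPath))
        {
            file.Seek(memberOffset, SeekOrigin.Begin);
            using (GZipStream gz = new GZipStream(file, CompressionMode.Decompress))
            using (MemoryStream result = new MemoryStream())
            {
                byte[] buffer = new byte[64 * 1024];
                int n;
                while ((n = gz.Read(buffer, 0, buffer.Length)) > 0)
                    result.Write(buffer, 0, n);
                return result.ToArray();
            }
        }
    }
}
```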

The extra field of a gzip header is limited to 64K bytes, so if you store the per-chunk metadata there, that caps the number of chunks and may limit how small a chunk can be, e.g. forcing chunks on the order of tens to a hundred megabytes for files of hundreds of gigabytes. For another reason, I would recommend that your chunks of data to compress be at least several megabytes each anyway, in order to avoid a reduction in compression effectiveness.

A downside of concatenation is that you get no overall check on the integrity of the input. For example, if you mess up the order of the members somehow, this will not be detected on decompression, since each member's integrity check will pass regardless of the order. So you may want to include an overall check for the uncompressed data. An example would be the CRC of the entire uncompressed data, which can be computed from the CRCs of the members using zlib's crc32_combine().
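
zlib's crc32_combine() has no counterpart in the .NET 3.5 base class library, so one option (my assumption, not part of this answer) is to keep a single running CRC-32 over the uncompressed data, updated chunk by chunk in original order during compression and again after decompression, and compare the two values. A minimal table-driven CRC-32 sketch:

```csharp
// Sketch only: standard CRC-32 (same polynomial and semantics as zlib's crc32()),
// updated incrementally so it can be fed one uncompressed chunk at a time.
static class Crc32
{
    static readonly uint[] Table = BuildTable();

    static uint[] BuildTable()
    {
        uint[] table = new uint[256];
        for (uint i = 0; i < 256; i++)
        {
            uint c = i;
            for (int k = 0; k < 8; k++)
                c = (c & 1) != 0 ? 0xEDB88320u ^ (c >> 1) : c >> 1;
            table[i] = c;
        }
        return table;
    }

    // Start with crc = 0 and call Update once per chunk, in the original order.
    public static uint Update(uint crc, byte[] data, int count)
    {
        uint c = crc ^ 0xFFFFFFFFu;
        for (int i = 0; i < count; i++)
            c = Table[(byte)(c ^ data[i])] ^ (c >> 8);
        return c ^ 0xFFFFFFFFu;
    }
}
```

In a chunked pipeline like the one sketched in the question, the writer could call Crc32.Update on each raw chunk before compressing it and store the final value alongside the other metadata, and the reader could recompute it over the decompressed output to verify the whole file.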

I would be interested to know whether you get a significant speedup from parallel decompression in your case. Decompression is usually fast enough that it is I/O bound on the mass storage device being read from.

Mark Adler
  • Hey Mark, thank you for your help again! I will do my best. If I manage to implement this I will be happy to share the results here with you. – Radoslaw Jurewicz May 09 '18 at 07:06