14

Big file compression with python gives a very nice example of how to use e.g. bz2 to compress a very large set of files (or a single big file) purely in Python.

pigz says you can do better by exploiting parallel compression. To my knowledge (and from a Google search) I have so far not found a Python equivalent that does this in pure Python code.

Is there a parallel Python implementation for pigz or equivalent?

Virgil Gheorghiu
  • The compression modules from the standard library aren't *pure python*. If you look into them, you'll see that they're interfaces to shared libraries (which are written in C). – Roland Smith Mar 17 '17 at 22:31
  • And it's probably time to retire `gzip`. The new `zstd` compression is [generally faster](http://rsmith.home.xs4all.nl/miscellaneous/evaluating-zstandard-compression.html) than gzip and yields smaller compressed files. – Roland Smith Mar 17 '17 at 22:40
  • @RolandSmith: Of course, it doesn't have a Python interface either. It does seem faster than `gzip`, but there are many options for "compress faster". `gzip` sticks around at least in part thanks to compatibility concerns; you can decompress it on systems with 10+ year old hardware/software, and it's probably installed by default (`bz2` is almost as widespread, `xz` is getting there). For distributing data to many parties, portability and compression ratio are more important than speed. For transient compression, speed often beats compression ratio, so `lz4` or `lzo` might beat `zstd`. – ShadowRanger Mar 17 '17 at 22:58
  • Basically, if you aren't bound by compatibility constraints (you can assume they have the software, and some minimal amount of RAM), you'd distribute packaged data (compressed once, decompressed many times) `xz`-compressed, and use `lz4`/`snappy`/`lzo` for data that's being compressed on demand, where faster compression means the data gets there faster, with "good enough" compression. – ShadowRanger Mar 17 '17 at 23:03
  • @RolandSmith yes, that's true; what I meant was that they will be Python code and not e.g. a shell exec of some other binary on the filesystem. – Virgil Gheorghiu Mar 17 '17 at 23:29

3 Answers

8

mgzip is able to achieve this

It uses a block-indexed GZIP file format to enable parallel compression and decompression. The implementation uses the 'FEXTRA' field, defined in the official GZIP file format specification version 4.3, to record the index of each compressed member, so the output is fully compatible with normal GZIP implementations.

import mgzip

num_cpus = 0 # will use all available CPUs

with open('original_file.txt', 'rb') as original, mgzip.open(
    'gzipped_file.txt.gz', 'wb', thread=num_cpus, blocksize=2 * 10 ** 8
) as fw:
    fw.write(original.read())

I was able to speed up compression from 45 minutes to 5 minutes on a 72-CPU server.

Alfonso Embid-Desmet
  • Note that `mgzip` is deprecated in favor of the newer `pgzip` from the same creator; `mgzip` hasn't gotten updates since 2020. https://github.com/pgzip/pgzip – OrderFromChaos May 13 '22 at 23:32
6

I don't know of a pigz interface for Python off-hand, but it might not be that hard to write if you really need it. Python's zlib module allows compressing arbitrary chunks of bytes, and the pigz man page describes the system for parallelizing the compression and the output format already.

If you really need parallel compression, it should be possible to implement a pigz equivalent by using zlib to compress individual chunks, wrapping the per-chunk calls in multiprocessing.dummy.Pool.imap to parallelize the compression (multiprocessing.dummy is the thread-backed version of the multiprocessing API, so you wouldn't incur massive IPC costs sending chunks to and from the workers). Since zlib is one of the few built-in modules that releases the GIL during CPU-bound work, you might actually gain a benefit from thread-based parallelism.
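
For illustration, here is a minimal sketch of that idea. It takes the simpler route of writing each chunk as its own gzip member (gzip, gunzip and Python's gzip module all treat concatenated members as one stream) rather than reproducing pigz's exact single-member format; the chunk size, file names and compression level are just placeholders:

import gzip
from multiprocessing.dummy import Pool  # thread-backed Pool with the multiprocessing API

CHUNK_SIZE = 128 * 1024  # pigz's default block size is 128 kB

def read_chunks(path, size=CHUNK_SIZE):
    # Yield fixed-size chunks of the input file.
    with open(path, 'rb') as f:
        while True:
            chunk = f.read(size)
            if not chunk:
                return
            yield chunk

def compress_chunk(chunk):
    # gzip.compress does its work in zlib's C code, which releases the GIL,
    # so the worker threads really do compress in parallel.
    return gzip.compress(chunk, compresslevel=6)

with Pool() as pool, open('output.bin.gz', 'wb') as out:
    # imap preserves input order, so the members are written in sequence.
    for member in pool.imap(compress_chunk, read_chunks('input.bin')):
        out.write(member)

Two caveats: each member carries its own header, CRC and empty starting dictionary, so the ratio will be slightly worse than pigz's, and `imap` will read ahead of the compressors, so for very large inputs you may want to bound how far the reader gets ahead.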

Note that in practice, when the compression level isn't turned up very high, I/O often costs about as much (within an order of magnitude or so) as the actual zlib compression; if your data source can't actually feed the threads faster than they compress, you won't gain much from parallelizing.

ShadowRanger
  • You don't have to send the chunks to the workers. Just let each worker read its own chunks from the file. Or on UNIX you can create a memory mapped file for the input *before* creating the pool. The OS's virtual memory system will then do most of the heavy lifting to keep the pages of the input file in memory. – Roland Smith Mar 17 '17 at 22:36
  • @RolandSmith: True. I'm a big fan of `mmap` for all the things, and it looks like `zlib.compress` is buffer protocol friendly (that is, it can read from a `memoryview` of an `mmap` to avoid copying the data). You'd still want `imap` to coordinate the workers pulling blocks and organize the output (since the size of the compressed block can't be guessed ahead of time, you may as well serialize the writes). – ShadowRanger Mar 17 '17 at 22:43
  • As for coordination, I would just create a list of byte offsets for the start of each 128 kB block and `imap` over that. As for the output, I would probably have each compressed block written to a temporary output file and concatenate them later. Or maybe try `mmap` for that as well. Passing it back to the parent process *feels* suboptimal. – Roland Smith Mar 17 '17 at 22:51 (a rough sketch of this offset scheme follows these comments)
  • @RolandSmith: That's why I suggested a thread pool, not a process pool. Passing it from a thread worker back to the main thread is fairly cheap, no copies involved. – ShadowRanger Mar 17 '17 at 23:04
  • It would definitely be interesting to see which of the two approaches is fastest. :-) – Roland Smith Mar 17 '17 at 23:10
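
A rough sketch of the mmap/offset scheme discussed in these comments, reusing the thread pool from the answer above (the block size and file name are placeholders, and each worker here compresses its block as an independent zlib stream, so the results would still have to be stitched into a valid output format, e.g. as described in the next answer):

import mmap
import zlib
from multiprocessing.dummy import Pool

BLOCK = 128 * 1024  # 128 kB blocks, as suggested above

with open('input.bin', 'rb') as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    view = memoryview(mm)

    def compress_at(offset):
        # zlib.compress accepts any buffer-protocol object, so this slice is a
        # zero-copy window into the mapped file; no chunk data is sent to the workers.
        return zlib.compress(view[offset:offset + BLOCK], 6)

    with Pool() as pool:
        compressed = list(pool.imap(compress_at, range(0, len(mm), BLOCK)))

    view.release()
    mm.close()

Because the pool is thread-backed, the closure over `view` needs no pickling and the compressed blocks come back to the parent thread without any copying across process boundaries.
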
2

You can use the flush() operation with Z_SYNC_FLUSH to complete the last deflate block and end it on a byte boundary. You can concatenate those to make a valid deflate stream, so long as the last one you concatenate is flushed with Z_FINISH (which is the default for flush()).
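
A small sketch of what that looks like with zlib's raw deflate mode (the two-block split is just for illustration; in a real implementation each block would be compressed by a separate worker):

import zlib

def deflate_block(data, last):
    # Raw deflate (wbits=-15): no zlib/gzip wrapper, so pieces concatenate cleanly.
    co = zlib.compressobj(6, zlib.DEFLATED, -15)
    out = co.compress(data)
    # Z_SYNC_FLUSH ends the piece on a byte boundary without setting the
    # final-block bit; only the very last piece is finished with Z_FINISH.
    return out + co.flush(zlib.Z_FINISH if last else zlib.Z_SYNC_FLUSH)

blocks = [b'hello ' * 1000, b'world ' * 1000]  # stand-ins for file chunks
pieces = [deflate_block(b, last=(i == len(blocks) - 1)) for i, b in enumerate(blocks)]

# The concatenation is a single valid deflate stream:
assert zlib.decompress(b''.join(pieces), -15) == b''.join(blocks)

To turn that into a gzip file you would still write the gzip header in front and append a trailer with the CRC-32 of the uncompressed data and its length modulo 2**32, which is where the next paragraph comes in.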

You would also want to compute the CRC-32 in parallel (whether for zip or gzip -- I think you really mean parallel gzip compression). Python does not provide an interface to zlib's crc32_combine() function. However you can copy the code from zlib and convert it to Python. It will be fast enough that way, since it doesn't need to be run often. Also you can pre-build the tables you need to make it faster, or even pre-build a matrix for a fixed block length.
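
For reference, here is a hand translation of zlib's crc32_combine() along those lines; treat it as a sketch to check against zlib rather than a drop-in, with the final assert as a quick sanity test:

import zlib

def _gf2_matrix_times(mat, vec):
    # Multiply a 32x32 GF(2) matrix (stored as 32 column values) by a 32-bit vector.
    total = 0
    i = 0
    while vec:
        if vec & 1:
            total ^= mat[i]
        vec >>= 1
        i += 1
    return total

def _gf2_matrix_square(mat):
    return [_gf2_matrix_times(mat, mat[n]) for n in range(32)]

def crc32_combine(crc1, crc2, len2):
    # CRC-32 of the concatenation A+B, given crc32(A), crc32(B) and len(B).
    if len2 <= 0:
        return crc1
    # Operator for one zero bit: bit 0 folds in the CRC-32 polynomial,
    # every other bit just shifts down by one.
    odd = [0xEDB88320] + [1 << n for n in range(31)]
    even = _gf2_matrix_square(odd)   # two zero bits
    odd = _gf2_matrix_square(even)   # four zero bits
    while True:
        even = _gf2_matrix_square(odd)   # first pass: eight zero bits = one zero byte
        if len2 & 1:
            crc1 = _gf2_matrix_times(even, crc1)
        len2 >>= 1
        if not len2:
            break
        odd = _gf2_matrix_square(even)
        if len2 & 1:
            crc1 = _gf2_matrix_times(odd, crc1)
        len2 >>= 1
        if not len2:
            break
    return crc1 ^ crc2

# Quick sanity check against zlib's own crc32:
a, b = b'spam' * 1000, b'eggs' * 2000
assert crc32_combine(zlib.crc32(a), zlib.crc32(b), len(b)) == zlib.crc32(a + b)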

Mark Adler