
I am looking to do, in Python 3.8, the equivalent of:

xz --decompress --stdout < hugefile.xz > hugefile.out

where neither the input nor output might fit well in memory.

As I read the documentation at https://docs.python.org/3/library/lzma.html#lzma.LZMADecompressor, I should be able to use LZMADecompressor to process incrementally available input, and use its decompress() function to produce output incrementally.

However, it seems that LZMADecompressor puts its entire decompressed output into a single memory buffer, and that decompress() reads its entire compressed input from a single input memory buffer.

Granted, the documentation confuses me as to when the input and/or output can be incremental.

So I figure I will have to spawn a separate child process to execute the "xz" binary.

Is there any way of using the lzma Python module for this task?
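The subprocess fallback mentioned above can be sketched as follows (the helper name is mine, and this assumes the xz binary is installed and on PATH):

```python
import subprocess

def xz_decompress_via_subprocess(src_path, dst_path):
    # Hypothetical helper: stream src_path through the external "xz"
    # binary into dst_path. Data flows through file descriptors owned
    # by the child process, so neither file is held in memory.
    with open(src_path, "rb") as fin, open(dst_path, "wb") as fout:
        subprocess.run(["xz", "--decompress", "--stdout"],
                       stdin=fin, stdout=fout, check=True)
```

check=True makes a non-zero exit status from xz raise CalledProcessError instead of failing silently.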

  • Apart from potential portability problems (i.e., **xz** may not be installed) I'd be inclined to execute a subprocess. If the lzma module had a clearly defined streaming mechanism then that would be fine but that doesn't seem to be the case –  Sep 03 '21 at 08:15
  • Yes, I suspect you're right. Well said. – ThePythonicCow Sep 03 '21 at 16:22

1 Answer


Instead of using the low-level LZMADecompressor, use lzma.open to get a file object. Then you can copy data into another file object with the shutil module:

import lzma
import shutil

with lzma.open("hugefile.xz", "rb") as fsrc:
    with open("hugefile.out", "wb") as fdst:
        shutil.copyfileobj(fsrc, fdst)

Internally, shutil.copyfileobj reads and writes data in chunks, and the LZMA decompression is done on the fly. This avoids decompressing the whole data into memory.
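For completeness, the low-level LZMADecompressor can also be driven incrementally: decompress() accepts compressed input in chunks, and its max_length argument caps how much output a single call may produce. A rough sketch (the helper name and chunk size are mine):

```python
import lzma

def decompress_stream(fsrc, fdst, chunk_size=64 * 1024):
    # Sketch of incremental decompression: feed compressed bytes in
    # bounded chunks and cap each call's output with max_length, so
    # neither the input nor the output is buffered whole in memory.
    dec = lzma.LZMADecompressor()
    while not dec.eof:
        if dec.needs_input:
            data = fsrc.read(chunk_size)
            if not data:
                raise EOFError("compressed stream ended unexpectedly")
        else:
            # Unconsumed input is buffered inside the decompressor;
            # drain it before reading more from fsrc.
            data = b""
        fdst.write(dec.decompress(data, max_length=chunk_size))
```

This is essentially what the lzma.open file object does for you, so the shutil.copyfileobj answer above is the simpler choice in practice.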

rogdham