Append a folder to gzip in memory using python

Question

I have a tar.gz file downloaded from S3. I load it into memory and want to add a folder to it, then eventually write it to another S3 bucket.
I've been trying different approaches:

from io import BytesIO
import gzip
import tarfile
buffer = BytesIO(zip_obj.get()["Body"].read())
im_memory_tar = tarfile.open(buffer, mode='a')

The above raises the error: ReadError: invalid header.

With the below approach:

im_memory_tar = tarfile.open(fileobj=buffer, mode='a')
im_memory_tar.add(name='code_1', arcname='code')

The content seems to be overwritten.
Do you know a good solution to append a folder into a tar.gz file?
Thanks.

score 1 · Accepted Answer · answered Jan 08 '21 at 13:13

This is explained very well in the question how-to-append-a-file-to-a-tar-file-use-python-tarfile-module. The relevant note from the tarfile documentation:

Note that 'a:gz' or 'a:bz2' is not possible. If mode is not suitable to open a certain (compressed) file for reading, ReadError is raised. Use mode 'r' to avoid this. If a compression method is not supported, CompressionError is raised.
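The note above can be checked with a small in-memory experiment. This is only a sketch: `try_open` and the one-file archive are hypothetical stand-ins for the S3 download, and the helper simply reports the name of whatever exception the standard-library tarfile module raises.

```python
import io
import tarfile

# Build a tiny .tar.gz in memory to stand in for the downloaded file.
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w:gz") as tar:
    info = tarfile.TarInfo(name="hello.txt")
    info.size = 5
    tar.addfile(info, io.BytesIO(b"hello"))
targz_bytes = buf.getvalue()


def try_open(mode):
    """Return the member names, or the exception class name if opening fails."""
    try:
        with tarfile.open(fileobj=io.BytesIO(targz_bytes), mode=mode) as tar:
            return tar.getnames()
    except (tarfile.TarError, ValueError) as exc:
        return type(exc).__name__


read_result = try_open("r")       # transparent compression detection
append_result = try_open("a")     # compressed bytes are not a valid tar header
append_gz_result = try_open("a:gz")  # append plus compression is rejected

print(read_result, append_result, append_gz_result)
```

Mode 'r' lists the member, while both append attempts fail, which matches the ReadError seen in the question.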

score 1 · Answer 2 · answered Jan 08 '21 at 19:15

First we need to consider how to append to a tar file. Let's set aside the compression for a moment.

A tar file is terminated by two 512-byte blocks of all zeros. To add more entries, you need to remove or overwrite those 1024 bytes at the end. If you then append another tar file there, or start writing a new tar file there, you will have a single tar file with all of the entries of the original two.
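As a rough illustration of the paragraph above, the following sketch joins two uncompressed tar archives in memory by cutting the first one just before its zero blocks. The helpers `make_tar` and `end_of_entries` are hypothetical names, and the offset calculation relies on the `offset_data` attribute that tarfile sets on each TarInfo it reads. Note that Python's tarfile also pads archives with extra zeros up to a 10240-byte record, so the trailer may be longer than 1024 bytes.

```python
import io
import tarfile

BLOCK = 512  # tar works in 512-byte blocks


def make_tar(name, data):
    """Build an uncompressed one-file tar archive in memory (hypothetical helper)."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        info = tarfile.TarInfo(name=name)
        info.size = len(data)
        tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def end_of_entries(tar_bytes):
    """Offset just past the last entry, i.e. where the zero blocks begin."""
    end = 0
    with tarfile.open(fileobj=io.BytesIO(tar_bytes), mode="r:") as tar:
        for member in tar:
            # File data is padded up to a whole number of blocks.
            padded = (member.size + BLOCK - 1) // BLOCK * BLOCK
            end = max(end, member.offset_data + padded)
    return end


first = make_tar("a.txt", b"alpha")
second = make_tar("b.txt", b"beta")

cut = end_of_entries(first)
trailer = first[cut:]          # zero blocks, plus any record padding
merged = first[:cut] + second  # overwrite the terminator with the second archive

with tarfile.open(fileobj=io.BytesIO(merged), mode="r:") as tar:
    merged_names = tar.getnames()

print(merged_names)
```

For an uncompressed archive, opening with mode 'a' does the equivalent seek for you; the manual version is shown only to make the terminator visible.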

Now we return to the tar.gz. You can simply decompress the entire .gz file, do that append as above, and then recompress the whole thing.
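A minimal sketch of that decompress, append, recompress approach, done entirely in memory. The function name `append_to_targz` and the `{arcname: bytes}` input format are assumptions made for this example; it copies every existing entry into a fresh 'w:gz' archive and then adds the new ones.

```python
import io
import tarfile


def append_to_targz(targz_bytes, new_files):
    """Return new .tar.gz bytes: the original entries plus new_files ({arcname: bytes})."""
    out = io.BytesIO()
    with tarfile.open(fileobj=io.BytesIO(targz_bytes), mode="r:gz") as src:
        with tarfile.open(fileobj=out, mode="w:gz") as dst:
            # Copy the existing entries; only regular files carry data.
            for member in src:
                data = src.extractfile(member) if member.isfile() else None
                dst.addfile(member, data)
            # Add the new entries from memory.
            for arcname, payload in new_files.items():
                info = tarfile.TarInfo(name=arcname)
                info.size = len(payload)
                dst.addfile(info, io.BytesIO(payload))
    return out.getvalue()


# Usage: build a small archive, then append a file under a "code/" prefix.
seed = io.BytesIO()
with tarfile.open(fileobj=seed, mode="w:gz") as tar:
    info = tarfile.TarInfo(name="existing.txt")
    info.size = 3
    tar.addfile(info, io.BytesIO(b"old"))

result = append_to_targz(seed.getvalue(), {"code/main.py": b"print('hi')\n"})

with tarfile.open(fileobj=io.BytesIO(result), mode="r:gz") as tar:
    result_names = tar.getnames()
    new_content = tar.extractfile("code/main.py").read()

print(result_names)
```

To add a folder that exists on disk, as in the question, the inner loop over `new_files` could be replaced by `dst.add('code_1', arcname='code')`. The resulting bytes can then be uploaded to the destination bucket.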

Avoiding the decompression and recompression is rather more difficult, since we'd have to somehow remove those last 1024 bytes of zeros from the end of the compressed stream. It is possible, but you would need some knowledge of the internals of a deflate compressed stream.

A deflate stream consists of a series of compressed data "blocks", which are each an arbitrary number of bits long. You would need to decompress, without writing out the result, until you get to the block containing the last 1024 bytes. You would need to save the decompressed result of that and any subsequent blocks, and at what bit in the stream that block started. Then you could recompress that data, sans the last 1024 bytes, starting at that bit position.

Complete the compression, and write out the gzip trailer with the 1024 zeros removed from the CRC and length. (There is a way to back out zeros from the CRC.) Now you have a complete gzip stream for the previous .tar.gz file, but with the last 1024 bytes of zeros removed.

Since the concatenation of two gzip streams is itself a valid gzip stream, you can now concatenate the second .tar.gz file directly or start writing a new .tar.gz stream there. You now have a single, valid .tar.gz stream with the entries from the two original sources.
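The concatenation property can be seen with the standard gzip module. The sketch below also shows what happens when two complete .tar.gz files are joined without removing the first archive's zero blocks: the gzip layer is valid, but a tar reader stops at the terminator unless told to skip it. `make_targz` and `list_names` are hypothetical helpers; `ignore_zeros` is tarfile's documented option for skipping empty blocks.

```python
import gzip
import io
import tarfile


def make_targz(name, data):
    """Build a one-file .tar.gz in memory (hypothetical helper)."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        info = tarfile.TarInfo(name=name)
        info.size = len(data)
        tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def list_names(targz_bytes, ignore_zeros=False):
    """List member names, optionally skipping zero blocks between archives."""
    with tarfile.open(fileobj=io.BytesIO(targz_bytes), mode="r:gz",
                      ignore_zeros=ignore_zeros) as tar:
        return tar.getnames()


# Two gzip streams back to back decompress as one.
joined_text = gzip.decompress(gzip.compress(b"hello ") + gzip.compress(b"world"))

# Two whole .tar.gz files back to back, terminator of the first left in place.
combined = make_targz("a.txt", b"alpha") + make_targz("b.txt", b"beta")
default_names = list_names(combined)
all_names = list_names(combined, ignore_zeros=True)

print(joined_text, default_names, all_names)
```

This is why the answer insists on removing the 1024 zero bytes first: without that step, readers that honor the terminator only see the entries of the first archive.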
