I'm trying to read and write tar.gz files in memory using Python. I have read the relevant Python docs and come up with the following minimal working example to demonstrate my issue.

import io
import tarfile

text = "This is a test."
file_name = "test.txt"

text_buffer = io.BytesIO()
text_buffer.write(text.encode(encoding="utf-8"))

tar_buffer = io.BytesIO()

# Start a tar file with the memory buffer as the "file".
with tarfile.open(fileobj=tar_buffer, mode="w:gz") as archive:   

    # We must create a TarInfo object for each file we put into the tar file.
    info = tarfile.TarInfo(file_name)
    text_buffer.seek(0, io.SEEK_END)
    info.size = text_buffer.tell()

    # We have to rewind the text buffer, as tarfile.addfile reads from the current position and doesn't do this for us.
    text_buffer.seek(0, io.SEEK_SET)

    # Add the text to the tarfile.
    archive.addfile(info, text_buffer)


with open("test.tar.gz", "wb") as f:
    f.write(tar_buffer.getvalue())

# The following command works fine.
# tar -zxvf test.tar.gz 

archive_contents = dict()

# Reopen the same memory buffer as a tar file, this time for reading.
with tarfile.open(fileobj=tar_buffer, mode="r:*") as archive:

    for entry in archive:
        entry_fd = archive.extractfile(entry.name)
        archive_contents[entry.name] = entry_fd.read().decode("utf-8")

The odd thing is that extracting the archive with the tar command works completely fine. I see a file test.txt containing the string "This is a test.".

However, the for entry in archive loop finishes immediately, as if there were no files in the archive, and archive.getmembers() returns an empty list.

One other odd issue: when I set mode="r:gz" when opening the byte stream, I get the following exception:

Exception has occurred: ReadError
empty file
tarfile.EmptyHeaderError: empty header

During handling of the above exception, another exception occurred:

  File ".../test.py", line 283, in <module>
    with tarfile.open(fileobj=tar_buffer, mode="r:gz") as archive:
tarfile.ReadError: empty file

I have also tried creating a test.tar.gz file using the tar command (assuming that there may be some issue in the way I was writing the tar file), but I get the same exception.

I must be missing something basic, but I can't seem to find any examples of this online.

    You need to reset the position of the buffer to the beginning before you can extract the files because after writing to the tar_buffer, its position is at the end of the file. Therefore, when you try to read from it, there are no files to extract. – Abdulmajeed Feb 24 '23 at 18:46
    You are amazing, so obvious in hindsight. Please post as answer so I can accept. – python-cat-1023 Feb 24 '23 at 18:51

1 Answer

You need to reset the position of the buffer to the beginning before you can extract the files. After writing to tar_buffer, its position is at the end of the data, so when tarfile tries to read from it there is nothing left to read and no files are found. Writing test.tar.gz still works because getvalue() returns the whole buffer regardless of the current position.

with open("test.tar.gz", "wb") as f:
    f.write(tar_buffer.getvalue())

# Rewind the buffer before opening it for reading.
tar_buffer.seek(0)

archive_contents = dict()
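
For completeness, here is a minimal end-to-end sketch of the round trip with the rewind in place. It reuses the names from the question; the payload variable, the position_after_write variable, and the isfile() check are my additions for illustration and not part of the original code.

```python
import io
import tarfile

text = "This is a test."
file_name = "test.txt"
payload = text.encode("utf-8")  # assumed helper name, not in the question

# Write a gzip-compressed tar archive into an in-memory buffer.
tar_buffer = io.BytesIO()
with tarfile.open(fileobj=tar_buffer, mode="w:gz") as archive:
    info = tarfile.TarInfo(file_name)
    info.size = len(payload)
    archive.addfile(info, io.BytesIO(payload))

# After writing, the buffer position sits at the end of the data.
# getvalue() ignores the position, but reads start from it.
position_after_write = tar_buffer.tell()

# Rewind so tarfile starts reading at the gzip header.
tar_buffer.seek(0)

archive_contents = {}
with tarfile.open(fileobj=tar_buffer, mode="r:gz") as archive:
    for entry in archive:
        # Skip directories and other non-regular members,
        # for which extractfile() returns None.
        if not entry.isfile():
            continue
        entry_fd = archive.extractfile(entry)
        archive_contents[entry.name] = entry_fd.read().decode("utf-8")

print(archive_contents)  # {'test.txt': 'This is a test.'}
```

With the seek(0) call removed, the same read would start at the end of the buffer, which matches the empty member list and the "empty file" ReadError described in the question.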
Abdulmajeed