
I'm trying to open a tar.gz file full of JSON files, extract the text from each one, and save the results to a new tar.gz. Here's my Python 3 code so far.

from get_clean_text import get_cleaned_text # my own module
import tarfile
import os
import json
from io import StringIO
from pathlib import Path


def make_clean_gzip(inzip):
    outzip = "extracted/clean-" + inzip
    with tarfile.open(inzip, 'r:gz') as infile, tarfile.open(outzip, 'w:gz') as outfile:
        jfiles = infile.getnames()
        for j in jfiles:
            dirtycase = json.loads(infile.extractfile(j).read().decode("utf-8"))
            cleaned = get_cleaned_text(dirtycase)
            newtarfile = tarfile.TarInfo(Path(j).stem + ".txt")
            fobj = StringIO()
            fobj.write(cleaned)
            newtarfile.size = fobj.tell()
            outfile.addfile(newtarfile, fobj)

However, this throws an OSError: unexpected end of data. (I've verified, incidentally, that all the strings I want to write are of non-zero length, and also verified that calling tell() on the file object returns the same value as calling len() on the string.)

I found this prior SO question, which suggested that the problem is that StringIO isn't encoded, so I swapped StringIO for BytesIO and changed the write to fobj.write(cleaned.encode("utf-8")), but this still throws the same error.

I also tried simply not setting the size on the TarInfo object, and that code ran, but created an archive with a bunch of empty files.

What am I missing? Thanks!

Paul Gowder

1 Answer


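The .addfile() method reads tarinfo.size bytes from the file object you give it, starting at its current position. That read returns nothing in this case, because after fobj.write() you're already at the end of the file. Try adding fobj.seek(0) just before the outfile.addfile(newtarfile, fobj) line. Keep the BytesIO version while you're at it: tarfile wants bytes, and the size has to be a byte count, which can differ from len() of the string for non-ASCII text.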
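
Here's a minimal sketch of what that looks like end to end. It works on in-memory archives so it can be run as-is; the function name clean_tar_gz, the clean parameter, and the "text" key are stand-ins I made up for illustration (your get_cleaned_text would go where clean is), so treat it as one way to do it rather than the definitive fix.

```python
import io
import json
import tarfile
from pathlib import PurePosixPath


def clean_tar_gz(src_bytes, clean=lambda case: case["text"]):
    """Return a new tar.gz (as bytes) holding one .txt member per JSON member.

    `clean` is a hypothetical stand-in for get_cleaned_text; the default
    assumes each JSON document has a "text" key, which is only an example.
    """
    out_buf = io.BytesIO()
    with tarfile.open(fileobj=io.BytesIO(src_bytes), mode="r:gz") as infile, \
            tarfile.open(fileobj=out_buf, mode="w:gz") as outfile:
        for member in infile.getmembers():
            if not member.isfile():
                continue  # skip directories and other non-file entries
            dirty = json.loads(infile.extractfile(member).read().decode("utf-8"))
            payload = clean(dirty).encode("utf-8")  # tarfile needs bytes

            info = tarfile.TarInfo(PurePosixPath(member.name).stem + ".txt")
            info.size = len(payload)  # size in bytes, not characters

            fobj = io.BytesIO()
            fobj.write(payload)
            fobj.seek(0)  # rewind so addfile() reads from the start
            outfile.addfile(info, fobj)
    return out_buf.getvalue()
```

You can also skip the write/seek pair entirely by passing io.BytesIO(payload) to addfile(), since a BytesIO built from initial bytes starts at position 0.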

jasonharper