I'm trying to open a tar.gz file full of JSON files, extract the text from each one, and save the results to a new tar.gz. Here's my code in Python 3 thus far.
from get_clean_text import get_cleaned_text  # my own module
import tarfile
import os
import json
from io import StringIO
from pathlib import Path

def make_clean_gzip(inzip):
    outzip = "extracted/clean-" + inzip
    with tarfile.open(inzip, 'r:gz') as infile, tarfile.open(outzip, 'w:gz') as outfile:
        jfiles = infile.getnames()
        for j in jfiles:
            dirtycase = json.loads(infile.extractfile(j).read().decode("utf-8"))
            cleaned = get_cleaned_text(dirtycase)
            newtarfile = tarfile.TarInfo(Path(j).stem + ".txt")
            fobj = StringIO()
            fobj.write(cleaned)
            newtarfile.size = fobj.tell()
            outfile.addfile(newtarfile, fobj)
However, this throws an OSError: unexpected end of data. (I've verified, incidentally, that all the strings I want to write are of non-zero length, and also verified that calling tell() on the file object returns the same value as calling len() on the string.)
I found this prior SO question, which suggested that the problem is that StringIO isn't encoded, so I replaced StringIO with BytesIO and wrote fobj.write(cleaned.encode("utf-8")) instead, but this still throws the same error.
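For reference, here is a minimal, self-contained sketch of that BytesIO attempt. The names add_text_member and reproduce are made up for this example, and it assumes a hard-coded string and an in-memory archive in place of my real get_cleaned_text output and files on disk.

```python
import io
import tarfile


def add_text_member(archive, name, text):
    """Add `text` to an open tarfile the same way the BytesIO attempt does."""
    data = text.encode("utf-8")
    info = tarfile.TarInfo(name)
    fobj = io.BytesIO()
    fobj.write(data)
    info.size = fobj.tell()  # same value as len(data)
    archive.addfile(info, fobj)


def reproduce():
    """Return the error message raised by the attempt, or None if it succeeds."""
    target = io.BytesIO()  # in-memory archive, so nothing is written to disk
    try:
        with tarfile.open(fileobj=target, mode="w:gz") as archive:
            # Placeholder text standing in for the output of get_cleaned_text.
            add_text_member(archive, "example.txt", "some cleaned text")
    except OSError as exc:
        return str(exc)
    return None


if __name__ == "__main__":
    print(reproduce())
```

Running this on its own gives me the same OSError as the full script, so the problem doesn't seem to depend on my JSON input or on get_cleaned_text.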
I also tried simply not setting the size on the TarInfo object, and that code ran, but created an archive with a bunch of empty files.
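The empty-files behavior can be checked with a small sketch like the one below. The function name member_sizes_without_size_set is hypothetical, and it again assumes an in-memory archive and placeholder text rather than my real data; it writes members without setting the size, then reads the archive back and reports the stored size of each member.

```python
import io
import tarfile


def member_sizes_without_size_set(texts):
    """Write each text without setting TarInfo.size, then report stored sizes."""
    target = io.BytesIO()  # in-memory archive instead of a file on disk
    with tarfile.open(fileobj=target, mode="w:gz") as archive:
        for name, text in texts.items():
            info = tarfile.TarInfo(name)  # info.size is left at its default of 0
            fobj = io.BytesIO()
            fobj.write(text.encode("utf-8"))
            archive.addfile(info, fobj)

    # Re-open the archive for reading and collect the size of every member.
    target.seek(0)
    with tarfile.open(fileobj=target, mode="r:gz") as archive:
        return {member.name: member.size for member in archive.getmembers()}


if __name__ == "__main__":
    print(member_sizes_without_size_set({"a.txt": "hello", "b.txt": "world"}))
```

Every member comes back with a size of 0, which matches the archive of empty files I got from the full script.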
What am I missing? Thanks!