Contents of a gzip file from AWS S3 in Python only returning null bytes

Question

AWS Comprehend has created a file called output.tar.gz in an S3 bucket.

I am trying to load this file into memory with Python and have tried the following:

import boto3
from io import BytesIO
import gzip

s3 = boto3.client("s3")
obj = s3.get_object(Bucket=BUCKET, Key=KEY)
mycontentzip = gzip.GzipFile(fileobj=BytesIO(obj['Body'].read())).read()
lines = mycontentzip.decode("utf-8")

I've also tried the solutions in this post, including the ones that no longer need BytesIO: Reading contents of a gzip file from a AWS S3 in Python

These solutions do return a test file that is not .gz, so I know I can connect to the S3 bucket correctly.

In all attempts, what is returned is a file that is only the following:

\x00\x00\x00\x00\x00\x00\x00\x00\x00
\x00\x00\x00\x00\x00\x00\x00\x00\x00
\x00\x00\x00\x00\x00\x...

I'm using Python 3.7.7 and Boto3 1.10.5.

I have also tried downloading the file manually from the AWS console. Strangely, the file unzips in macOS 10.15.6 as a `.jsonl` file. However, it opens fine in VS Code as JSON.

Has anyone else had trouble with this?

Thanks in advance for any ideas.

#----------------------------------------------

UPDATE

Thanks @AKX. Tarfile it is. The docs show that the tarfile module has a gzip read mode: https://docs.python.org/3/library/tarfile.html

import tarfile
from io import BytesIO
import boto3

s3 = boto3.resource("s3")
obj = s3.Object(BUCKET, KEY)
tar = tarfile.open(fileobj=BytesIO(obj.get()["Body"].read()), mode='r|gz')
tar.extractall('tmp_folder')

I tried to read the single file in the archive into memory, but it was easier to save it to disk and read it back. I'm only working with a small amount of data.

score 1 · Accepted Answer · answered Jul 29 '20 at 21:03

That's a tar.gz file, i.e. a tar archive that's been compressed with the gzip algorithm.

If you just read it with gzip.GzipFile(), you still have a binary tar archive you need to interpret.
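
To illustrate why the gzip-only approach seems to return mostly null bytes, here is a rough sketch that strips only the gzip layer from a small tar.gz. The archive is built in memory as a stand-in for the real output.tar.gz, so the helper make_sample_tar_gz, the member name "output", and the payload are all made up for the example.

```python
import gzip
import io
import tarfile


def make_sample_tar_gz() -> bytes:
    """Build a tiny tar.gz in memory (hypothetical stand-in for output.tar.gz)."""
    payload = b'{"Line": 0}\n'
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        info = tarfile.TarInfo(name="output")  # made-up member name
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))
    return buf.getvalue()


raw = make_sample_tar_gz()

# gzip removes only the compression layer; the result is still a tar stream.
inner = gzip.decompress(raw)

# A tar stream is a sequence of 512-byte blocks padded with NUL bytes.
print(len(inner) % 512)   # 0
# The tar header carries the "ustar" magic at byte offset 257.
print(inner[257:262])     # b'ustar'

# For a small payload, most of the stream is NUL padding.
nul_fraction = inner.count(b"\x00") / len(inner)
print(nul_fraction > 0.9)
```

Decoding that stream as UTF-8 text gives a short header followed by long runs of \x00, which matches what the question shows.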

Use the tarfile module to read it; tar archives, like zips, can contain multiple files, one of which is the .jsonl file you end up seeing.
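
As a minimal sketch of the tarfile approach without writing anything to disk, the helper below (read_tar_gz_members, a hypothetical name) takes the raw bytes of a tar.gz and returns the text of each regular file in it. It assumes the members are UTF-8 text. The sample archive and its member name "output" are made up so the example runs without S3 access.

```python
import io
import json
import tarfile


def read_tar_gz_members(raw: bytes) -> dict:
    """Return {member name: decoded text} for each regular file in a tar.gz."""
    contents = {}
    with tarfile.open(fileobj=io.BytesIO(raw), mode="r:gz") as tar:
        for member in tar.getmembers():
            if not member.isfile():
                continue  # skip directories, links, and other non-file entries
            extracted = tar.extractfile(member)
            contents[member.name] = extracted.read().decode("utf-8")
    return contents


def make_sample(payload: bytes, name: str = "output") -> bytes:
    """Build a one-member tar.gz in memory (stand-in for the S3 object body)."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        info = tarfile.TarInfo(name=name)
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))
    return buf.getvalue()


sample = make_sample(b'{"File": "a.txt"}\n{"File": "b.txt"}\n')
members = read_tar_gz_members(sample)

# Each non-empty line of a .jsonl file is one JSON document.
records = [json.loads(line) for line in members["output"].splitlines() if line]
print(records)  # [{'File': 'a.txt'}, {'File': 'b.txt'}]
```

With boto3, the raw bytes would come from the same call used in the question, for example read_tar_gz_members(s3.get_object(Bucket=BUCKET, Key=KEY)["Body"].read()).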
