AWS Comprehend has created a file called output.tar.gz
in an S3 bucket.
I am trying to load this file into memory with Python and have tried the following:
import boto3
from io import BytesIO
import gzip
s3 = boto3.client("s3")
obj = s3.get_object(Bucket=BUCKET, Key=KEY)
mycontentzip = gzip.GzipFile(fileobj=BytesIO(obj['Body'].read())).read()
lines = mycontentzip.decode("utf-8")
I've also tried the solutions in this post, including the one that no longer needs BytesIO: Reading contents of a gzip file from a AWS S3 in Python
I was able to use these solutions to return a test file that is not gzipped,
so I know I can connect to the S3 bucket correctly.
In all attempts, what is returned is only the following:
\x00\x00\x00\x00\x00\x00\x00\x00\x00
\x00\x00\x00\x00\x00\x00\x00\x00\x00
\x00\x00\x00\x00\x00\x00...
I'm using Python 3.7.7 and Boto3 1.10.5.
I have also tried downloading the file manually from the AWS console. Strangely, the file unzips in macOS 10.15.6 as a `.jsonl` file. However, it opens fine in VS Code as JSON.
Has anyone else had trouble with this?
Thanks in advance for any ideas.
#----------------------------------------------
UPDATE
Thanks @AKX. Tarfile it is. Found in the docs that there's a gzip read mode in the tarfile module: https://docs.python.org/3/library/tarfile.html
import tarfile
from io import BytesIO

import boto3

s3 = boto3.resource("s3")
obj = s3.Object(BUCKET, KEY)
tar = tarfile.open(fileobj=BytesIO(obj.get()["Body"].read()), mode='r|gz')
tar.extractall('tmp_folder')
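That also explains the null bytes from the first attempt: gunzipping a .tar.gz on its own yields the raw tar stream, whose 512-byte header and padding blocks are mostly zeros. A quick sanity check, in case anyone wants to verify this themselves (a sketch using the same BUCKET/KEY placeholders as above):

import gzip
from io import BytesIO

import boto3

s3 = boto3.client("s3")
obj = s3.get_object(Bucket=BUCKET, Key=KEY)

# gunzip only: this yields a tar stream, not the file's text
payload = gzip.GzipFile(fileobj=BytesIO(obj["Body"].read())).read()

# a ustar tar header stores its magic string at byte offset 257
print(payload[257:262])  # b'ustar' confirms it's a tar archive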
I tried to read the single file in the archive into memory, but it was easier to save it to disk and read it back. I'm working with a small amount of data anyway.
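For reference, if someone does want the in-memory route, something like this should work (a sketch assuming the archive holds exactly one regular file; the member name is read from the archive rather than hard-coded):

import tarfile
from io import BytesIO

import boto3

s3 = boto3.resource("s3")
obj = s3.Object(BUCKET, KEY)

# 'r:gz' (random access) instead of 'r|gz' (streaming); BytesIO is seekable
with tarfile.open(fileobj=BytesIO(obj.get()["Body"].read()), mode='r:gz') as tar:
    member = tar.getmembers()[0]            # the single file in the archive
    data = tar.extractfile(member).read()   # raw bytes of that member
    lines = data.decode("utf-8").splitlines()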