I'm attempting to decompress and read a .zst file from S3 programmatically (i.e. without downloading it and running a command-line decompressor on it).

Here's the code I'm running:

    import boto3
    import zstandard
    import os
    import io

    AWS_S3_BUCKET = os.getenv("AWS_S3_BUCKET")
    AWS_ACCESS_KEY_ID = os.getenv("AWS_ACCESS_KEY_ID")
    AWS_SECRET_ACCESS_KEY = os.getenv("AWS_SECRET_ACCESS_KEY")

    zstd = zstandard.ZstdDecompressor()

    s3_client = boto3.client(
        "s3",
        aws_access_key_id=AWS_ACCESS_KEY_ID,
        aws_secret_access_key=AWS_SECRET_ACCESS_KEY,
    )

    # Pass the bucket name from the environment, not the literal string 'AWS_S3_BUCKET'
    response = s3_client.get_object(Bucket=AWS_S3_BUCKET, Key="folder/example_file_name.zst")

    status = response.get("ResponseMetadata", {}).get("HTTPStatusCode")

    # decompressed = zstd.decompress(response.get("Body").read())
    ## OR
    # with zstd.stream_reader(io.BytesIO(response.get("Body").read())) as r:
    #    decompressed = r.read()

I've tried each of the two commented-out alternatives at the end (separated by the `## OR`). The first one tells me that it doesn't have any information about the length of the data, so I tried passing `max_output_size=number_from_file_metadata`, but it gives the same error:

ZstdError: error determining content size from frame header

The `with` statement version gives this error instead:

ZstdError: zstd decompress error: Unknown frame descriptor

As far as I can tell, the second error means that either the file isn't actually compressed with zstd, or it was compressed in "magicless" format and the decompressor isn't being told about that. I'm getting that from here: https://github.com/indygreg/python-zstandard/issues/79
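
One way to test the "not actually zstd" hypothesis is to look at the first few bytes of the downloaded object before handing it to the decompressor. The following is a minimal sketch, assuming the whole object fits in memory. `sniff_format` and `is_zstd_skippable` are hypothetical helper names, not part of python-zstandard or boto3. The zstd magic number (0xFD2FB528, stored little-endian) is taken from the Zstandard frame format; the other signatures are common formats included only for comparison.

```python
# Signature of a regular zstd frame: magic number 0xFD2FB528, little-endian on disk.
ZSTD_MAGIC = bytes.fromhex("28b52ffd")


def is_zstd_skippable(prefix: bytes) -> bool:
    """True if the bytes start a zstd skippable frame (magic 0x184D2A50 to 0x184D2A5F)."""
    return (
        len(prefix) >= 4
        and 0x50 <= prefix[0] <= 0x5F
        and prefix[1:4] == bytes.fromhex("2a4d18")
    )


def sniff_format(data: bytes) -> str:
    """Guess the container format of a payload from its leading bytes."""
    if data.startswith(ZSTD_MAGIC):
        return "zstd frame"
    if is_zstd_skippable(data[:4]):
        return "zstd skippable frame"
    if data.startswith(b"\x1f\x8b"):
        return "gzip"
    if data.startswith(b"ARROW1"):
        return "Arrow/Feather v2 file"
    if data.startswith(b"PAR1"):
        return "Parquet file"
    # Nothing matched: report the first bytes so they can be pasted into a question.
    return "unknown (first bytes: " + data[:8].hex() + ")"


# Hypothetical usage with the S3 response from the code above:
#     payload = response["Body"].read()
#     print(len(payload), sniff_format(payload))
```

If this reports anything other than a zstd frame, the file is probably not a plain zstd stream despite its `.zst` extension, and the decompressor errors would be expected. A "magicless" frame would also show up as unknown here, since it omits the magic number by design.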

But it's really unclear, and seemingly not many people have had issues with this. Any help is very much appreciated.

John Rotenstein
Redcoatwright
  • You can replace the whole upper part of your example with the content of what `response.get('Body').read()` delivers in your [mcve]. I guess, it's either not what you expect or this has nothing to do with S3 and Boto. – Ulrich Eckhardt Jul 08 '21 at 05:36

1 Answer


It's not clear what you want to do after decompressing the file. If this helps, this is what I do to read a zstd-compressed Feather file into pandas:

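    import io
    import boto3
    import pandas as pd
    resp = boto3.client('s3').get_object(Bucket=bucket, Key=key)  # bucket, key: your S3 location
    data = io.BytesIO(resp['Body'].read())  # buffer the object so pandas gets a seekable file
    df = pd.read_feather(data)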

I ran into a lot of issues reading zstd files because I was writing the files incorrectly. The format is very particular about date formats and correct indexing when writing from a DataFrame.

All that said, this might not apply to your situation, because it's not clear what your files look like or what your end goal is.

Jonathan Leon