
I have a setup in AWS where a Python Lambda function proxies an S3 bucket containing .tar.gz files. I need to return the .tar.gz file from the Lambda back through the API to the user.

I do not want to untar the file; I want to return the tarball as-is, and it seems the tarfile module does not support reading it in as bytes.

I have tried Python's built-in open (which raises a UTF-8 codec error), and then codecs.open with errors set to both ignore and replace, which leads to the resulting file not being recognized as a .tar.gz.
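
For illustration (this snippet is not from the original post), the corruption can be reproduced with nothing but the gzip magic bytes: every .tar.gz begins with 0x1f 0x8b, and 0x8b is not valid UTF-8 on its own, so a lossy decode silently drops it.

magic = b'\x1f\x8b'                               # gzip magic bytes at the start of any .tar.gz
mangled = magic.decode('utf-8', errors='ignore')  # 0x8b is not valid UTF-8, so it is dropped
print(mangled.encode('utf-8'))                    # prints b'\x1f' -- no longer a gzip stream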

Implementation (tar binary pass-through)

try:
    data = client.get_object(Bucket=bucket, Key=key)
    headers['Content-Type'] = data['ContentType']
    if key.endswith('.tar.gz'):
        # Download the object to local disk first
        with open('/tmp/tmpfile', 'wb') as wbf:
            bucketobj.download_fileobj(key, wbf)
        # Re-read it through a text codec -- this decode step is where the archive gets mangled
        with codecs.open('/tmp/tmpfile', "rb", encoding='utf-8', errors='ignore') as fdata:
            body = fdata.read()
        headers['Content-Disposition'] = 'attachment; filename="{}"'.format(key.split('/')[-1])

Usage (package/AWS information redacted for security)

$ wget -v https://<apigfqdn>/release/simple/<package>/<package>-1.0.4.tar.gz
$ tar -xzf <package>-1.0.4.tar.gz 

gzip: stdin: not in gzip format
tar: Child returned status 1
tar: Error is not recoverable: exiting now
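
(A quick sanity check not shown in the original post: running `file <package>-1.0.4.tar.gz` on the downloaded artifact should report "gzip compressed data"; if it reports plain "data", the bytes were altered somewhere between S3 and the client.)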
  • Why not just generate an S3 pre-signed URL for that file and return that URL instead of trying to proxy the binary data through the Lambda function? – Mark B May 10 '18 at 17:55
  • @MarkB Because I wanted to leverage the client certificates available in API Gateway since that made the client-auth security easiest from an implementation standpoint for my end-use case. There are other ways to accomplish this (e.g. through user/pass auth on an EC2 instance w/ nginx), but I've been running with this solution for a while and want to see it through. Also the end use case is for pip so it expects to be able to download from a formatted url – asdf May 10 '18 at 17:56
  • 1
    You don't need to use the `tarfile` module at all since you don't want to untar the file. Just treat it as any other binary file, and use standard boto3 methods of reading binary data from S3. – Mark B May 10 '18 at 17:58
  • @MarkB My current implementation reads the file as binary from s3 then uses `codecs.open` with `errors='ignore'` when reading, but that ends up corrupting the tarfile. I believe it has something to do with python attempting to decode the file when reading, but I'm unsure and stuck. If I attempt to use built-in `open` and `read` it throws an encoding error – asdf May 10 '18 at 18:01
  • @MarkB Added my current implementation code to the question – asdf May 10 '18 at 18:03
  • If the file is large at all then that's hugely wasteful writing it to /tmp first. You should be streaming the file from S3 to the response without writing it to the local file system. Have you looked at the answer here to see how to return binary data from a Python Lambda function to API Gateway? You also need to setup the API Gateway correctly to deal with the binary response. – Mark B May 10 '18 at 18:08
  • @MarkB Yes I have my settings in the API gateway set to return binary for `*/*`. I'll look into streaming it back – asdf May 10 '18 at 18:16
  • *"I wanted to leverage the client certificates available in API Gateway since that made the client-auth security easiest"* Client certificates in API Gateway are not used to authenticate the client... they they are for authenticating the gateway to a back-end non-AWS service. It isn't at all clear what this means or how it's related. – Michael - sqlbot May 10 '18 at 22:44

0 Answers