44

I am trying to read some logs from a Hadoop process that I run in AWS. The logs are stored in an S3 folder and have the following path.

bucketname = name, key = y/z/stderr.gz. Here y is the cluster ID and z is a folder name; both act as folders (objects) in AWS, so the full path is like name/y/z/stderr.gz.

Now I want to unzip this .gz file and read its contents. I don't want to download the file to my system; I want to keep the contents in a Python variable.

This is what I have tried till now.

bucket_name = "name"
key = "y/z/stderr.gz"
obj = s3.Object(bucket_name,key)
n = obj.get()['Body'].read()

This gives me output that is not readable. I also tried

n = obj.get()['Body'].read().decode('utf-8')

which gives the error 'utf8' codec can't decode byte 0x8b in position 1: invalid start byte.

I have also tried

gzip = StringIO(obj)
gzipfile = gzip.GzipFile(fileobj=gzip)
content = gzipfile.read()

This returns an error IOError: Not a gzipped file

Not sure how to decode this .gz file.

Edit: Found a solution. I needed to pass n into it and use BytesIO:

gzip = BytesIO(n)
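
For reference, a minimal sketch of the complete flow with that fix (same placeholder bucket and key as above):

import boto3
import gzip
from io import BytesIO

s3 = boto3.resource('s3')
obj = s3.Object("name", "y/z/stderr.gz")
n = obj.get()['Body'].read()                 # compressed bytes from S3

gzipfile = gzip.GzipFile(fileobj=BytesIO(n))
content = gzipfile.read()                    # decompressed log contents, kept in a variable
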
Kshitij Marwah

7 Answers

42

This is old, but you no longer need the BytesIO object in the middle of it (at least with my boto3==1.9.223 and Python 3.7):

import boto3
import gzip

s3 = boto3.resource("s3")
obj = s3.Object("YOUR_BUCKET_NAME", "path/to/your_key.gz")
with gzip.GzipFile(fileobj=obj.get()["Body"]) as gzipfile:
    content = gzipfile.read()
print(content)
Kirk
    I tried this and I get `expected str, bytes or os.PathLike object, not StreamingBody` – szeitlin Dec 16 '21 at 19:23
    What python version, and what boto3 version? – Kirk Dec 17 '21 at 02:21
  • python 3.9 and boto3 1.18.26 – szeitlin Dec 17 '21 at 18:41
  • @szeitlin strange, not sure why you're getting that. works for me with the same versions as you. Note, be sure to use the `fileobj` kwarg since it takes `filename` by default. – Kirk Dec 17 '21 at 19:03
  • I ended up having an encoding issue also, so my final solution was more like what @Levi wrote below, i.e. I had to use BytesIO to wrap the result of reading the Body before I could do anything with gzip.GzipFile. – szeitlin Dec 17 '21 at 20:12
  • I had to do `obj = s3_client.get_object(Bucket="BUCKET_NAME", Key="KEY_PATH") gzipfile = gzip.GzipFile(fileobj=obj.get("Body"), mode='r') content = gzipfile.read() print(content)` – Quentin Del Oct 17 '22 at 15:48
21

@Amit, I was trying to do the same thing to test decoding a file, and got your code to run with some modifications. I just had to remove the function def, the return, and rename the gzip variable, since that name is already taken by the gzip module.

import json
import boto3
from io import BytesIO
import gzip

try:
    s3 = boto3.resource('s3')
    key = 'YOUR_FILE_NAME.gz'
    obj = s3.Object('YOUR_BUCKET_NAME', key)
    n = obj.get()['Body'].read()
    gzipfile = BytesIO(n)
    gzipfile = gzip.GzipFile(fileobj=gzipfile)
    content = gzipfile.read()
    print(content)
except Exception as e:
    print(e)
    raise e
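
If the decompressed data is text, it can then be decoded, e.g. content.decode('utf-8').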
Levi
10

You can use AWS S3 Select (SelectObjectContent) to read gzipped contents.

S3 Select is an Amazon S3 capability designed to pull out only the data you need from an object, which can dramatically improve the performance and reduce the cost of applications that need to access data in S3.

Amazon S3 Select works on objects stored in CSV, JSON, or Apache Parquet format, and it also supports objects compressed with GZIP or BZIP2 (for CSV and JSON objects only).

Ref: https://docs.aws.amazon.com/AmazonS3/latest/dev/selecting-content-from-objects.html

from io import StringIO
import boto3
import pandas as pd

bucket = 'my-bucket'
prefix = 'my-prefix'

client = boto3.client('s3')

for s3_object in client.list_objects_v2(Bucket=bucket, Prefix=prefix)['Contents']:
    if s3_object['Size'] <= 0:
        continue

    print(s3_object['Key'])
    r = client.select_object_content(
            Bucket=bucket,
            Key=s3_object['Key'],
            ExpressionType='SQL',
            Expression="select * from s3object",
            InputSerialization = {'CompressionType': 'GZIP', 'JSON': {'Type': 'DOCUMENT'}},
            OutputSerialization = {'CSV': {'QuoteFields': 'ASNEEDED', 'RecordDelimiter': '\n', 'FieldDelimiter': ',', 'QuoteCharacter': '"', 'QuoteEscapeCharacter': '"'}},
        )

    for event in r['Payload']:
        if 'Records' in event:
            records = event['Records']['Payload'].decode('utf-8')
            payloads = (''.join(r for r in records))
            try:
                select_df = pd.read_csv(StringIO(payloads), error_bad_lines=False)
                for row in select_df.iterrows():
                    print(row)
            except Exception as e:
                print(e)
rahulb
    Thanks for the answer. This is great. I like the fact that it gives you data in chucks. There is a tiny problem with your solution, I noticed that sometimes S3 Select split the rows with one half of the row coming at the end of one payload and the next half coming at the beginning of the next. It's not that hard to fix, but still something to be aware of – Vlad Jun 27 '19 at 05:36
  • Note that although it works for objects that are compressed with GZIP or BZIP2, this is for CSV and JSON objects only – tdc May 12 '20 at 13:00
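
Following up on the first comment above: rows can be split across Payload events, so a safer pattern is to buffer the bytes and only process complete lines. A minimal sketch (not part of the original answer), assuming the newline-delimited output configured above, where r is the select_object_content response:

buffer = b''
for event in r['Payload']:
    if 'Records' in event:
        buffer += event['Records']['Payload']
        *complete_rows, buffer = buffer.split(b'\n')   # keep the trailing partial row buffered
        for row in complete_rows:
            if row:
                print(row.decode('utf-8'))
if buffer:                                             # final row without a trailing newline, if any
    print(buffer.decode('utf-8'))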
1

Reading a .bz2 file from AWS S3 in Python:

import json
import boto3
from io import BytesIO
import bz2

try:
    s3 = boto3.resource('s3')
    key = 'key_name.bz2'
    obj = s3.Object('bucket_name', key)
    nn = obj.get()['Body'].read()
    bz2file = BytesIO(nn)
    content = bz2.decompress(bz2file.read())   # decompressed bytes
    content = content.split(b'\n')             # split into lines (still bytes)
    print(len(content))

except Exception as e:
    print(e)
1

I also got stuck reading the contents of gzipped CSV files from S3 and hit the same errors, but finally found a way to read a gzip.GzipFile and iterate through its rows with csv.reader:

import csv
import gzip
from io import StringIO

# bucket is a boto3 Bucket resource, e.g. boto3.resource("s3").Bucket("YOUR_BUCKET_NAME")
for obj in bucket.objects.filter(Prefix=folder_prefix):
    if obj.key.endswith(".gz"):
        with gzip.GzipFile(fileobj=obj.get()["Body"]) as gzipped_csv_file:
            csv_reader = csv.reader(StringIO(gzipped_csv_file.read().decode()))
            for line in csv_reader:
                process_line(line)
greenwd
0

Just like with variables, data can be kept as bytes in an in-memory buffer by using the io module's BytesIO.

Here is a sample program to demonstrate this:

import io

stream_str = io.BytesIO(b"JournalDev Python: \x00\x01")
print(stream_str.getvalue())

The getvalue() method returns the entire contents of the buffer (as bytes, in the case of BytesIO).

So, the @Jean-FrançoisFabre answer is correct, and you should use

gzip = BytesIO(n)

For more information read the following doc:

https://docs.python.org/3/library/io.html

Reza Mousavi
0

Currently the file can be read directly with pandas:

import pandas as pd
role = 'role name'
bucket = 'bucket name'
data_key = 'data key'
data_location = 's3://{}/{}'.format(bucket, data_key)
data = pd.read_csv(data_location,compression='gzip', header=0, sep=',', quotechar='"')
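
Note: reading an s3:// path with pandas like this relies on its optional S3 support, which requires the s3fs package to be installed.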