
I am trying to read a parquet file from AWS S3.

The same code works on my Windows machine.

A Google search produced no results.

Pandas should use fastparquet to build the dataframe; fastparquet is installed.
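
For what it's worth, the engine can also be pinned explicitly so pandas does not silently pick another one; a minimal check, using the same placeholder names as the code below:

import pandas as pd

# Force the fastparquet engine instead of letting pandas choose.
df = pd.read_parquet('s3://my_bucket_name/my_file_name', engine='fastparquet')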

Code:

import boto3
import pandas as pd


def get_parquet_from_s3(bucket_name, file_name):
    """
    Read a parquet file from S3 into a pandas dataframe and print its head.

    :param bucket_name: name of the S3 bucket
    :param file_name: key of the parquet file within the bucket
    :return: None
    """
    df = pd.read_parquet('s3://{}/{}'.format(bucket_name, file_name))
    print(df.head())

get_parquet_from_s3('my_bucket_name', 'my_file_name')

I get the exception below:

/home/ubuntu/.local/lib/python3.6/site-packages/numba/errors.py:131: UserWarning: Insufficiently recent colorama version found. Numba requires colorama >= 0.3.9
  warnings.warn(msg)
Traceback (most recent call last):
  File "test_pd_read_parq.py", line 15, in <module>
    get_parquet_from_s3('my_bucket_name','my_file_name')
  File "test_pd_read_parq.py", line 12, in get_parquet_from_s3
    df = pd.read_parquet('s3://{}/{}'.format(bucket_name, file_name))
  File "/home/ubuntu/.local/lib/python3.6/site-packages/pandas/io/parquet.py", line 294, in read_parquet
    return impl.read(path, columns=columns, **kwargs)
  File "/home/ubuntu/.local/lib/python3.6/site-packages/pandas/io/parquet.py", line 192, in read
    parquet_file = self.api.ParquetFile(path, open_with=s3.s3.open)
AttributeError: 'S3File' object has no attribute 's3'

Software & OS versions

python        : 3.6  
pandas        : 0.25.0
s3fs          : 0.3.1
ubuntu        : 18.04
fastparquet   : 0.3.1
boto3         : 1.9.198
botocore      : 1.12.198

The workaround

import s3fs
from fastparquet import ParquetFile

def get_parquet_from_s3(bucket_name, file_name):
    # Open the file through s3fs directly and let fastparquet build the dataframe.
    s3 = s3fs.S3FileSystem()
    pf = ParquetFile('{}/{}'.format(bucket_name, file_name), open_with=s3.open)
    return pf.to_pandas()
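
Called the same way as the original code:

df = get_parquet_from_s3('my_bucket_name', 'my_file_name')
print(df.head())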
  • I would guess that it's about versions of `boto3`, `botocore` and `fastparquet` - even if you have the newest, they may be conflicting (this happens to me a lot with fastparquet vs. botocore). – michcio1234 Jul 30 '19 at 10:43
  • Alternatively, you may try to open the file with `s3fs` and pass the file object to `pandas.read_parquet` (sketched below). – michcio1234 Jul 30 '19 at 10:43
  • I have updated the post with the workaround I found (thanks michcio1234). – balderman Jul 30 '19 at 12:43
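
A minimal sketch of michcio1234's alternative, assuming pd.read_parquet accepts an open file object (my_bucket_name/my_file_name are the same placeholders as above):

import pandas as pd
import s3fs

fs = s3fs.S3FileSystem()
# Hand pandas an already-open file object so it never touches its own S3 code path.
with fs.open('my_bucket_name/my_file_name', 'rb') as f:
    df = pd.read_parquet(f)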

2 Answers


For Python 3.6+, AWS has a library called aws-data-wrangler that helps with the integration between pandas, S3 and Parquet.

To install it, run:

pip install awswrangler

To read parquet from S3:

import awswrangler as wr
df = wr.pandas.read_parquet(path="s3://my-bucket/my/path/to/parquet-file.parquet")
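
Note that this is the awswrangler 0.x API; from awswrangler 1.0 onwards the pandas helpers live under the s3 module:

import awswrangler as wr

# awswrangler >= 1.0 moved read_parquet under wr.s3.
df = wr.s3.read_parquet(path="s3://my-bucket/my/path/to/parquet-file.parquet")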

You can use s3fs and PyArrow to read parquet files from S3 as shown below.

import s3fs
import pyarrow.parquet as pq

# Let pyarrow read the dataset through an s3fs filesystem object.
s3 = s3fs.S3FileSystem()

pandas_dataframe = pq.ParquetDataset(
    's3://bucket/file.parquet',
    filesystem=s3,
).read_pandas().to_pandas()
  • For simpler syntax, `.read_table()` ([link](https://arrow.apache.org/docs/python/generated/pyarrow.parquet.read_table.html)) can be used: `pq.read_table('s3://bucket/file.parquet', filesystem=s3).to_pandas()`. The name of a single file or a directory can be used. With this approach, the call to `.read_pandas()` can be excluded. – edesz Jun 20 '22 at 19:31
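
A self-contained version of that suggestion (same placeholder path as the answer):

import s3fs
import pyarrow.parquet as pq

s3 = s3fs.S3FileSystem()
# read_table accepts a single file or a directory, so read_pandas() is not needed.
df = pq.read_table('s3://bucket/file.parquet', filesystem=s3).to_pandas()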