
I have a parquet dataset stored in my S3 bucket with multiple partition files. I want to read it into a pandas DataFrame, but I'm now getting an ArrowInvalid error that I didn't get before.

Occasionally, this dataset is overwritten with a previous snapshot of the DataFrame, using code like the following:

import os

import pandas as pd  # version 1.3.4
# pyarrow version 5.0

df.to_parquet(
    f's3a://{bucket_and_prefix}',
    storage_options={
        "key"          : os.getenv("AWS_ACCESS_KEY_ID"),
        "secret"       : os.getenv("AWS_SECRET_ACCESS_KEY"),
        "client_kwargs": {
            'verify'      : os.getenv('AWS_CA_BUNDLE'),
            'endpoint_url': 'https://prd-data.company.com/'
        }
    },
    index=False
)

But when reading it with:

df = pd.read_parquet(
    f"s3a://{bucket_and_prefix}",
    storage_options={
        "key"          : os.getenv("AWS_ACCESS_KEY_ID"),
        "secret"       : os.getenv("AWS_SECRET_ACCESS_KEY"),
        "client_kwargs": {
            'verify'      : os.getenv('AWS_CA_BUNDLE'),
            'endpoint_url': 'https://prd-data.company.com/'
        }
    }
)

it fails with this error:

ArrowInvalid: GetFileInfo() yielded path 'bucket/folder/data.parquet/year=2021/month=2/abcde.parquet', which is outside base dir 's3://bucket/folder/data.parquet'

Any idea why this ArrowInvalid error happens and how I can read the parquet data into pandas?

Wassadamo
  • According to the pyarrow documentation, https://arrow.apache.org/docs/python/generated/pyarrow.parquet.read_table.html you need to pass a `filesystem` argument (typically an s3fs filesystem), otherwise it will use the local file system (which doesn't know about `s3://`). – 0x26res Apr 29 '22 at 07:35
  • @0x26res that is for `pyarrow.parquet.read_table`, but I'm using [pd.read_parquet](https://pandas.pydata.org/pandas-docs/version/1.3/reference/api/pandas.read_parquet.html) and I don't need to pass a file system. In fact I'm able to run the above for most parquet datasets in my S3 bucket. – Wassadamo May 02 '22 at 15:58
  • According to https://pandas.pydata.org/pandas-docs/version/1.3/reference/api/pandas.read_parquet.html `Any additional kwargs are passed to the engine` as `**kwargs` so you can pass an s3 file system as an argument and it will be passed to `pyarrow.parquet.read_table` – 0x26res May 03 '22 at 12:19
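A minimal sketch of the approach 0x26res describes in the comments above: build an s3fs filesystem explicitly and pass it through `pd.read_parquet`'s `**kwargs`, which the pyarrow engine forwards to `pyarrow.parquet.read_table` as its `filesystem` argument. This is untested against the bucket in question; the endpoint URL and environment-variable names are the same placeholders used in the question.

```python
import os

import pandas as pd


def read_partitioned_parquet(bucket_and_prefix: str) -> pd.DataFrame:
    """Read a partitioned parquet dataset with an explicit s3fs filesystem.

    Extra kwargs to pd.read_parquet are passed to the engine, so the
    pyarrow engine receives `filesystem` and hands it to read_table.
    """
    import s3fs  # imported lazily; assumed installed alongside pyarrow

    fs = s3fs.S3FileSystem(
        key=os.getenv("AWS_ACCESS_KEY_ID"),
        secret=os.getenv("AWS_SECRET_ACCESS_KEY"),
        client_kwargs={
            "verify": os.getenv("AWS_CA_BUNDLE"),
            "endpoint_url": "https://prd-data.company.com/",
        },
    )
    # With an explicit filesystem, pass the bare "bucket/prefix" path
    # (no s3:// or s3a:// scheme) so pyarrow resolves every file path
    # against the same base directory.
    return pd.read_parquet(bucket_and_prefix, filesystem=fs)
```

Whether this avoids the ArrowInvalid error may depend on the scheme used in the path: the error message shows the base dir recorded as `s3://...` while the code passes `s3a://...`, so keeping the path scheme-free (or consistently `s3://`) is part of the idea here.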

0 Answers