
I have a MinIO server hosted locally. I need to read a file from a MinIO S3 bucket with pandas using an S3 URL like "s3://dataset/wine-quality.csv" in a Jupyter notebook.

I tried the boto3 library and I am able to download the file:

import boto3

# endpoint_url must include the scheme (http:// for a local MinIO server);
# 'id' and 'password' are placeholder credentials
s3 = boto3.resource(
    's3',
    endpoint_url='http://localhost:9000',
    aws_access_key_id='id',
    aws_secret_access_key='password',
)
s3.Bucket('dataset').download_file('wine-quality.csv', '/tmp/wine-quality.csv')

But when I try the same with pandas,

data = pd.read_csv("s3://dataset/wine-quality.csv")

I get a 403 Forbidden client error. I know that pandas internally uses the boto3 library (correct me if I'm wrong).

PS: pandas read_csv has one more parameter, storage_options={"key": AWS_ACCESS_KEY_ID, "secret": AWS_SECRET_ACCESS_KEY, "token": AWS_SESSION_TOKEN}. But I couldn't find any configuration option for passing a custom MinIO host URL for pandas to read from.

veeresh patil

1 Answer


Pandas v1.2 onwards allows you to pass storage_options, which get passed down to fsspec; see the docs here: https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html?highlight=s3fs#reading-writing-remote-files.

To pass in a custom URL, you need to specify it through client_kwargs in storage_options:

df = pd.read_csv(
    "s3://dataset/wine-quality.csv",
    storage_options={
        "key": AWS_ACCESS_KEY_ID,
        "secret": AWS_SECRET_ACCESS_KEY,
        "token": AWS_SESSION_TOKEN,
        # the endpoint URL must include the scheme, e.g. http://
        "client_kwargs": {"endpoint_url": "http://localhost:9000"},
    },
)
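If you read from MinIO in several places, it can help to build the storage_options dict with a small helper so the endpoint and credentials live in one spot. This is just a sketch: minio_storage_options is a hypothetical name, and the credentials and endpoint in the usage comment are placeholders.

```python
def minio_storage_options(key, secret, endpoint_url, token=None):
    """Build the storage_options dict that pandas forwards to fsspec/s3fs.

    The custom endpoint goes under client_kwargs, which s3fs passes
    through to the underlying botocore client.
    """
    opts = {
        "key": key,
        "secret": secret,
        "client_kwargs": {"endpoint_url": endpoint_url},
    }
    if token is not None:
        opts["token"] = token
    return opts

# usage (requires s3fs installed; placeholder credentials):
# df = pd.read_csv(
#     "s3://dataset/wine-quality.csv",
#     storage_options=minio_storage_options("id", "password", "http://localhost:9000"),
# )
```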
Ashton Sidhu