
I am trying to use DuckDB with the HTTPFS extension to query around 1000 Parquet files with the same schema from an S3 bucket, all under a similar key prefix.

When I query a single file with DuckDB, I'm able to get the table:

import duckdb
import pandas as pd

cursor = duckdb.connect()

df = cursor.execute(f"""
                        INSTALL httpfs;
                        LOAD httpfs;
                        SET s3_region='{s3_region}';
                        SET s3_access_key_id='{access_key_id}';
                        SET s3_secret_access_key='{secret_access_key}';
                        
                        SELECT *
                            FROM parquet_scan(['s3://bucket/folder/fname.parquet'],
                                               FILENAME = 1);
                        """).df()

However, when I use file globbing, as explained in the docs (https://duckdb.org/docs/extensions/httpfs), I get a duckdb.Error: Invalid Error: HTTP GET error, which is an HTTP 403 (Access Denied).

SELECT *
    FROM parquet_scan(['s3://bucket/folder/*.parquet'],
                       FILENAME = 1);

I thought this was just an AWS IAM permissions issue, but I've given list and read access to the entire bucket, so as far as I know, it isn't that.

What is causing this error?

  • One odd thing is that I used boto3 to list objects with the same access keys as the query, and I was able to get the data. I guess a quick hack would be to use the output from boto3 list objects and concatenate the S3 URIs to pass to parquet_scan in the DuckDB query (see the first sketch after these comments). Not ideal, but doable. – A Simple Programmer Oct 19 '22 at 20:48
  • Have you tried something like ```SELECT * FROM parquet_scan('s3://bucket/folder/*.parquet', FILENAME = 1);``` – Mause Oct 20 '22 at 04:03
  • If you mean without the brackets, yes. Doesn't work, I'm afraid. Maybe glob matching isn't supported in parquet_scan, only read_parquet? – A Simple Programmer Oct 23 '22 at 15:03
  • When configured with the same access key, does `aws s3 ls s3://bucket/folder/` get the 403? If so, please show the IAM statement containing the s3:ListBucket action. – vdm Oct 25 '22 at 14:55
  • Have you tried setting `s3_session_token` ? – Nirdosh Gautam Jan 17 '23 at 10:13
  • I did not, but that appears to fix it. It is weird that I could access the S3 bucket when I was just scanning one Parquet file, but I needed the session token when using a glob (see the second sketch below). Does this mean we need to create a session token any time we want to query S3? That seems a bit much when you are just doing dev work on your own machine with your personal IAM credentials. I had a similar problem with HTTPFS calls in DuckDB before: https://stackoverflow.com/questions/74789412/aws-role-vs-iam-credential-in-duckdb-httpfs-call – A Simple Programmer Jan 19 '23 at 22:29
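
For reference, here is a minimal sketch of the boto3 workaround mentioned in the first comment: list the objects yourself and pass the explicit URIs to parquet_scan instead of a glob. The bucket and folder names are placeholders, and the credential SET statements from the question are assumed to have been run already.

import boto3
import duckdb

# Hypothetical bucket/prefix, just for illustration.
bucket = "bucket"
prefix = "folder/"

s3 = boto3.client("s3")

# Collect every Parquet key under the prefix (paginating in case there
# are more than 1000 objects).
uris = []
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
    for obj in page.get("Contents", []):
        if obj["Key"].endswith(".parquet"):
            uris.append(f"s3://{bucket}/{obj['Key']}")

cursor = duckdb.connect()
cursor.execute("INSTALL httpfs; LOAD httpfs;")
# ...run the SET s3_region / s3_access_key_id / s3_secret_access_key
# statements from the question here...

# Pass the explicit list of URIs instead of a glob.
uri_list = ", ".join(f"'{u}'" for u in uris)
df = cursor.execute(f"""
    SELECT *
    FROM parquet_scan([{uri_list}], FILENAME = 1);
""").df()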
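
And here is a minimal sketch of the s3_session_token fix that the last two comments arrived at, assuming temporary credentials (access key, secret key, and session token) are already at hand; the region and credential values are placeholders.

import duckdb

cursor = duckdb.connect()

# Placeholder values; in practice these come from your environment or an
# STS/assumed-role session.
cursor.execute("""
    INSTALL httpfs;
    LOAD httpfs;
    SET s3_region='us-east-1';
    SET s3_access_key_id='<access key id>';
    SET s3_secret_access_key='<secret access key>';
    SET s3_session_token='<session token>';
""")

df = cursor.execute("""
    SELECT *
    FROM parquet_scan(['s3://bucket/folder/*.parquet'], FILENAME = 1);
""").df()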

1 Answer


Here are the tests that we run in CI/CD to test S3 globbing. I don't see any issues with your syntax, so maybe it is a configuration or access issue of some kind?

https://github.com/duckdb/duckdb/blob/4a24d71edecc7c0018eb3860d2e104cfe90462b6/test/sql/copy/parquet/parquet_glob_s3.test