
I'd like to read a partitioned parquet file into a polars dataframe.

In Spark, it is simple:

df = spark.read.parquet("/my/path")

The polars documentation says that it should work the same way:

df = pl.read_parquet("/my/path")

But it gives me the error:

raise IsADirectoryError(f"Expected a file path; {path!r} is a directory")

How can I read this file?

  • Have you tried specifying a file path instead of a directory path? – mkrieger1 Apr 24 '23 at 14:48
  • I think the answer is no - partitions won't work the same as Spark. You'll need to provide one parquet file only. Otherwise, scan_parquet function accepts glob pattern – OneCricketeer Apr 24 '23 at 14:50

2 Answers


Here's a snippet of the source code:

if isinstance(source, str) and "*" in source and _is_local_file(source):
    from polars import scan_parquet

    scan = scan_parquet(
        source,
        n_rows=n_rows,
        rechunk=True,
        parallel=parallel,
        row_count_name=row_count_name,
        row_count_offset=row_count_offset,
        low_memory=low_memory,
    )

The important bit is that it's looking for an * in the source path.

So it seems you just need to do

df = pl.read_parquet("/my/path/*")

This only works on local filesystems, so if you're reading from cloud storage you'd have to use pyarrow datasets to read multiple files at once without iterating over them yourself.
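
If the partitions are nested in subdirectories (e.g. hive-style key=value folders), a recursive glob should pick up all the files; a minimal sketch, assuming a local hive-partitioned layout under /my/path:

import polars as pl

# recursively match every parquet file under the partition root
# (assumes a local, hive-style layout such as /my/path/year=2023/month=4/part-0.parquet)
df = pl.read_parquet("/my/path/**/*.parquet")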

Dean MacGregor
  • It is on cloud, so it is on the polars developers to add this functionality... :( – lmocsi Apr 25 '23 at 09:12
  • 2
    I don't think it's *on* the developers to do anything. The functionality exists *through* pyarrow dataset. Check out https://arrow.apache.org/docs/python/dataset.html#reading-from-cloud-storage and then https://pola-rs.github.io/polars/py-polars/html/reference/api/polars.scan_pyarrow_dataset.html – Dean MacGregor Apr 25 '23 at 14:35
  • also this https://stackoverflow.com/questions/74280212/polars-scan-s3-multi-part-parquet-files/74280532#74280532 – Dean MacGregor Apr 25 '23 at 14:37

As an example using S3 (since you say your files are cloud-hosted), you first establish a filesystem connection (via fsspec) and a dataset against it (as suggested by Dean) and then read into polars from that:

from pyarrow.dataset import dataset
from s3fs import S3FileSystem
import polars as pl

# setup cloud filesystem access
cloudfs = S3FileSystem( ... )

# reference multiple parquet files
pyarrow_dataset = dataset(
    source = "s3://bucket/path/*.parquet",
    filesystem = cloudfs,
    format = 'parquet',
)

# load efficiently into polars
ldf = pl.scan_pyarrow_dataset( pyarrow_dataset )
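
Note that scan_pyarrow_dataset returns a LazyFrame rather than a DataFrame, so a final collect() materializes the result; a minimal sketch continuing from the code above ("year" here is a hypothetical partition column):

# the scan is lazy: filters can be pushed down to the dataset scan,
# and collect() materializes the result as an in-memory DataFrame
df = ldf.filter(pl.col("year") == 2023).collect()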