
I'd like to read a partitioned parquet file into a polars dataframe.

In Spark, it is simple:

df = spark.read.parquet("/my/path")

The polars documentation says that it should work the same way:

df = pl.read_parquet("/my/path")

But it gives me the error:

raise IsADirectoryError(f"Expected a file path; {path!r} is a directory")

How can I read this file?

  • Have you tried specifying a file path instead of a directory path? – mkrieger1 Apr 24 '23 at 14:48
  • I think the answer is no - partitions won't work the same as Spark. You'll need to provide one parquet file only. Otherwise, scan_parquet function accepts glob pattern – OneCricketeer Apr 24 '23 at 14:50

2 Answers


Here's a snippet of the source code:

if isinstance(source, str) and "*" in source and _is_local_file(source):
    from polars import scan_parquet

    scan = scan_parquet(
        source,
        n_rows=n_rows,
        rechunk=True,
        parallel=parallel,
        row_count_name=row_count_name,
        row_count_offset=row_count_offset,
        low_memory=low_memory,
    )

The important bit is that it's looking for an * in the source path.

So it seems you just need to do

df = pl.read_parquet("/my/path/*")

This only works on local filesystems, so if you're reading from cloud storage you'd have to use pyarrow datasets to read multiple files at once without iterating over them yourself.
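
If the partitions are nested in subdirectories (e.g. hive-style key=value folders), a recursive glob should pick up all the files; a minimal sketch, assuming a local hive-partitioned layout under /my/path:

import polars as pl

# recursively match every parquet file under the partition root
# (assumes a local, hive-style layout such as /my/path/year=2023/month=4/part-0.parquet)
df = pl.read_parquet("/my/path/**/*.parquet")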

Dean MacGregor
  • It is on cloud, so it is on the polars developers to add this functionality... :( – lmocsi Apr 25 '23 at 09:12
  • 2
    I don't think it's *on* the developers to do anything. The functionality exists *through* pyarrow dataset. Check out https://arrow.apache.org/docs/python/dataset.html#reading-from-cloud-storage and then https://pola-rs.github.io/polars/py-polars/html/reference/api/polars.scan_pyarrow_dataset.html – Dean MacGregor Apr 25 '23 at 14:35
  • also this https://stackoverflow.com/questions/74280212/polars-scan-s3-multi-part-parquet-files/74280532#74280532 – Dean MacGregor Apr 25 '23 at 14:37

As an example using S3 (since you say your files are cloud-hosted), you first establish a filesystem connection (via fsspec) and a dataset against it (as suggested by Dean) and then read into polars from that:

from pyarrow.dataset import dataset
from s3fs import S3FileSystem
import polars as pl

# setup cloud filesystem access
cloudfs = S3FileSystem( ... )

# reference multiple parquet files
pyarrow_dataset = dataset(
    source = "s3://bucket/path/*.parquet",
    filesystem = cloudfs,
    format = 'parquet',
)

# load efficiently into polars
ldf = pl.scan_pyarrow_dataset( pyarrow_dataset )
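
Note that scan_pyarrow_dataset returns a LazyFrame rather than a DataFrame, so a final collect() materializes the result; a minimal sketch continuing from the code above ("year" here is a hypothetical partition column):

# the scan is lazy: filters can be pushed down to the dataset scan,
# and collect() materializes the result as an in-memory DataFrame
df = ldf.filter(pl.col("year") == 2023).collect()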