How to scan partitioned parquet files from GCS into polars?

Asked Jun 07 '23 at 12:31

Active Jun 07 '23 at 12:31

Viewed 228 times

I am trying to scan a folder of multiple parquet files into a polars dataframe. In this question, the following is given as an answer using S3:

from pyarrow.dataset import dataset
import gcsfs
import polars as pl

# setup cloud filesystem access
cloudfs = gcsfs.GCSFileSystem(project="my-project")

# reference multiple parquet files
pyarrow_dataset = dataset(
    source = "gs://bucket/path/*.parquet",
    filesystem = cloudfs,
    format = 'parquet',
)

# load efficiently into polars
ldf = pl.scan_pyarrow_dataset( pyarrow_dataset )

When I change this to use the GCS filesystem, I get the following error:

AttributeError: 'GCSFileSystem' object has no attribute 'schema'

Is it possible to read multiple parquet files directly into a polars dataframe?

asked Jun 07 '23 at 12:31

EricLeer

On what command do you get that error? Can you do, for example, `cloudfs.ls("")`? – Dean MacGregor Jun 08 '23 at 23:54
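
One way to narrow this down is to pass pyarrow a directory rather than a glob, since the documented `source` for `pyarrow.dataset.dataset` is a file path, a directory, or a list of paths. The following is a minimal sketch under assumptions, not a confirmed fix for the `AttributeError`: `dataset_root` and `scan_gcs_parquet` are hypothetical helper names, the bucket path and project are the placeholders from the question, and `partitioning="hive"` assumes `key=value` directory names.

```python
def dataset_root(uri: str) -> str:
    """Turn 'gs://bucket/path/*.parquet' into 'bucket/path' (hypothetical helper)."""
    path = uri.removeprefix("gs://")
    # Drop a trailing glob component; pyarrow discovers files under a directory itself.
    head, _, tail = path.rpartition("/")
    if "*" in tail:
        path = head
    return path.rstrip("/")


def scan_gcs_parquet(uri: str, project: str):
    """Build a polars LazyFrame over a (possibly partitioned) parquet directory on GCS."""
    # Imported lazily so dataset_root can be used without these packages installed.
    import gcsfs
    import polars as pl
    import pyarrow.dataset as ds

    fs = gcsfs.GCSFileSystem(project=project)
    dataset = ds.dataset(
        dataset_root(uri),
        filesystem=fs,
        format="parquet",
        partitioning="hive",  # assumption: key=value directory layout
    )
    return pl.scan_pyarrow_dataset(dataset)
```

Calling `scan_gcs_parquet("gs://bucket/path/*.parquet", project="my-project")` would then return a lazy frame, assuming the credentials can list and read the bucket. If the error persists, running each step separately should show which call raises it.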


0 Answers