I am trying to scan a folder containing multiple Parquet files into a Polars dataframe. An answer to a related question gives the following approach for S3 (shown here with the GCS filesystem swapped in):

from pyarrow.dataset import dataset
import gcsfs
import polars as pl

# setup cloud filesystem access
cloudfs = gcsfs.GCSFileSystem(project="my-project")

# reference multiple parquet files
pyarrow_dataset = dataset(
    source = "gs://bucket/path/*.parquet",
    filesystem = cloudfs,
    format = 'parquet',
)

# load efficiently into polars
ldf = pl.scan_pyarrow_dataset( pyarrow_dataset )

When I change this to use the GCS filesystem, I get the following error:

AttributeError: 'GCSFileSystem' object has no attribute 'schema'

Is it possible to read multiple parquet files directly into a polars dataframe?
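
For reference, here is a minimal sketch of one way this might be done. It is untested against a real bucket and rests on a few assumptions: `gcsfs`, `pyarrow` and `polars` are installed, pyarrow accepts the fsspec-based `GCSFileSystem` through its `filesystem=` keyword, and the glob is expanded with `fs.glob` instead of being handed to `dataset()` as a pattern. The names `strip_scheme` and `scan_gcs_parquet` are hypothetical helpers, not part of any library. The error message suggests that a `GCSFileSystem` object may have ended up where a pyarrow dataset was expected, so the sketch passes every argument by keyword and only gives the finished dataset to `pl.scan_pyarrow_dataset`.

```python
def strip_scheme(uri: str) -> str:
    """Drop a leading 'gs://' so the path matches what fs.glob() returns.

    Hypothetical helper for this sketch, not a gcsfs or pyarrow function.
    """
    prefix = "gs://"
    return uri[len(prefix):] if uri.startswith(prefix) else uri


def scan_gcs_parquet(pattern: str, project: str):
    """Lazily scan all Parquet files matching a glob pattern on GCS.

    Hypothetical helper; assumes gcsfs, pyarrow and polars are installed.
    """
    # Imported here so the module itself loads without the cloud dependencies.
    import gcsfs
    import polars as pl
    import pyarrow.dataset as ds

    fs = gcsfs.GCSFileSystem(project=project)

    # Expand the glob ourselves; the result is a list of 'bucket/path/file' strings.
    files = fs.glob(strip_scheme(pattern))

    # Pass everything by keyword so the filesystem cannot land in another parameter.
    dataset = ds.dataset(source=files, format="parquet", filesystem=fs)

    # Only the pyarrow dataset (not the filesystem) goes to Polars.
    return pl.scan_pyarrow_dataset(dataset)
```

With these assumptions, `scan_gcs_parquet("gs://bucket/path/*.parquet", project="my-project")` should return a LazyFrame, and calling `.collect()` on it would load the data.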

EricLeer