How can I query parquet files with the Polars Python API?

Question

I have a .parquet file, and would like to use Python to quickly and efficiently query that file by a column.

For example, I might have a column name in that .parquet file and want to get back the first (or all of) the rows with a chosen name.

How can I query a parquet file like this in the Polars API, or possibly FastParquet (whichever is faster)?

I thought pl.scan_parquet might be helpful, but I couldn't see how to use it for this, or perhaps I just didn't understand it. Preferably, though it is not essential, the solution would not read the entire file into memory first, to reduce memory and CPU usage.

I thank you for your help.

Did you try to use it? e.g. `df = pl.scan_parquet(...).filter(pl.col("name") == chosen_name).collect()` — jqurious, Feb 17 '23 at 17:32
I have not as I am very new to Polars. I will try that. Thank you — SamTheProgrammer, Feb 17 '23 at 18:56

score 1 · Accepted Answer · answered Aug 29 '23 at 13:32

Speaking for fastparquet...

Fastparquet is a library for quickly loading parquet data into a pandas dataframe. You didn't say what query you wanted to run on it, but that would be up to pandas (and probably quite fast). Fastparquet does allow a number of options in the loading stage, for instance to filter values, pick columns, or choose dtypes. These can all make a significant difference to load time, but they will affect what queries you can then do. Without knowing the latter, we cannot advise on the former (and polars would agree).
