I have a miserably long-running job that reads in a dataset with a natural, logical partition on US state. I have saved it as a partitioned parquet dataset from pandas using fastparquet (via DataFrame.to_parquet).

I want my buddy to be able to read in just a single partition (one state) from the parquet folder that gets created. As far as I can tell, read_parquet doesn't have a filter option. Any thoughts?

user3502355

1 Answer

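Try using either dask or the pyarrow parquet reader: load the dataset, then filter the resulting DataFrame on the column you care about. Filtering the loaded data in pandas has worked for me.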

See also: How to read parquet file with a condition using pyarrow in Python

RUN pip install pyarrow
RUN pip install "dask[complete]"

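import dask.dataframe as dd
import pyarrow.parquet as pq

# Root folder of the partitioned parquet dataset (left blank here).
path = ""

# Option 1: dask. The read is lazy, so nothing is loaded until .compute().
dask_df = dd.read_parquet(path, columns=["col1", "col2"], engine="pyarrow")

# Still lazy: call dask_filter_df.compute() to get a pandas DataFrame.
dask_filter_df = dask_df[dask_df.col1 == "filter here"]

# Option 2: pyarrow. Loads the whole dataset into pandas, then filters in memory.
parquet_pandas_df = pq.ParquetDataset(path).read_pandas().to_pandas()

pandas_filter_df = parquet_pandas_df[parquet_pandas_df.col1 == "filter here"]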
thePurplePython