
I am trying to read Parquet files using Spark. If I want to read the data for June, I'll do the following:

"gs://bucket/Data/year=2021/month=6/file.parquet"

if I want to read the data for all the months, I'll do the following:

"gs://bucket/Data/year=2021/month=*/file.parquet"

if I want to read the first two days of May:

"gs://bucket/Data/year=2021/month=5/day={1,2}file.parquet"

if I want to read November and December:

"gs://bucket/Data/year=2021/month={11,12}/file.parquet"

You get the idea... but what if I have a dictionary of month/days key-value pairs, for example {1: [1,2,3], 4: [10,11,12,13]}, meaning that I need to read days [1,2,3] from January and days [10,11,12,13] from April? How would I reflect that as a wildcard in the path?

Thank you

  • It looks like the folder structure is already partitioned, so I think simply reading the whole dataset with a filter is enough. The filter will be pushed down and only the matching partitions will be read. – Lamanus Dec 18 '21 at 14:41
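A minimal sketch of that filter approach, assuming a SparkSession named `spark`, the bucket layout from the question, and the `months_dict` name from the answer below; the `year`/`month`/`day` column names come from the directory layout:

```python
from functools import reduce
from pyspark.sql import functions as F

months_dict = {1: [1, 2, 3], 4: [10, 11, 12, 13]}

# Read from the partition root so Spark discovers year/month/day
# as partition columns.
df = spark.read.parquet("gs://bucket/Data")

# Build one condition per month, then OR them together.
cond = reduce(
    lambda a, b: a | b,
    [(F.col("month") == m) & F.col("day").isin(days) for m, days in months_dict.items()],
)

# Filters on partition columns are pushed down, so only the
# matching month/day directories are actually scanned.
df_filtered = df.filter((F.col("year") == 2021) & cond)
```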

1 Answer


You can pass a list of paths to DataFrameReader:

months_dict = {1: [1, 2, 3], 4: [10, 11, 12, 13]}

# Build one glob path per month, joining the days into a {d1,d2,...} pattern.
paths = [
    f"gs://bucket/Data/year=2021/month={m}/day={{{','.join(str(d) for d in days)}}}/*.parquet"
    for m, days in months_dict.items()
]

print(paths)
# ['gs://bucket/Data/year=2021/month=1/day={1,2,3}/*.parquet', 'gs://bucket/Data/year=2021/month=4/day={10,11,12,13}/*.parquet']

df = spark.read.parquet(*paths)
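One caveat worth adding to this answer: when the paths you pass point below the partition root, Spark's partition discovery does not treat year/month/day as columns of the resulting DataFrame. Per the Spark SQL partition-discovery documentation, you can set the `basePath` option to keep them:

```python
# Anchor partition discovery at the dataset root so year/month/day
# remain columns of the DataFrame.
df = (
    spark.read
    .option("basePath", "gs://bucket/Data")
    .parquet(*paths)
)
```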