I have a GCS bucket in which the data is partitioned by year/month/day, plus a Dataproc cluster with 89 executors across 30 workers and 24g of memory per executor.
The problem: when I read the parquet files under 2016/5/*, only 1 worker is active, using about 21g of memory. The other 29 workers sit idle while that one worker loads all of the parquet files.

Is there a technique for reading parquet files that utilizes all 30 workers? Having a single worker do the whole read looks like a bottleneck.
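For reference, here is roughly what the read looks like, together with a repartition workaround I have been considering. The bucket path, the 4-cores-per-executor figure, and the `target_partitions` helper are all placeholders/assumptions, not my actual setup:

```python
# Hypothetical sketch; bucket name and core count are placeholders.
# Idea: after the read, repartition so every executor gets tasks.

def target_partitions(executors, cores_per_executor, tasks_per_core=3):
    """Rule of thumb: a few tasks per available core."""
    return executors * cores_per_executor * tasks_per_core

# Assuming 4 cores per executor on my 89-executor cluster:
n = target_partitions(executors=89, cores_per_executor=4)
print(n)  # → 1068

# With a live SparkSession this would look like:
# df = spark.read.parquet("gs://my-bucket/data/2016/5/*")
# df = df.repartition(n)  # spread the rows across all workers
```

I am not sure whether `repartition` is the right fix here, though, since the initial read itself still seems to land on one worker.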