
I have data in a GCS bucket partitioned by year/month/day, plus a Dataproc cluster with 89 executors across 30 workers and 24g of memory per executor.

The question is: when I want to read the parquet files under 2016/5/*,

somehow only 1 worker is active, with 21g of memory used.

The other 29 workers sit idle while that single worker tries to load a large number of parquet files.

Is there any technique for reading the parquet files that would utilize all 30 workers? Having a single worker read all the parquet data sounds like a bottleneck.
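
A minimal sketch of the read described above, assuming a placeholder bucket and prefix (gs://my-bucket/data is not the real location); the partition count right after the read shows how many tasks Spark can actually spread across the workers:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("read-partitioned-parquet").getOrCreate()

# Read every day/hour folder under year 2016, month 5.
# gs://my-bucket/data is a placeholder for the real bucket and prefix.
df = spark.read.parquet("gs://my-bucket/data/2016/5/*")

# If this prints a very small number, the scan produced too few input splits
# and most of the load will land on a single executor.
print(df.rdd.getNumPartitions())
```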

ByanJati
  • Can you provide code samples? How many files are in the 2016/5 directory? Are you performing any transformations later that might end up grouping all data on a single executor? – Angus Davis Feb 03 '18 at 01:32
  • My code is just spark.read.parquet(gcspath); the 2016/5 directory contains up to 31*24 = 744 folders (number of days * 24 hours) under 2016/5/*, but I don't know the exact number of files under them. Maybe if the file count is too big, I have to group the files into bigger files? Maybe this is a lots-of-small-files problem? The code is just the Spark read, no transformations or grouping yet (a repartition/compaction sketch follows this comment thread). – ByanJati Feb 03 '18 at 01:38
  • For distributing the load across workers and making the job run faster, you can also try [Scaling clusters](https://cloud.google.com/dataproc/docs/concepts/configuring-clusters/scaling-clusters). – Digil Feb 05 '18 at 21:38
  • @Digil We have just 2 out of 30 workers utilized; scaling the cluster might only increase the number of underutilized workers. But thanks for the reply. – ByanJati Feb 06 '18 at 00:50
  • @ByanJati 'Scaling clusters' can also be used to decrease the number of workers. Since only 2 workers are utilized, you can use it to scale down to 2 workers or any other optimal number (less than 30). – Digil Feb 07 '18 at 22:20
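
Following up on the small-files idea raised in the comments, a hedged sketch (the paths and the target of 360 partitions are illustrative assumptions, not values from the question): repartitioning after the read spreads downstream work across the executors, and writing the result back compacts many small files into fewer, larger ones.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("compact-small-parquet-files").getOrCreate()

# Placeholder paths; substitute the real bucket and prefixes.
df = spark.read.parquet("gs://my-bucket/data/2016/5/*")

# Aim for a few partitions per executor core; 360 is a guess, tune for the cluster.
df = df.repartition(360)

# Optional one-off compaction: rewrite the month as fewer, larger parquet files.
df.write.mode("overwrite").parquet("gs://my-bucket/data-compacted/2016/5/")
```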

0 Answers