
I have a .yaml file containing 5 independent PySpark jobs, meaning all 5 should run concurrently on GCP Dataproc, and I have scheduled this .yaml file in crontab to run every 30 minutes.
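For reference, a minimal sketch of what such a setup might look like. The bucket paths, step IDs, cluster name, and region below are assumptions, not taken from the question. In a Dataproc workflow template, steps that declare no `prerequisiteStepIds` have no ordering dependency on each other, so they are all eligible to start concurrently:

```yaml
# jobs.yaml -- hypothetical Dataproc workflow template with two of the
# five independent PySpark jobs shown; no step lists prerequisiteStepIds,
# so nothing forces sequential execution at the template level.
jobs:
  - stepId: job-1
    pysparkJob:
      mainPythonFileUri: gs://my-bucket/job1.py
  - stepId: job-2
    pysparkJob:
      mainPythonFileUri: gs://my-bucket/job2.py
placement:
  clusterSelector:
    clusterLabels:
      goog-dataproc-cluster-name: my-cluster
```

with a crontab entry along the lines of:

```shell
# Instantiate the template every 30 minutes (region is a placeholder).
*/30 * * * * gcloud dataproc workflow-templates instantiate-from-file --file=jobs.yaml --region=us-central1
```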

The cluster has enough memory to run all of these jobs in parallel. But sometimes, all of a sudden, the jobs start executing one by one even though they are meant to run in parallel. If I restart the cluster, the jobs behave as expected again and execute in parallel.

Could anyone tell me if I am missing something, or whether I need to add anything on the configuration side?

This issue appears often and is resolved only by restarting the cluster, which is not a recommended approach since many services are running inside the cluster. I would like to know the root cause of this problem as well as the solution.
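Before restarting, it may help to capture the cluster's scheduling state while the jobs are serialized, to see whether Dataproc or YARN is the one holding jobs back. A hedged diagnostic sketch (cluster name and region are placeholders; these commands only report state, they change nothing):

```shell
# From anywhere with gcloud: list active Dataproc jobs on the cluster to
# see which are actually running vs. still pending (placeholder values).
gcloud dataproc jobs list --cluster=my-cluster --region=us-central1 \
  --filter="status.state=ACTIVE"

# On the cluster's master node: check whether YARN shows the applications
# as RUNNING or stuck in ACCEPTED. Apps sitting in ACCEPTED despite free
# cluster memory usually point at a YARN scheduler limit (queue capacity,
# ApplicationMaster resource cap) rather than total memory.
yarn application -list -appStates RUNNING,ACCEPTED

# Inspect the capacity-scheduler settings that commonly serialize jobs,
# e.g. yarn.scheduler.capacity.maximum-am-resource-percent.
cat /etc/hadoop/conf/capacity-scheduler.xml
```

If the applications are in `ACCEPTED`, that output would be useful to attach to a support ticket, since it narrows the problem to YARN scheduling rather than the cron/YAML submission path.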

  • this looks like an internal issue, hence I would advise filing a support [ticket](https://cloud.google.com/support/docs) with GCP, or you can create a thread on GCP's [issuetracker](https://cloud.google.com/support/docs/issue-trackers) explaining your issue. – Sakshi Gatyan Mar 02 '23 at 14:40

0 Answers