I created a Google Dataproc cluster with 2 workers, using n1-standard-4 VMs for the master and the workers.

I want to submit jobs to this cluster and have them run sequentially (as on AWS EMR): while the first job is running, the next job should wait in the pending state, and it should start only after the first job completes.

I tried submitting several jobs to the cluster, but they all ran in parallel; none of them went to the pending state.

Is there any configuration that I can set on the Dataproc cluster so that all jobs run sequentially?

Update: I changed the following files:

/etc/hadoop/conf/yarn-site.xml

  <property>
    <name>yarn.resourcemanager.scheduler.class</name>
    <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler</value>
  </property>
  <property>
    <name>yarn.scheduler.fair.user-as-default-queue</name>
    <value>false</value>
  </property>
  <property>
    <name>yarn.scheduler.fair.allocation.file</name>
    <value>/etc/hadoop/conf/fair-scheduler.xml</value>
  </property>

/etc/hadoop/conf/fair-scheduler.xml

<?xml version="1.0" encoding="UTF-8"?>
<allocations>
   <queueMaxAppsDefault>1</queueMaxAppsDefault>
</allocations>

After that, I restarted the ResourceManager on the master node with systemctl restart hadoop-yarn-resourcemanager so that the changes would take effect. But the jobs still run in parallel.

Igor Dvorzhak
Neo-coder

1 Answer

Dataproc tries to execute submitted jobs in parallel if resources are available.

To achieve sequential execution you may want to use some orchestration solution, either Dataproc Workflows or Cloud Composer.
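As a rough illustration of the workflow approach, the sketch below builds the body of a Dataproc workflow template in which each step lists the previous step as a prerequisite, so the steps run one at a time. The field names (`stepId`, `prerequisiteStepIds`, `pysparkJob`, `clusterSelector`) follow the workflow template REST resource as I understand it and should be checked against the current API reference; the template ID, cluster label, and `gs://` paths are hypothetical placeholders.

```python
import json


def sequential_workflow(template_id, cluster_labels, main_files):
    """Build a workflow template body whose steps form a linear chain."""
    jobs = []
    for index, uri in enumerate(main_files):
        step = {
            "stepId": f"step-{index}",
            "pysparkJob": {"mainPythonFileUri": uri},
        }
        if index > 0:
            # Each step waits for the one before it, which forces sequential execution.
            step["prerequisiteStepIds"] = [f"step-{index - 1}"]
        jobs.append(step)
    return {
        "id": template_id,
        # Assumption: the jobs target an existing cluster selected by label.
        "placement": {"clusterSelector": {"clusterLabels": cluster_labels}},
        "jobs": jobs,
    }


if __name__ == "__main__":
    # Hypothetical names and paths, for illustration only.
    template = sequential_workflow(
        "sequential-jobs",
        {"env": "demo"},
        ["gs://my-bucket/job_a.py", "gs://my-bucket/job_b.py"],
    )
    print(json.dumps(template, indent=2))
```

The resulting dictionary only describes the template; it still has to be submitted through the Dataproc API or client of your choice.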

Alternatively, you may want to configure the YARN Fair Scheduler on Dataproc and set the queueMaxAppsDefault property to 1, so that each queue runs at most one application at a time.
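One way to apply these settings at cluster creation time, rather than editing files on a running cluster, is the `--properties` flag of `gcloud dataproc clusters create`, where a `yarn:` prefix targets yarn-site.xml. The sketch below only assembles such a command as an argument list and does not run it. It assumes that an initialization action (the script URI here is a hypothetical placeholder) writes the fair-scheduler.xml allocation file, and it omits other flags your project may require, such as the region.

```python
FAIR_SCHEDULER = (
    "org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler"
)


def yarn_properties(allocation_file):
    """Return the comma-separated value for --properties (yarn-site.xml keys)."""
    props = {
        "yarn.resourcemanager.scheduler.class": FAIR_SCHEDULER,
        "yarn.scheduler.fair.user-as-default-queue": "false",
        "yarn.scheduler.fair.allocation.file": allocation_file,
    }
    return ",".join(f"yarn:{key}={value}" for key, value in props.items())


def create_cluster_command(
    cluster,
    init_action_uri,
    allocation_file="/etc/hadoop/conf/fair-scheduler.xml",
):
    """Assemble (but do not execute) a cluster creation command."""
    return [
        "gcloud", "dataproc", "clusters", "create", cluster,
        "--properties", yarn_properties(allocation_file),
        # Assumption: this script creates the allocation file on the cluster.
        "--initialization-actions", init_action_uri,
    ]


if __name__ == "__main__":
    # Hypothetical cluster name and bucket path.
    print(" ".join(create_cluster_command(
        "my-cluster", "gs://my-bucket/write-fair-scheduler.sh")))
```

Setting the properties at creation time avoids having to restart the ResourceManager by hand after editing the files.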

Igor Dvorzhak
  • I tried the `fair scheduler`, but all jobs still run in parallel. – Neo-coder Mar 04 '19 at 11:17
  • Try setting the `yarn.scheduler.fair.user-as-default-queue` YARN property to `false`: https://stackoverflow.com/a/43194378/3227693 – Igor Dvorzhak Mar 04 '19 at 15:33
  • Could you share your fair scheduler configuration and the init action that you use to configure it? – Igor Dvorzhak Mar 04 '19 at 15:34
  • I also made these changes on all worker nodes and restarted the service, but the issue is still the same. – Neo-coder Mar 12 '19 at 10:07
  • How did you update these configuration files? Could you share the cluster creation command and the init actions that you used? – Igor Dvorzhak Mar 13 '19 at 00:28
  • You may want to follow these instructions to configure the fair scheduler on Dataproc: https://stackoverflow.com/a/49693693/3227693 – Igor Dvorzhak Mar 13 '19 at 00:33
  • Thanks for this link, it works like a charm! But I have one issue: the jobs dashboard shows the job status as `running`, while the detailed job logs show that the job is in a waiting state. Is it possible to show the status as `pending` on the jobs dashboard for all incoming jobs instead? – Neo-coder Mar 13 '19 at 15:49
  • This is because, from Dataproc's point of view, the job is running (i.e. it was started and submitted to YARN), but YARN puts it in a waiting state. The Dataproc job status is a different thing from the YARN application status, and there is no way to surface the YARN status on the Dataproc jobs page. – Igor Dvorzhak Mar 13 '19 at 17:15
  • Okay, got it. Thanks for your help. – Neo-coder Mar 13 '19 at 17:28
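Since the Dataproc jobs page does not show the YARN state, one option is to ask YARN directly. The sketch below builds a URL for the ResourceManager REST endpoint `/ws/v1/cluster/apps` and summarizes a response body; applications that YARN has queued but not yet started are typically reported in the `ACCEPTED` state. The host name, the default port 8088, and the sample payload are assumptions for illustration, and the code performs no network call itself.

```python
import json
from urllib.parse import urlencode


def apps_url(master_host, states=("ACCEPTED", "RUNNING"), port=8088):
    """Build the ResourceManager URL that lists applications in the given states."""
    query = urlencode({"states": ",".join(states)})
    return f"http://{master_host}:{port}/ws/v1/cluster/apps?{query}"


def summarize_apps(payload):
    """Return (id, name, state) tuples from a cluster apps JSON response body."""
    # "apps" can be null when nothing matches, so fall back to an empty dict.
    apps = (json.loads(payload).get("apps") or {}).get("app", [])
    return [(app["id"], app["name"], app["state"]) for app in apps]


if __name__ == "__main__":
    # Hypothetical response body, trimmed to the fields used here.
    sample = json.dumps({"apps": {"app": [
        {"id": "application_1_0001", "name": "job-a", "state": "RUNNING"},
        {"id": "application_1_0002", "name": "job-b", "state": "ACCEPTED"},
    ]}})
    print(apps_url("my-cluster-m"))
    for app_id, name, state in summarize_apps(sample):
        print(app_id, name, state)
```

The URL can be fetched from the master node or through whatever access path you already use for the YARN web UI.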