I am spawning a Spark cluster using an Azure Function inside an Azure VNet. Each Spark node runs in a separate Azure Container Instance (ACI) group. It works like this: I spawn the master node's ACI group, get its IP address, and then spawn the worker nodes' ACI groups, passing the master's IP address to each worker as it is created. The problem is that when I submit a job with spark-submit, the job never finishes and I get the following errors:

19/05/08 13:35:26 INFO BlockManagerMaster: Removal of executor 1 requested
19/05/08 13:35:26 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Asked to remove non-existent executor 1
19/05/08 13:35:26 INFO StandaloneAppClient$ClientEndpoint: Executor added: app-20190508133523-0015/2 on worker-20190508124850-10.0.0.6-33015 (10.0.0.6:33015) with 1 core(s)
19/05/08 13:35:26 INFO StandaloneSchedulerBackend: Granted executor ID app-20190508133523-0015/2 on hostPort 10.0.0.6:33015 with 1 core(s), 1024.0 MB RAM
19/05/08 13:35:26 INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-20190508133523-0015/2 is now RUNNING
19/05/08 13:35:28 INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-20190508133523-0015/2 is now EXITED (Command exited with code 1)
19/05/08 13:35:28 INFO StandaloneSchedulerBackend: Executor app-20190508133523-0015/2 removed: Command exited with code 1
19/05/08 13:35:28 INFO BlockManagerMasterEndpoint: Trying to remove executor 2 from BlockManagerMaster.
19/05/08 13:35:28 INFO BlockManagerMaster: Removal of executor 2 requested
19/05/08 13:35:28 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Asked to remove non-existent executor 2
19/05/08 13:35:28 INFO StandaloneAppClient$ClientEndpoint: Executor added: app-20190508133523-0015/3 on worker-20190508124850-10.0.0.6-33015 (10.0.0.6:33015) with 1 core(s)
19/05/08 13:35:28 INFO StandaloneSchedulerBackend: Granted executor ID app-20190508133523-0015/3 on hostPort 10.0.0.6:33015 with 1 core(s), 1024.0 MB RAM
19/05/08 13:35:28 INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-20190508133523-0015/3 is now RUNNING
19/05/08 13:35:30 INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-20190508133523-0015/3 is now EXITED (Command exited with code 1)
19/05/08 13:35:30 INFO StandaloneSchedulerBackend: Executor app-20190508133523-0015/3 removed: Command exited with code 1
19/05/08 13:35:30 INFO BlockManagerMasterEndpoint: Trying to remove executor 3 from BlockManagerMaster.
19/05/08 13:35:30 INFO BlockManagerMaster: Removal of executor 3 requested
19/05/08 13:35:30 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Asked to remove non-existent executor 3
19/05/08 13:35:30 INFO StandaloneAppClient$ClientEndpoint: Executor added: app-20190508133523-0015/4 on worker-20190508124850-10.0.0.6-33015 (10.0.0.6:33015) with 1 core(s)
19/05/08 13:35:30 INFO StandaloneSchedulerBackend: Granted executor ID app-20190508133523-0015/4 on hostPort 10.0.0.6:33015 with 1 core(s), 1024.0 MB RAM
19/05/08 13:35:30 INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-20190508133523-0015/4 is now RUNNING
19/05/08 13:35:32 INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-20190508133523-0015/4 is now EXITED (Command exited with code 1)
19/05/08 13:35:32 INFO StandaloneSchedulerBackend: Executor app-20190508133523-0015/4 removed: Command exited with code 1
19/05/08 13:35:32 INFO BlockManagerMasterEndpoint: Trying to remove executor 4 from BlockManagerMaster.
19/05/08 13:35:32 INFO BlockManagerMaster: Removal of executor 4 requested

After a lot of research, I discovered that I have to add entries to the /etc/hosts file of every node, covering all the nodes in the cluster, something like the following:

10.0.0.4 spark-master
10.0.0.5 spark-worker-1
10.0.0.6 spark-worker-2
10.0.0.7 spark-driver

I added the entries above manually, and the job then completed successfully.
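For illustration, a minimal Python sketch of generating those lines. It assumes the Azure Function already holds a hostname-to-private-IP mapping for every container group (how it obtains that mapping is exactly the open question). The function name `render_hosts_entries` is hypothetical; the hostnames and IPs are the ones listed above.

```python
from typing import Dict


def render_hosts_entries(nodes: Dict[str, str]) -> str:
    """Return one 'IP hostname' line per node, in insertion order.

    Assumption: `nodes` maps each cluster hostname to its private VNet IP.
    """
    return "".join(f"{ip} {name}\n" for name, ip in nodes.items())


if __name__ == "__main__":
    cluster = {
        "spark-master": "10.0.0.4",
        "spark-worker-1": "10.0.0.5",
        "spark-worker-2": "10.0.0.6",
        "spark-driver": "10.0.0.7",
    }
    print(render_hosts_entries(cluster), end="")
```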

However, how do I add these entries programmatically (using the Azure Function itself)? That is, once the Azure container instances are running, how do I get the IP address and hostname of every node in the cluster and add an entry for each of them to the /etc/hosts file of every other node?
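To make the per-node step concrete, here is a minimal Python sketch of what each container would need to run once it has the full mapping. It assumes the mapping is somehow delivered to every container after all ACI groups are running (that delivery is the unanswered part) and that the container process is allowed to write /etc/hosts. `merge_hosts` and `update_hosts_file` are hypothetical names. Existing lines for the cluster hostnames are replaced, so running it twice does not duplicate entries.

```python
from typing import Dict


def merge_hosts(existing: str, nodes: Dict[str, str]) -> str:
    """Return hosts-file text containing exactly one line per cluster node.

    Lines that already name any cluster hostname are dropped and replaced;
    all other lines (e.g. localhost) are kept unchanged.
    """
    names = set(nodes)
    kept = []
    for line in existing.splitlines():
        # Strip trailing comments before looking at the hostname fields.
        fields = line.split("#", 1)[0].split()
        if fields and names.intersection(fields[1:]):
            continue
        kept.append(line)
    kept.extend(f"{ip} {name}" for name, ip in nodes.items())
    return "\n".join(kept) + "\n"


def update_hosts_file(nodes: Dict[str, str], path: str = "/etc/hosts") -> None:
    """Rewrite the hosts file in place with the merged entries.

    In Docker, /etc/hosts is usually bind-mounted, so it is overwritten in
    place here rather than replaced via rename (assumed to apply to ACI too).
    """
    with open(path) as f:
        current = f.read()
    with open(path, "w") as f:
        f.write(merge_hosts(current, nodes))
```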

Kamal Nandan