
This is on an AWS EMR cluster with 2 task nodes and a master.

I'm trying the hello-samza example, which launches a YARN job. The job gets stuck in the ACCEPTED state. I looked at other posts and it seems that YARN is getting no nodes. Any help on why YARN is not getting the task nodes would be appreciated.

[hadoop@xxx hello-samza]$ deploy/yarn/bin/yarn node -list
17/04/18 23:30:45 INFO client.RMProxy: Connecting to ResourceManager at /127.0.0.1:8032
Total Nodes:0
     Node-Id         Node-State Node-Http-Address   Number-of-Running-Containers

[hadoop@xxx hello-samza]$ deploy/yarn/bin/yarn application -list -appStates ALL
17/04/18 23:26:30 INFO client.RMProxy: Connecting to ResourceManager at /127.0.0.1:8032
Total number of applications (application-types: [] and states: [NEW, NEW_SAVING, SUBMITTED, ACCEPTED, RUNNING, FINISHED, FAILED, KILLED]):1
            Application-Id      Application-Name        Application-Type          User       Queue               State         Final-State         Progress                        Tracking-URL
application_1492557889328_0001    wikipedia-parser_1                   Samza        hadoop     default            ACCEPTED           UNDEFINED               0%                                 N/A
dvshekar
  • Let's see whether you have any unhealthy nodes. Post the output of `yarn node -list -all` – franklinsijo Apr 19 '17 at 06:12
  • [hadoop@xxx hello-samza]$ deploy/yarn/bin/yarn node -list ALL 17/04/19 16:48:59 INFO client.RMProxy: Connecting to ResourceManager at /127.0.0.1:8032 Total Nodes:0 Node-Id Node-State Node-Http-Address Number-of-Running-Containers – dvshekar Apr 19 '17 at 16:49
  • You have used a wrong argument. It is `-all` in lowercase. – franklinsijo Apr 19 '17 at 16:50
  • [hadoop@xxx hello-samza]$ deploy/yarn/bin/yarn node -list all 17/04/19 16:50:55 INFO client.RMProxy: Connecting to ResourceManager at /127.0.0.1:8032 Total Nodes:0 Node-Id Node-State Node-Http-Address Number-of-Running-Containers – dvshekar Apr 19 '17 at 16:51
  • Why do you miss the `-`? – franklinsijo Apr 19 '17 at 16:51
  • Sorry. Still same results. [hadoop@xxx hello-samza]$ deploy/yarn/bin/yarn node -list all - 17/04/19 16:53:24 INFO client.RMProxy: Connecting to ResourceManager at /127.0.0.1:8032 Total Nodes:0 Node-Id Node-State Node-Http-Address Number-of-Running-Containers – dvshekar Apr 19 '17 at 16:53
  • Sorry, you are using the wrong command again. The command is `yarn node -list -all`. – franklinsijo Apr 19 '17 at 16:54
  • sorry. I'm getting some output now. [hadoop@xxx hello-samza]$ deploy/yarn/bin/yarn node -list -all - 17/04/19 16:55:41 INFO client.RMProxy: Connecting to ResourceManager at /127.0.0.1:8032 Total Nodes:1 Node-Id Node-State Node-Http-Address Number-of-Running-Containers ixxx:34395 UNHEALTHY xxxl:8042 – dvshekar Apr 19 '17 at 16:56
  • You have only one node and that is UNHEALTHY. Please check the ResourceManager UI for the Cause. – franklinsijo Apr 19 '17 at 16:58
  • I tried to look at the YARN ResourceManager at http://xxx:8032 and it gave: It looks like you are making an HTTP request to a Hadoop IPC port. This is not the correct port for the web interface on this daemon. – dvshekar Apr 19 '17 at 18:18
  • The port number is `8088` – franklinsijo Apr 19 '17 at 18:18
  • I tried http://xxx:8088 and there are 4 active nodes. Now, the job marked as UNHEALTHY does not show up in the list. Also, I'm running on an AWS EMR machine and starting YARN manually. I don't know why the job only shows up on the command line and not in the console at port 8088. – dvshekar Apr 19 '17 at 18:21
  • You have 4 active nodes? The command listed only one node and it was unhealthy. The job wasn't marked unhealthy but the node was. And once you restart YARN all the history of jobs will be erased. – franklinsijo Apr 19 '17 at 18:25
  • I did not restart YARN yet. I'm saying that maybe EMR machines come with their own YARN? But I've not restarted YARN yet. I'm running the yarn application command and I see the job in the ACCEPTED state, but I do not see it in the UI at port 8088. – dvshekar Apr 19 '17 at 18:27
  • As per the post there are 2 task nodes in the cluster, so I do not understand how you can have 4 active nodes listed. Please verify the UI properly and provide more info on the nodemanagers' status. – franklinsijo Apr 19 '17 at 18:41
  • I figured out that EMR comes with its own YARN, and starting a separate YARN from the hello-samza app is not desirable. I changed the config yarn-site.xml to point to what Hadoop is configured with by default in EMR, so now I can see the 4 nodes. But the next problem: when I run hello-samza it still tries to reach the YARN ResourceManager at 127.0.0.1 instead of the address EMR uses for YARN (see the config sketch after these comments). – dvshekar Apr 19 '17 at 21:41
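
For reference, a minimal sketch of the kind of change described in the last comment: pointing the yarn-site.xml that the hello-samza client reads at EMR's ResourceManager instead of 127.0.0.1. The hostname below is a placeholder for the EMR master node, and the exact file the Samza scripts pick up (e.g. deploy/yarn/conf/yarn-site.xml, or whatever HADOOP_CONF_DIR / HADOOP_YARN_HOME points to) can vary by version, so treat this as an assumption to verify:

<!-- yarn-site.xml for the Samza client; the host below is a placeholder for the EMR master -->
<configuration>
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>ip-10-0-0-1.ec2.internal</value>
  </property>
  <property>
    <name>yarn.resourcemanager.address</name>
    <value>ip-10-0-0-1.ec2.internal:8032</value>
  </property>
</configuration>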

2 Answers


I made a complete answer for a similar case I experienced: have a look at it, it might be this kind of configuration issue.

zar3bski
  • David, instead of posting an answer which merely links to another answer, please instead [flag the question](https://stackoverflow.com/help/privileges/flag-posts) as a duplicate. – 4b0 Jun 01 '18 at 12:59
  • @Shree I wouldn't call it a duplicate per se, rather a different issue probably sharing the same causes – zar3bski Jun 01 '18 at 13:02
  • Your answer fixed all of my problems almost at once (see my comment there), after hours of hopeless googling via heaps of rubbish info. I am grateful. – Eugene Gr. Philippov Feb 02 '19 at 16:34
  • Glad I could help – zar3bski Feb 02 '19 at 17:51

It seems like the NodeManagers are not running on either node (either not started at all or exited with an error). Use the jps command to check whether all the daemons associated with YARN are running on the two nodes. Additionally, check both NodeManager logs to see if any exception might have killed them.
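
A rough illustration of that check (a sketch only: the daemon names are standard, but the log location is a typical EMR/Hadoop default and may differ on your install):

# on the master: expect ResourceManager in the output; on each task node: expect NodeManager
jps

# if NodeManager is missing, check its log for the exception that killed it
# (/var/log/hadoop-yarn/ is common on EMR; $HADOOP_HOME/logs on a plain install)
tail -n 100 /var/log/hadoop-yarn/yarn-yarn-nodemanager-*.log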

  • jps did not give me nodemanager and resourcemanager. I can see the job in the job tracker but FAILED. Application application_1492641052989_0008 failed 2 times due to AM Container for appattempt_1492641052989_0008_000002 exited with exitCode: -1000 For more detailed output, check application tracking page:http://xxxx.internal:8088/cluster/app/application_xxx Diagnostics: File file:/home/hadoop/samza/hello-samza/target/hello-samza-0.13.0-dist.tar.gz does not exist – dvshekar Apr 19 '17 at 23:38
  • @dvshekar Hope you have figured it out. But I am wondering if the file path is incorrect. Should it be "file://" ? Just a guess. In general, when you get this exception from Yarn, it means the RM is unable to localize your resource (which is the job package in this case). – Navina Ramesh May 01 '17 at 01:50
  • I think file:// was a typo on my side. I tried replicating the file on all nodes and I think that removed the error; only new errors started showing up. But I think the file actually needs to be in Hadoop HDFS (see the sketch below). I tried that and it still gave me some errors. – dvshekar May 02 '17 at 19:33
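
For anyone hitting the same "does not exist" error: a hedged sketch of the usual approach, i.e. putting the job package somewhere every node can localize it (HDFS) and pointing yarn.package.path at that location. The HDFS directory and the NameNode address below are placeholders to adjust for your cluster:

# upload the packaged job to HDFS so every NodeManager can localize it
hdfs dfs -mkdir -p /samza
hdfs dfs -put target/hello-samza-0.13.0-dist.tar.gz /samza/

# then, in the job's .properties file, reference the HDFS location instead of a local file: path
# yarn.package.path=hdfs://<namenode-host>:8020/samza/hello-samza-0.13.0-dist.tar.gz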