1

I set up a Mesos cluster running the Apache Aurora framework and registered 100 cron jobs that run every minute on a pool of 5 slave machines. After being scheduled about 100 times, the cron jobs piled up in the "PENDING" state. Which logs can I inspect, and what might the problem be? [screenshot]

Jonas
Ken Chen

1 Answer

2

It could be a couple of things:

  • Do you still have sufficient resources in your cluster?
  • Are those resources offered to Aurora? Or maybe only to another framework?
  • Do you have any task constraints that prevent your tasks from being scheduled?
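The first two questions can be checked against the state that the Mesos master exposes over HTTP. The following is a minimal sketch, not an official tool: it assumes the master serves JSON at `/master/state` (older releases used `/master/state.json`) and that each agent entry carries `resources` and `used_resources` maps with `cpus` and `mem` keys. The function names and the master URL are hypothetical.

```python
import json
import urllib.request


def fetch_state(master_url):
    """Download and parse the master state (endpoint path is an assumption)."""
    with urllib.request.urlopen(master_url + "/master/state") as resp:
        return json.load(resp)


def free_resources(state):
    """Sum the unused cpus and mem (MB) over all agents in a parsed state dict."""
    free = {"cpus": 0.0, "mem": 0.0}
    for agent in state.get("slaves", []):
        total = agent.get("resources", {})
        used = agent.get("used_resources", {})
        for key in free:
            free[key] += total.get(key, 0) - used.get(key, 0)
    return free


if __name__ == "__main__":
    # Hypothetical master address; replace with your own.
    print(free_resources(fetch_state("http://mesos-master.example:5050")))
```

If the totals look healthy but tasks still stay in PENDING, the resources are probably not reaching Aurora as offers, which points back to the second question.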

Possible information source:

  • What does the tooltip or the expanded status say on the UI? (as shown in the screenshot)
  • The Aurora scheduler has log files. However, an end user normally does not need them to figure out why tasks are stuck in PENDING.

If you are still stuck, it is probably best to drop by the #aurora IRC channel on freenode.

buczek
serb
  • This is all great advice. The one thing I'd add: if you're in need of more direct help, feel free to drop by our IRC channel: #aurora on irc.freenode.net or subscribe to the Aurora users list (more details here: http://aurora.apache.org/community/). – Joshua Cohen Apr 29 '16 at 18:04
  • Thanks for the advice. (1) The executable is a very simple program that opens a file and writes a number. I think the cluster has sufficient resources, since I have 40 cores and 40 GB of memory in total, but I would like to check the logs for a resource issue. (2) Aurora is the only framework running on Mesos. (3) There are no task constraints. After expanding "Status", it says "a minute ago - PENDING". Where can I find the Aurora logs? /var/log/aurora? There is nothing interesting there. – Ken Chen May 01 '16 at 02:47
  • Is the framework run with a reachable IP address, i.e. not 127.0.0.1? What does the Mesos Master log say? – Tobi May 02 '16 at 15:43
  • mesos-master.ERROR shows `E0504 15:02:58.156879 8382 socket.hpp:174] Shutdown failed on fd=34: Transport endpoint is not connected [107]` and mesos-master.WARNING shows `W0504 15:04:47.189374 8374 master.hpp:1532] Master attempted to send message to disconnected framework 3fe16547-091a-4b24-8646-250f489dcbb3-0001 (TwitterScheduler) at scheduler-b3334e43-7e13-4d51-820d-6b670513ac7a@127.0.0.1:8083` – Ken Chen May 04 '16 at 14:06
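The warning above shows the framework registered at `127.0.0.1:8083`, which is exactly the case the previous comment asks about: a master on another machine cannot reach a scheduler that advertises a loopback address. As a rough sketch (the helper names are made up, and the regular expression only assumes the `...@<ipv4>:<port>` shape seen in that log line), the advertised address can be pulled out of the master log and tested like this:

```python
import ipaddress
import re

# Matches the "@<ipv4>:<port>" suffix of a scheduler address in a master log line.
ADDR_RE = re.compile(r"@(\d{1,3}(?:\.\d{1,3}){3}):(\d+)")


def framework_address(log_line):
    """Return (ip, port) advertised by the framework, or None if not found."""
    match = ADDR_RE.search(log_line)
    if match is None:
        return None
    return ipaddress.ip_address(match.group(1)), int(match.group(2))


def advertises_loopback(log_line):
    """True if the framework address in the log line is a loopback address."""
    addr = framework_address(log_line)
    return addr is not None and addr[0].is_loopback


line = (
    "W0504 15:04:47.189374 8374 master.hpp:1532] Master attempted to send "
    "message to disconnected framework 3fe16547-091a-4b24-8646-250f489dcbb3-0001 "
    "(TwitterScheduler) at scheduler-b3334e43-7e13-4d51-820d-6b670513ac7a@127.0.0.1:8083"
)
print(advertises_loopback(line))
```

If this reports a loopback address, the scheduler most likely needs to be configured to advertise an IP that the Mesos master can reach.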