Say I have a task with the following dependency structure:

import luigi

class ParentTask(luigi.Task):
    def requires(self):
        # One ChildTask per entry in class_level_list
        return [ChildTask(classLevel=x) for x in self.class_level_list]

    def run(self):
        ...  # actual work elided

The child task runs fine on its own, and the parent correctly checks the finish status of all its child tasks. Yet as soon as the first child task finishes, the scheduler marks the parent task as finished, with the following message:

Scheduled 15 tasks of which:
* 3 ran successfully:
    - 1 CleanRecord(...)
    - 1 EstimateQuestionParameter(classLevel=6, qdt=2016-04-19, subject=english)
    - 1 GetLog(classLevel=6, qdt=2016-04-19, subject=english)
* 12 were left pending, among these:
    * 12 were left pending because of unknown reason:
        - 5 EstimateQuestionParameter(classLevel=1...5, qdt=2016-04-19, subject=english)
        - 5 GetLog(pool=None, classLevel=1...5, qdt=2016-04-19, subject=english)
        - 1 UpdateQuestionParameter(qdt=2016-04-19, lastQdt=2016-03-23, subject=english, isInit=False)
        - 1 UpdateQuestionParameterBuffer(qdt=2016-04-19, subject=english, src_table=edw.edw_behavior_question_record_exam_new)

This progress looks :) because there were no failed tasks or missing external dependencies
Junchen
  • never saw this error happen... I think it'll be quite hard to know what's going on without seeing the code you're running – matagus Apr 27 '16 at 20:51
  • Do you have a suspect? I cannot post all the src code but can construct some pseudo-code that is representative of the task. – Junchen Apr 28 '16 at 01:55
  • @Junchen Please post the relevant code or pseudo-code. – Yogesh Yadav May 04 '16 at 01:44
  • Same thing just happened to me, did you ever find out the cause? – Mauricio Scheffer May 23 '16 at 12:01
  • In my case, it turned out to be a worker that got disconnected (i.e. stopped responding to pings). – Mauricio Scheffer May 23 '16 at 13:57
  • @Junchen when do you define self.class_level_list? For me I had erratic behavior when changing variables in the run method of a task and then using that to influence outputs or requirements. Is this a property? – thegeebe Jun 10 '16 at 18:32
  • @thegeebe, the class_level_list is static. – Junchen Jun 13 '16 at 04:58
  • @Mauricio Scheffer, I only have one worker. – Junchen Jun 13 '16 at 04:59
  • I also got this problem, just before it quits it shows an INFO log containing this: `Worker Worker(salt=02346543, workers=2, host=ip-x-x-x-x, username=hadoop, pid=31121) was stopped. Shutting down Keep-Alive thread` – arno_v Jun 13 '16 at 12:07

1 Answer

I think this happens because your worker gets disconnected from the scheduler. The worker's heartbeats don't reach the scheduler, either because of a network partition or, more likely, because they are never sent due to this issue.

You have two options to work around the problem:

  • Increase the `worker-disconnect-delay` setting (`[scheduler]` section of the config, default 60s)
  • Use more than one worker for your job, e.g. `--workers 2` (this helps when the heartbeats are never sent, the latter cause)
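
As a rough sketch of the first option, the snippet below generates the text of a `luigi.cfg` with a longer disconnect delay. The value 300 and the helper name `build_luigi_cfg` are hypothetical, chosen only for illustration. The option is spelled with dashes as above; newer Luigi versions may prefer the underscore spelling `worker_disconnect_delay`, so check the docs for your version.

```python
import configparser
import io


def build_luigi_cfg(disconnect_delay_seconds: int = 300) -> str:
    """Return luigi.cfg text that raises the scheduler's worker disconnect delay.

    300 is a hypothetical value; pick one that fits your longest task.
    """
    cfg = configparser.ConfigParser()
    # Option name as spelled in the answer (dashes).
    cfg["scheduler"] = {"worker-disconnect-delay": str(disconnect_delay_seconds)}
    buf = io.StringIO()
    cfg.write(buf)
    return buf.getvalue()


cfg_text = build_luigi_cfg()
print(cfg_text)
```

The second option needs no config change: it is just the `--workers 2` flag on the command line.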
Jakub Kukul
  • I added `keep_alive` to the `[worker]` section of the luigi config file, and the problem has not reappeared. I think your solution is right. – Junchen Jul 27 '17 at 03:19
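
For reference, the `keep_alive` setting from the last comment might look like the fragment embedded below. This is only a sketch: the value `true` is an assumption, since the comment does not state it, and Python's standard `configparser` is used here merely to check that the fragment is well-formed, not because Luigi loads its config this way.

```python
import configparser

# Hypothetical luigi.cfg contents; only the keep_alive option in the
# [worker] section comes from the comment. The value "true" is assumed.
LUIGI_CFG = """
[worker]
keep_alive = true
"""

parser = configparser.ConfigParser()
parser.read_string(LUIGI_CFG)

# Read the flag back as a boolean to confirm the fragment parses cleanly.
keep_alive = parser.getboolean("worker", "keep_alive")
print(keep_alive)
```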