2

We run Quartz.NET for a variety of jobs on different schedules ranging from every 30 seconds to once a week.

On reviewing our internal logging, we find that some jobs have ceased running for no discernible reason, even while others continued. As an example, our every-30-second job failed at a given time, while a different every-10-minute job continued for a few hours, then also failed. A daily task ceased later on.

We enabled Quartz logging and found the following.

LOG OF PREVIOUS FIRE, WHICH WAS SUCCESSFUL:

2014-09-19 08:20:00.0130 DEBUG Producing instance of Job 'DEFAULT.Scheduled task #5', class=TaskRunner
2014-09-19 08:20:00.0130 DEBUG Calling Execute on job DEFAULT.Scheduled task #5 
2014-09-19 08:20:00.0130 DEBUG Batch acquisition of 1 triggers
2014-09-19 08:20:00.8710 DEBUG Trigger instruction : NoInstruction 
2014-09-19 08:20:00.8710 DEBUG Batch acquisition of 1 triggers

LOG OF FIRST FAILURE:

2014-09-19 08:30:00.0046 DEBUG Producing instance of Job 'DEFAULT.Scheduled task #5', class=TaskRunner
2014-09-19 08:30:00.0046 DEBUG Calling Execute on job DEFAULT.Scheduled task #5
2014-09-19 08:30:00.0046 DEBUG Batch acquisition of 1 triggers

After this, this particular job never ran again until we restarted the service. There is no indication that any of our code was run on this particular instance as we do our own logging internally, which had not occurred at that time.

Our misfire handling is configured for every job as follows:

        ... TriggerBuilder.Create()
            .WithCronSchedule( task.CronSchedule, x => x.WithMisfireHandlingInstructionDoNothing())
            .Build();

I understand the "DoNothing" instruction tells it to skip this fire and continue with the schedule. Therefore if a misfire occurred I would expect it to fire again on the next fire time.

1) Why are our Quartz jobs failing at random times?

2) What can we do to investigate further?

skaffman
  • 398,947
  • 96
  • 818
  • 769
Brendan Hill
  • 3,406
  • 4
  • 32
  • 61

3 Answers3

5

Use the source, Luke! Quartz is open source. So dig in there!

Searching for that log message ("Calling execute") leads me to this code (latest source) in JobRunShell.cs:

// Execute the job
try
{
   if (log.IsDebugEnabled)
   {
        log.Debug("Calling Execute on job " + jobDetail.Key);
   }
            job.Execute(jec);
    endTime = SystemTime.UtcNow();
}
catch (JobExecutionException jee)
{
    endTime = SystemTime.UtcNow();
  jobExEx = jee;
  log.Info(string.Format(CultureInfo.InvariantCulture, "Job {0} threw a JobExecutionException: ", jobDetail.Key), jobExEx);
}
catch (Exception e)
{
    endTime = SystemTime.UtcNow();
    log.Error(string.Format(CultureInfo.InvariantCulture, "Job {0} threw an unhandled Exception: ", jobDetail.Key), e);
    SchedulerException se = new SchedulerException("Job threw an unhandled exception.", e);
    qs.NotifySchedulerListenersError(
        string.Format(CultureInfo.InvariantCulture, "Job ({0} threw an exception.", jec.JobDetail.Key), se);
    jobExEx = new JobExecutionException(se, false);
}
jec.JobRunTime = endTime - startTime;

// notify all job listeners
if (!NotifyJobListenersComplete(jec, jobExEx))
{
    break;
}
instCode = SchedulerInstruction.NoInstruction;
// update the trigger
try
{
    instCode = trigger.ExecutionComplete(jec, jobExEx);
    if (log.IsDebugEnabled)
    {
        log.Debug(string.Format(CultureInfo.InvariantCulture, "Trigger instruction : {0}", instCode));
     }
}
catch (Exception e)
{
    // If this happens, there's a bug in the trigger...
    SchedulerException se = new SchedulerException("Trigger threw an unhandled exception.", e);
    qs.NotifySchedulerListenersError("Please report this error to the Quartz developers.", se);
 }

So, looking at your output, we see the log message on line 6 in the code sample above. However, we never see the trigger cleanup (3rd to last line) output.

Notice that in every catch statement in that code, we are creating scheduler exceptions and notifying listeners?

Well, the choice is clear: add a new SchedulerListener to the Quartz scheduler when you new it up (implement ISchedulerListener with your own class), and then listen for scheduler exceptions, and log the errors. SchedulerException wraps the original exception, so you should have access to an underlying error in there.

BTW... all of that code snippet is in another try block... but without a catch block. If you still can't find what's going on, then adding a global catch on this function and doing what they do elsewhere in the file (wrap the exception in SchedulerException and notify listeners).

Clever Neologism
  • 1,322
  • 8
  • 9
  • That's a lot of help, we'll find out soon whether this reveals the answer. – Brendan Hill Oct 02 '14 at 23:35
  • We haven't isolated the issue yet however this post has provided us with the most useful diagnostic tools to continue our investigation, so the bounty goes here. – Brendan Hill Oct 06 '14 at 04:06
2

I understand the "DoNothing" instruction tells it to skip this fire and continue with the schedule. Therefore if a misfire occurred I would expect it to fire again on the next fire time.

That is correct. The current execution will be discraded. However it should continue with the cron schedule infinitely, and create a new execution at the given time, even if there are unhandled exceptions in previous executions.

Why are our Quartz jobs failing at random times?

Some critical reasons why the jobs are misfiring could be that there are not enough worker threads to handle the jobs (this can be configured) or the scheduler itself is down. The scheduler jobs could also misfire because they are set to start at a past time.

What can we do to investigate further?

I would suspect you don't have enough worker threads configured to handle the jobs. You should also make sure you are not blocking the worker threads for longer periods of time since that could cause the worker thread pool to be exhausted and cause misfires.

In addition, if your cron schedule is not restrictive, you could try to set the misfire setting to WithMisfireHandlingInstructionFireAndProceed to continue firing the execution until it gets through.

Faris Zacina
  • 14,056
  • 7
  • 62
  • 75
0

In the past I have the same problems and as you. I don't have any clue of the origin of the error. Anyway I fix it taking the following measures:

1-) Reduce the Job work to the minimun. I use MSMQ to enqueue the JobExecution and the QUARTZ schedule only put a new message to the queue (This fix all the problem for me)

2-) You can avoid thing like Thread Sync inside the Job.

3-) Other fix that you can avoid is the QUARTZ version update. I start to face this problems when upgrade from 2.1.2 to 2.2.1

Hope this help you!

Miguel
  • 3,786
  • 2
  • 19
  • 32