
I'm developing scheduled services.

The application is developed using JDK 1.6, Spring Framework 2.5.6 and Quartz 1.8.4 to schedule jobs.

I have two clustered servers running WebLogic Server 10.3.5.

Sometimes the Quartz scheduling seems to go haywire. Analyzing the conditions under which this happens, there appears to be a clock "desynchronization" of more than a second between the clustered servers. However, this desynchronization is not always due to the system time of the servers: sometimes, even when the machines' clocks are synchronized, there seems to be a small "delay" introduced by the JVM.

Has anyone encountered the same problem? Is there a way to solve it?

Thanks in advance

Claudio Query

5 Answers


When using a JDBC-JobStore on Oracle with Quartz 2.2.1, I experienced the same problem.

In my case, I was running Quartz on a single node. However, I noticed the database machine was not time synchronized with the node running Quartz.

I activated ntpd on both the database machine and the machine running Quartz, and the problem went away after a few minutes.
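
If you suspect this kind of skew, a quick way to check it from the application side is to compare the JVM clock with the database clock. Below is a minimal sketch, assuming the Oracle JDBC driver is on the classpath; the connection URL and credentials are placeholders:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;
    import java.sql.Timestamp;

    public class DbClockSkewCheck {
        public static void main(String[] args) throws Exception {
            // Placeholder connection details; replace with your own.
            String url = "jdbc:oracle:thin:@dbhost:1521:ORCL";
            Connection con = DriverManager.getConnection(url, "quartz", "secret");
            try {
                Statement st = con.createStatement();
                ResultSet rs = st.executeQuery("SELECT SYSTIMESTAMP FROM DUAL");
                rs.next();
                Timestamp dbTime = rs.getTimestamp(1);
                // Positive value: the application-server clock is ahead of the DB clock.
                long skewMillis = System.currentTimeMillis() - dbTime.getTime();
                System.out.println("App server vs. DB clock skew: " + skewMillis + " ms");
            } finally {
                con.close();
            }
        }
    }

Keep in mind that the network round trip is included in the measurement, so only skews well above the query latency are meaningful.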

JasonMArcher
ercasta

The issue most often happens because the clocks of the cluster nodes are not synchronized. However, it may also be caused by an unstable connection between the application and the database. Such connection problems can stem from network issues (if the application server and the DB server are on different machines) or from performance problems (the DB server processing requests very slowly for some reason).

In such cases the chances of hitting this issue can be reduced by increasing the org.quartz.jobStore.clusterCheckinInterval value.
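
For example, here is a sketch of raising the check-in interval programmatically; the same keys can go into quartz.properties or, with Spring, into the SchedulerFactoryBean quartzProperties map. The 30-second value is only an illustration, and the data-source properties are omitted:

    import java.util.Properties;

    import org.quartz.Scheduler;
    import org.quartz.impl.StdSchedulerFactory;

    public class ClusteredSchedulerConfig {
        public static Scheduler buildScheduler() throws Exception {
            Properties props = new Properties();
            props.setProperty("org.quartz.scheduler.instanceId", "AUTO");
            props.setProperty("org.quartz.jobStore.class",
                    "org.quartz.impl.jdbcjobstore.JobStoreTX");
            props.setProperty("org.quartz.jobStore.isClustered", "true");
            // A larger check-in interval makes the cluster more tolerant of
            // clock skew and slow DB round trips (illustrative value).
            props.setProperty("org.quartz.jobStore.clusterCheckinInterval", "30000");
            // ... data source and driver-delegate properties omitted ...

            return new StdSchedulerFactory(props).getScheduler();
        }
    }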

Cloud

This issue is nearly always attributable to clock skew. Even if you think you have NTPd set up properly, a couple of things can still happen:

  • We thought we had NTPd working (and it was configured properly), but on AWS the firewalls were blocking the NTP port: UDP 123. Again, that's UDP, not TCP.
  • If you don't sync often enough you will accumulate clock skew. The accuracy of the timers on many motherboards is notoriously wonky, so over time (days) you suddenly get these Quartz errors. Skew of more than 5 minutes also produces plenty of security errors, Kerberos failures for example.

So the moral of this story is: sync with NTPd, but do it often, and verify that it is actually working.
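
One way to verify it from inside the JVM is to query your NTP server directly and log the offset. This is a rough sketch using Apache Commons Net (an extra dependency, not mentioned above); the server name is a placeholder:

    import java.net.InetAddress;

    import org.apache.commons.net.ntp.NTPUDPClient;
    import org.apache.commons.net.ntp.TimeInfo;

    public class NtpOffsetCheck {
        public static void main(String[] args) throws Exception {
            NTPUDPClient client = new NTPUDPClient();
            client.setDefaultTimeout(5000); // fail fast if UDP 123 is blocked
            try {
                // Placeholder; point this at your internal NTP server.
                TimeInfo info = client.getTime(InetAddress.getByName("pool.ntp.org"));
                info.computeDetails();
                // Offset between the local clock and the NTP server, in milliseconds.
                System.out.println("Clock offset: " + info.getOffset() + " ms");
            } finally {
                client.close();
            }
        }
    }

If the query times out, that is a hint that UDP 123 traffic is being blocked, as described above.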

sagneta

I faced the same issue. First, check the logs and the time synchronization of your cluster nodes.

The marker is messages like these in the logs:

08-02-2018 17:13:49.926 [QuartzScheduler_schedulerService-pc6061518092456074_ClusterManager] INFO  o.s.s.quartz.LocalDataSourceJobStore - ClusterManager: detected 1 failed or restarted instances.

08-02-2018 17:14:06.137 [QuartzScheduler_schedulerService-pc6061518092765988_ClusterManager] WARN  o.s.s.quartz.LocalDataSourceJobStore - This scheduler instance (pc6061518092765988) is still active but was recovered by another instance in the cluster.

When the first node observes that the second node has been absent for longer than org.quartz.jobStore.clusterCheckinInterval, it unregisters the second node from the cluster and removes all of its triggers.

Take a look at the synchronization algorithm: org.quartz.impl.jdbcjobstore.JobStoreSupport.ClusterManager#run

It may happen when the 'check-in' takes a long time.

My solution is to override org.quartz.impl.jdbcjobstore.JobStoreSupport#calcFailedIfAfter. The hardcoded value 7500L looks like a grace period; I replaced it with a parameter (see the sketch below).

Note: if you are using SchedulerFactoryBean, be careful when registering a new JobStoreSupport subclass, because Spring forcibly registers its own store, org.springframework.scheduling.quartz.LocalDataSourceJobStore.
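
A minimal sketch of that override, assuming a JobStoreTX-based store. The class name, the gracePeriod property and the simplified formula below are mine, not part of Quartz; copy the exact original formula from the JobStoreSupport source of your Quartz version:

    import org.quartz.impl.jdbcjobstore.JobStoreTX;
    import org.quartz.impl.jdbcjobstore.SchedulerStateRecord;

    // Hypothetical subclass that turns the hardcoded 7500 ms grace period
    // into a configurable property, e.g. registered via
    //   org.quartz.jobStore.class = com.example.GracePeriodJobStore
    //   org.quartz.jobStore.gracePeriod = 15000
    public class GracePeriodJobStore extends JobStoreTX {

        private long gracePeriod = 7500L; // keep Quartz's hardcoded value as the default

        public void setGracePeriod(long gracePeriod) {
            this.gracePeriod = gracePeriod;
        }

        @Override
        protected long calcFailedIfAfter(SchedulerStateRecord rec) {
            // Simplified: a node is considered failed only after its check-in
            // interval plus the grace period has elapsed since its last check-in.
            return rec.getCheckinTimestamp() + rec.getCheckinInterval() + gracePeriod;
        }
    }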

Stephen Kennedy
Andrei Kovrov

I am using Quartz 2.2.1 and I notice strange behavior whenever a cluster recovery occurs.

For instance, even though the machines have been synchronized with the ntpdate service, I get this message on cluster instance recovery:

org.quartz.impl.jdbcjobstore.JobStoreSupport findFailedInstances “This scheduler instance () is still active but was recovered by another instance in the cluster. This may cause inconsistent behavior”.

One suggested solution is: "Synchronize the time on all cluster nodes and then restart the cluster. The messages should no longer appear in the log."

As every machine is synchronized, maybe this "delay" is introduced by the JVM? I don't know... :(

aloplop85