12

I'm running tasks periodically and to provide flexibility for the intervals, the next timeout is calculated at the end of each task, converted to milliseconds from Instant.now(), and scheduled using ScheduledExecutorService#schedule.

This code is generally working fine (blue curve on the left), but other days not so good.

enter image description here

It appears to me that things sometimes go bad at startup (machines are restarted every night) and though that the program should, and does, correct itself ScheduledExecutorService#schedule does not recover and scheduled tasks run late all the time. It appears that a full JVM restart is the only solution.

My initial thoughts was that this was a bug and depending of the time of the machine startup, things could go wrong. But the following log output demonstrates that the problem is related to my usage of ScheduledExecutorService#schedule :

// Log time in GMT+2, other times are in GMT
// The following lines are written following system startup (all times are correct)
08 juin 00:08:49.993 [main] WARN  com.pgscada.webdyn.Webdyn - Scheduling next webdyn service time. Currently 2018-06-07T22:08:49.993Z, last connection null
08 juin 00:08:50.586 [main] INFO  com.pgscada.webdyn.Webdyn - The next data sample at 2018-06-07T22:10:00Z and the next FTP connection at 2018-06-07T22:30:00Z
08 juin 00:08:50.586 [main] WARN  com.pgscada.webdyn.Webdyn - Completed webdyn schedule in 9ms, next execution at 2018-06-07T22:10:00Z (in 69414 ms) will run as data-sample
// So we are expecting the next execution to occur at 00:10:00 (or in 69.4 seconds)
// Except that it runs at 00:11:21
08 juin 00:11:21.206 [pool-1-thread-4] INFO  com.pgscada.webdyn.Webdyn - Executing Webdyn service, isDataSample=true, isFtpConnection=false, nextTimeout=2018-06-07T22:10:00Z, lastFtpConnection=null
// But thats OK because it should correct itself
08 juin 00:13:04.151 [pool-1-thread-4] WARN  com.pgscada.webdyn.Webdyn - Scheduling next webdyn service time. Currently 2018-06-07T22:10:00Z, last connection null
08 juin 00:13:04.167 [pool-1-thread-4] INFO  com.pgscada.webdyn.Webdyn - The next data sample at 2018-06-07T22:20:00Z and the next FTP connection at 2018-06-07T22:30:00Z
08 juin 00:13:04.167 [pool-1-thread-4] WARN  com.pgscada.webdyn.Webdyn - Completed webdyn schedule in 0ms, next execution at 2018-06-07T22:20:00Z (in 415833 ms) will run as data-sample
// So now we are expecting the next execution to occur at 00:20:00 (or in 415.8 seconds)
// But it runs at 00:28:06
08 juin 00:28:06.145 [pool-1-thread-4] INFO  com.pgscada.webdyn.Webdyn - Executing Webdyn service, isDataSample=true, isFtpConnection=false, nextTimeout=2018-06-07T22:20:00Z, lastFtpConnection=null

Below is actual production code of the scheduling function.

ScheduledExecutorService EXECUTORS = Executors.newScheduledThreadPool(10);


private void scheduleNextTimeout(Instant currentTime, Instant lastFtpConnection) {

    try {

        log.info("Scheduling next webdyn service time. Currently {}, last connection {}", currentTime, lastFtpConnection);

        // Parse config files first
        getConfigIni().parse();

        long time = System.nanoTime();
        final Instant earliestPossibleTimeout = Instant.now().plusSeconds(5); 

        Instant nextDataSample = nextTimeout(currentTime);

        if (nextDataSample.isBefore(earliestPossibleTimeout)) {
            final Instant oldTime = nextDataSample;
            nextDataSample = nextTimeout(earliestPossibleTimeout);
            log.warn("Next data sample was calculated to a time in the past '{}', resetting to a future time: {}", oldTime, nextDataSample);
        }

        Instant nextFtp = nextFtpConnection(currentTime, lastFtpConnection);

        if (nextFtp.isBefore(earliestPossibleTimeout)) {
            final Instant oldTime = nextFtp;
            nextFtp = nextFtpConnection(earliestPossibleTimeout, lastFtpConnection);
            log.warn("Next FTP connection was calculated to a time in the past '{}', resetting to a future time: {}", oldTime, nextFtp);
        }

        final boolean isFtpConnection = !nextDataSample.isBefore(nextFtp);
        final boolean isDataSample = !isFtpConnection || nextDataSample.equals(nextFtp);
        log.info("The next data sample at {} and the next FTP connection at {}", nextDataSample, nextFtp);

        final Instant nextTimeout = nextDataSample.isBefore(nextFtp) ? nextDataSample : nextFtp;
        final long millis = Duration.between(Instant.now(), nextTimeout).toMillis();
        EXECUTORS.schedule(() -> {
            log.info("Executing Webdyn service, isDataSample={}, isFtpConnection={}, nextTimeout={}, lastFtpConnection={}",
                    isDataSample, isFtpConnection, nextTimeout, lastFtpConnection);

            long tme = System.nanoTime();
            try {
                connect(isDataSample, isFtpConnection, nextTimeout, lastFtpConnection);
                log.warn("Completed webdyn service in {}s", (System.nanoTime() - tme) / 1000000);
            } catch (final Throwable ex) {
                log.error("Failed webdyn service after {}ms : {}", (System.nanoTime() - tme) / 1000000, ex.getMessage(), ex);
            } finally {
                scheduleNextTimeout(nextTimeout, isFtpConnection ? nextTimeout : lastFtpConnection);
            }
        }, millis, TimeUnit.MILLISECONDS);

        log.warn("Completed webdyn schedule in {}ms, next execution at {} (in {} ms) will run as {}",
                (System.nanoTime() - time) / 1000000, nextTimeout, millis, isFtpConnection ? "ftp-connection" : "data-sample");

    } catch (final Throwable ex) {
        log.error("Fatal error in webdyn schedule : {}", ex.getMessage(), ex);
    }
}
Caffeinated
  • 11,982
  • 40
  • 122
  • 216
klonq
  • 3,535
  • 4
  • 36
  • 58
  • So the scheduleNextTimeout method is executed from various threads (as each task calls it at its end to cause new scheduling), right? And it changes state of EXECUTORS by calling EXECUTORS.schedule. So you have shared resource (EXECUTORS) changed from multiple threads and if it's not thread safe, then weird things may occur. If the implementation supplied by Executors.newScheduledThreadPool(10) is not thread safe, you may get into trouble. So I'd try calling EXECUTORS.schedule from synchronized block. Maybe just make whole scheduleNextTimeout synchronized for start. Hope it helps. – Igand Jun 14 '18 at 09:00
  • Thanks I will try that. The function is executed from exactly 2 threads - the main thread (at system startup) and then by it's own thread (after each execution). I increased the thread pool to 10 so that I was sure that there would be a free thread at each timeout but it's unnecessary and it seems the problem is elsewhere. – klonq Jun 14 '18 at 10:04
  • Even those 2 threads are absolutely enough for this kind of problems ('ve been there, done that, tore my hair out) - it's because some changes made in one thread may not ever become visible to other thread, no matter how much later it runs, if synchronization is not done correctly. Even if there is only 1 thread accessing the shared resource at any given time, you'd still get concurency issues. So my bet is still on this kind of problem, but surely - I might be wrong. I'll post proper answer later, perhaps with some links, if my suggestion works. – Igand Jun 14 '18 at 10:16
  • Could you share the full application log? I didn't see any log about `"Completed webdyn service in {}s"`, any it would be great if you share your `nextFtpConnection` and `nextTimeout` function too – Mạnh Quyết Nguyễn Jun 16 '18 at 06:21
  • 1
    What do the graphs really mean? One is plotting kW against time, and the other W/m^2. What does this have to do with the question you are asking? – Stephen C Jun 17 '18 at 06:10
  • Thanks this issue is solved, it was a mutlithreading issue after all. I can't delete the question though because I put a bounty on it ... – klonq Jun 17 '18 at 12:38
  • 1
    This site really doen't work on people deleting solved problems :) I'll form a proper answer, I'll be glad if you accept it. – Igand Jun 17 '18 at 12:47

1 Answers1

6

As stated in my comment bellow the question, the issue here is that there is a shared, mutable and not-thread-safe resource (EXECUTORS atribute) altered by more than one thread. It is altered by main thread at start and whichever thread is used from the pool for task execution.

The thing to note is, that even when you have only one thread accessing a shared resource at a time (simply because there is only one task running at a time), you still need to think about concurency. It is because without synchronization Java Memory Model does not guarantee the changes made by one thread to be ever visible to other threads, no matter how much later they run.

So the solution would be making method scheduleNextTimeout synchronized, thus guaranteeing changes not to be kept local to executing thread and written to main memory.

You also could make a synchronized block (synchronized on "this") around the part, which makes access to shared resource, but since the system doesn't seem to be heavy duty one and the rest of code doesn't seem to take a long time, there is no need for that...

Here is very gist of it in nice and short article I once learned from, when confronted with this sort of issues for the first time :) https://www.cs.umd.edu/~pugh/java/memoryModel/jsr-133-faq.html#jsr133

I'm glad I could help.

Igand
  • 1,161
  • 1
  • 15
  • 25