Jenkins pipelines do not resume properly after Jenkins restart

Question

Issue Summary:

Jenkins LTS with the Durable Task plugin does not properly resume a pipeline job if the Jenkins service is restarted while the job is running.

This is a regression in Jenkins 2.3x and seems to coincide with the migration to systemd (it used to work perfectly fine in 2.2x).

Related Jenkins Issue Link: https://issues.jenkins.io/browse/JENKINS-69061

Steps to reproduce the issue:

Start with a single-node Jenkins host with the Durable Task plugin installed.
Start a pipeline job on the host. A sample pipeline is included at the bottom of this question.
While the job is running, restart the Jenkins service with "service jenkins restart" (or restart it using jenkins-cli.jar).
After Jenkins starts, the job attempts to resume but eventually fails (log below).

Resuming build at Tue Jul 19 23:26:56 UTC 2022 after Jenkins restart
Waiting to resume part of test-job #5: Waiting for next available executor
Ready to run at Tue Jul 19 23:27:01 UTC 2022
wrapper script does not seem to be touching the log file in /data/jenkins_home/workspace/test-job@tmp/durable-b0167617
(JENKINS-48300: if on an extremely laggy filesystem, consider -Dorg.jenkinsci.plugins.durabletask.BourneShellScript.HEARTBEAT_CHECK_INTERVAL=86400)

After the above message is logged, the job goes into a "failed" state.
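The message refers to a heartbeat: going by the wording of the log and the JENKINS-48300 hint, Jenkins expects the wrapper script to keep updating a log file under the durable-* directory. Below is a minimal sketch for watching that file's age by hand. `file_age_seconds` is a hypothetical helper (not part of Jenkins), and it assumes GNU `stat` and `date` as found on most Linux hosts.

```shell
#!/bin/sh
# Hypothetical helper (not part of Jenkins): print how many seconds ago
# a file was last modified. Assumes GNU coreutils (stat -c %Y).
file_age_seconds() {
    now=$(date +%s) || return 1
    mtime=$(stat -c %Y "$1") || return 1
    echo $((now - mtime))
}
```

Point it at the log file inside the directory named in the message (I am assuming the file is called jenkins-log.txt; check the directory listing). If the number keeps growing while the step should be running, the wrapper is most likely no longer alive.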

Manually touching/writing to the mentioned log file does not resolve the problem.
The cause is neither the filesystem nor available memory, as suggested in related tickets and posts. (This is a regression in recent versions of Jenkins.)
There are no available plugin updates (fully up to date).
This seemed to start when we upgraded to version 2.332, which also included the migration to systemd. So it is possible that restarting the service through systemd (rather than the old init system used prior to 2.332) is breaking the durable tasks.
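One way to probe the systemd theory is to check whether the step's child process (here the `sleep 60`) survives the service restart. A rough sketch, assuming a Linux host; `process_alive` is a hypothetical helper, not a Jenkins tool.

```shell
#!/bin/sh
# Hypothetical helper: succeed if a process with the given PID exists.
# kill -0 sends no signal and only checks the PID; the /proc check
# covers processes owned by another user (Linux only), where kill -0
# is refused with a permission error.
process_alive() {
    kill -0 "$1" 2>/dev/null || [ -d "/proc/$1" ]
}
```

Note the PID of the running `sleep 60` before the restart (for example with `pgrep -f 'sleep 60'`), restart the service, then run `process_alive "$pid" && echo survived || echo killed`. A "killed" result would be consistent with the restart taking the job's processes down with it, though it does not prove that systemd is the cause.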

This issue has been filed on the Jenkins official tracker: https://issues.jenkins.io/browse/JENKINS-69061

However, nobody has responded to that report in over 2 months, so I'm asking here whether anyone has an idea what the issue could be or knows of potential workarounds, and to increase visibility of the problem.

Example minimal/simple pipeline used in testing this issue:

pipeline {
  agent any

  stages {
    stage("Sleep for 60 seconds") {
      steps {
        echo "Go restart jenkins service now and see that this job won't resume"
        // Long-running shell step; restart Jenkins while this is in progress
        sh "sleep 60"
        echo "The job will never get this far"
      }
    }
  }
}

Have you tried the suggested change for `-Dorg.jenkinsci.plugins.durabletask.BourneShellScript.HEARTBEAT_CHECK_INTERVAL`? For testing, I would set it to 15 minutes and then to 1 hour. This option needs to be passed in the JVM command-line arguments. See: https://stackoverflow.com/a/50100551/290087 - Mircea Vutcovici, Oct 29 '22 at 12:35
Yes, and it doesn't matter how long the heartbeat is set for. The only solution I've found is to not use the built-in agent: you must spawn a custom agent even on a single-node install, which makes for very annoying dev environments. The whole thing broke completely in 2.3x - Rino Bino, Oct 31 '22 at 19:49
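For reference, on a systemd-based install the heartbeat option discussed in these comments would typically reach the JVM through a drop-in for the service unit. The sketch below only prints such a drop-in. `render_heartbeat_dropin` is a hypothetical helper, and it assumes that the packaged jenkins.service reads `JAVA_OPTS` and that the interval is given in seconds. As noted above, changing the value did not fix the resume failure.

```shell
#!/bin/sh
# Hypothetical helper: print a systemd drop-in that sets the heartbeat
# interval via JAVA_OPTS. Assumes the packaged jenkins.service reads
# JAVA_OPTS and that the interval is given in seconds.
render_heartbeat_dropin() {
    interval=${1:-900}   # 900 s = 15 minutes
    prop=org.jenkinsci.plugins.durabletask.BourneShellScript.HEARTBEAT_CHECK_INTERVAL
    printf '[Service]\n'
    printf 'Environment="JAVA_OPTS=-Djava.awt.headless=true -D%s=%s"\n' "$prop" "$interval"
}
```

The output could be written to a drop-in such as /etc/systemd/system/jenkins.service.d/override.conf (or pasted into `systemctl edit jenkins`), followed by `systemctl daemon-reload` and a service restart. Note that this replaces any `JAVA_OPTS` value already set for the unit.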

score 0 · Answer 1 · answered Nov 02 '22 at 16:07

I'm answering my own question with the information I have; however, this isn't really a fix for the regression itself.

Cause Summary:

The Jenkins team seems to have broken this functionality in 2.332.1. My theory is that it was due to the init system migration documented here: https://www.jenkins.io/blog/2022/03/25/systemd-migration/

Workaround (Solution):

This issue seems to only affect the "Built-In" build agent that is included on every Jenkins controller. You need to disable the built-in agent and add your own custom agent.

Create an agent:
- Do this even for a single node standalone install (yes, annoying for dev environments)
- Regular Java agents work. However, I had a hard time getting the JNLP Java agent to bypass the $JENKINS_URL value, even after trying all sorts of workarounds, and as far as I know it is impossible to have JNLP simply connect to "localhost". So I could not get this working on my production host, which has a reverse proxy and 2FA in front of the primary $JENKINS_URL.
  - https://www.jenkins.io/doc/book/using/using-agents/
  - Jenkins agent is not honoring hudson.TcpSlaveAgentListener.hostName
- An SSH agent also works, and is what I eventually used. The controller basically just SSHes into itself: https://plugins.jenkins.io/ssh-slaves/

Disable the built-in agent:
- Once the new agent is online and working, you need to either configure your jobs to use the new agent or (easier) set the number of executors on the Built-In agent to 0.
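
For the SSH route, the account the agent runs under needs a key it can use to log in to its own host. A minimal sketch of that preparation step, assuming OpenSSH's `ssh-keygen` is installed; `authorize_local_agent_key` and the key file name are hypothetical, chosen for illustration only.

```shell
#!/bin/sh
# Hypothetical helper: create a dedicated key pair in the given .ssh
# directory and authorize it for the same account, so the controller
# can open an SSH session to its own host.
authorize_local_agent_key() {
    ssh_dir=$1
    key=$ssh_dir/jenkins_agent_key          # illustrative file name
    mkdir -p "$ssh_dir" && chmod 700 "$ssh_dir" || return 1
    # Generate the key only once; -N '' means no passphrase.
    [ -f "$key" ] || ssh-keygen -q -t ed25519 -N '' -C jenkins-agent -f "$key" || return 1
    touch "$ssh_dir/authorized_keys" && chmod 600 "$ssh_dir/authorized_keys" || return 1
    # Append the public key unless that exact line is already present.
    grep -qxF "$(cat "$key.pub")" "$ssh_dir/authorized_keys" ||
        cat "$key.pub" >> "$ssh_dir/authorized_keys"
}
```

Run it as the account the agent should use, passing that account's .ssh directory. The private key then goes into a Jenkins credential, and the new agent is configured through the SSH plugin linked above with the controller's own host as the target. Running the helper twice does not duplicate the authorized_keys entry.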
