
Jenkins Controller reports: "Unable to create live FilePath for i-xxxxxxxxxxxxx" and Agent is marked Offline

Googling this error indicates that it is a problem with the communication paths between Controller and Agent, but what?

Background:

- Jenkins Controller: v2.332.1, Java 11, 64-bit OS, inside a Docker container.
- Jenkins Agents: Swarm-Client jar downloaded from the Controller on startup (Swarm Plugin version 3.32), Java 11, 64-bit OS, inside a Docker container.

Agents and Controller are hosted on separate EC2 instances in AWS with Security Group permissions on the relevant ports.

The instance starts up, runs Cloud-Init, downloads swarm-client.jar from the Jenkins Controller, and then runs it with the parameters required to connect to the Controller. I mention this to avoid the "are you using the correct version" comments :-)
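The startup sequence described above can be sketched roughly as follows. This is a minimal illustration, not the actual Cloud-Init script: the controller URL is a placeholder, and the `/swarm/swarm-client.jar` download path and the `-url` and `-webSocket` options are assumptions based on my reading of the Swarm plugin documentation, so check them against the help output of the client version you run.

```python
from typing import List
from urllib.parse import urljoin


def swarm_client_url(controller_url: str) -> str:
    """Build the download URL for the Swarm client jar served by the Controller.

    Assumption: the Swarm plugin serves the jar at <controller>/swarm/swarm-client.jar.
    """
    base = controller_url.rstrip("/") + "/"
    return urljoin(base, "swarm/swarm-client.jar")


def swarm_command(controller_url: str,
                  jar_path: str = "swarm-client.jar",
                  use_websocket: bool = False) -> List[str]:
    """Build the command line that starts the Swarm client.

    Assumption: `-url` and `-webSocket` are the option names of the client
    version in use; real setups usually add credentials, labels, etc.
    """
    cmd = ["java", "-jar", jar_path, "-url", controller_url]
    if use_websocket:
        cmd.append("-webSocket")
    return cmd


if __name__ == "__main__":
    controller = "https://jenkins.example.com"  # hypothetical Controller URL
    print(swarm_client_url(controller))
    print(" ".join(swarm_command(controller, use_websocket=True)))
```

Downloading the jar from the Controller on every boot is what keeps the Swarm client matched to the installed plugin version, although not necessarily to the Controller's own Remoting version.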

The Agent connects, comes fully online, and gets busy servicing the pending job queue.

Then, after an indeterminate amount of time, the Agent is marked Offline. Some jobs run for more than 24 hours without failing; other jobs last only minutes and sometimes fail.

Some of the things I have tried:

The Swarm Client jar can either use WebSockets and connect to the FQDN of the Jenkins Controller, or use the JNLP protocol to connect to the IP and the dedicated agent connection port (a fixed value on the Controller). Similar behavior is seen with either protocol.

- Opening all the AWS Security Groups: in case there was another port, not mentioned, that needed to be open.
- Bypassing the AWS load balancer: Agent connects directly to Controller IP:PORT via JNLP.
- Matching versions: Swarm Client downloaded from the Controller.
- Updated versions: Jenkins 2.319.3, 2.332.1.
- Normalized Java environments: Java 11, 64-bit OS.
- Enabled logging on the Agents: periodic communication happens and then stops after a while, without obvious reason.
- Increased Controller instance size: m5.xlarge -> m5.2xlarge.
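For the Security Group and load balancer checks, a quick TCP reachability test run from the Agent instance can rule out a blocked port. The sketch below is only an illustration: the host name and port are placeholders, not values from this setup, and should be replaced with the Controller address and its configured agent port.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port can be opened within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False


if __name__ == "__main__":
    controller_host = "jenkins.example.com"  # hypothetical Controller address
    agent_port = 50000                       # hypothetical fixed agent port
    print("reachable" if can_connect(controller_host, agent_port) else "unreachable")
```

A successful connection only shows that the port is open at that moment. It says nothing about a connection that is established and then drops later, which is the behavior described here.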

edwardTew
  • So, it turns out that the LTS Jenkins is using a different version of the Java "Remoting" library than the latest version of the Swarm-Client plugin. From https://github.com/jenkinsci/swarm-plugin/releases/tag/swarm-plugin-3.31: `Bump Remoting from 4.11.2 to 4.13 (#415, #405) @dependabot`. From https://www.jenkins.io/changelog-stable/, What's new in 2.332.1 (2022-03-09): `Update remoting from 4.11 to 4.12 to allow Java web start agents to connect (regression in 2.318). (pull 5983, issue 67000, Remoting 4.11.2 changelog, Remoting 4.12 changelog)` – edwardTew Apr 05 '22 at 23:36

3 Answers


Bumping Jenkins up to a non-LTS version made the connections more stable. Jenkins 2.341 and Swarm-Client version 3.32 both use Remoting version 4.13.
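The version comparison behind this workaround can be written as a small check. This is a sketch under simple assumptions: the Remoting versions are supplied by hand as dotted numeric strings (for example, taken from the changelogs), and the function names are made up for illustration. It does not query Jenkins or the Swarm client.

```python
from typing import Tuple


def parse_version(text: str) -> Tuple[int, ...]:
    """Parse a dotted numeric version such as '4.11.2' into a tuple of ints.

    Trailing zeros are dropped so that '4.12' and '4.12.0' compare equal.
    """
    parts = [int(part) for part in text.strip().split(".")]
    while len(parts) > 1 and parts[-1] == 0:
        parts.pop()
    return tuple(parts)


def remoting_mismatch(controller_version: str, agent_version: str) -> bool:
    """Return True when Controller and Agent report different Remoting versions."""
    return parse_version(controller_version) != parse_version(agent_version)


if __name__ == "__main__":
    # Jenkins LTS 2.332.1 updated Remoting to 4.12; Swarm plugin 3.31 bumped it to 4.13.
    print(remoting_mismatch("4.12", "4.13"))  # LTS Controller vs. Swarm client
    print(remoting_mismatch("4.13", "4.13"))  # Jenkins 2.341 vs. Swarm client 3.32
```

A mismatch does not prove that the two sides are incompatible. In this case it was simply the difference that went away when the connections became more stable.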

Now, while I am not particularly happy about running a non-LTS version of Jenkins, I am pleased to have found a workaround.

Response times of the instances are better, too.

edwardTew
  • https://issues.jenkins.io/browse/JENKINS-68122 This was causing the Jenkins agents to be marked as `Offline` because they did not respond to the ping thread. Disabling the Client Ping thread allowed the agents to remain online for longer, but eventually the Jenkins Controller got fed up and marked them `Offline`. When this happened, our home-grown Autoscaling Group manager script was able to signal AWS to terminate the instance. Updating to **Jenkins version 2.344** has rectified this error. – edwardTew May 05 '22 at 21:06

Fixed by upgrading to Jenkins 2.344

edwardTew

I have also struggled with this issue. I am adding details here so that others don't have to struggle.

Here is everything I tried. Everything was running fine when we had JDK 8 on both master and slave. We then added code to use JDK 11 on both, and I replaced the Jenkins EC2 instance with a new one with the help of an ASG. The issue appeared, so we reverted, but the issue remained. Jenkins was showing a warning that says to move to JDK 11, as if something were deprecated, and that a new version of Jenkins was available. Since the Jenkins EC2 instance had been terminated and a new one had come up, I assumed things might have been updated in Jenkins along the way, so trying the newer versions looked like some hope to me.

- Going to Jenkins 2.344 with JDK 8: same issue. Moving to other Jenkins versions didn't help either, and I lost hope.
- Using the biggest EC2 instance type for the slave: didn't help.
- Checking htop on the slave: didn't help.
- Restarting the Jenkins master: didn't help.
- Changing the remote directory for the slave, as mentioned on Stack Overflow: didn't help.
- Increasing the timeout to 20 minutes in the slave setup: didn't help.
- Adding the command `sudo yum -y update --security` to the init script of the node in the Jenkins EC2 plugin: didn't help.
- Trying a JDK 11 image, a JDK 8 image, and a new Jenkins version JDK 8 image: the issue was the same in all of them.

What finally solved the issue was moving to an older version of Jenkins (2.330 with JDK 8): https://hub.docker.com/layers/jenkins/jenkins/jenkins/2.330-jdk8/images/sha256-97fcb[…]17da34f0d07c021ab57083ee8c77dc4b21281d3498137?context=explore