I've been trying to get our CI job in Jenkins to run on spot instances in EC2 (using the Amazon EC2 plugin), and I'm having trouble figuring out how to retry consistently when they get interrupted. The test run is parallelized across several Jenkins nodes that run on EC2 instances. This is the relevant script for the pipeline:
    for (int i = 0; i < numNodes; i++) {
        int index = i
        def nodeDisplayName = "node_${i.toString().padLeft(2, '0')}"
        // Per-branch attempt counter, kept in env
        env["NODE_${index}_RETRY_COUNT"] = 0
        nodes[nodeDisplayName] = {
            retry(2) {
                timeout(time: 90, unit: 'MINUTES') {
                    int retryCount = env["NODE_${index}_RETRY_COUNT"]
                    // First attempt runs on a spot instance; a retry falls back to on-demand
                    nodeLabel = (retryCount == 0) ? "ec2-spot" : "ec2-on-demand"
                    env["NODE_${index}_RETRY_COUNT"] = retryCount + 1
                    node(nodeLabel) {
                        stage('Debug info') {
                            // ...
                        }
                        stage('Run tests') {
                            // ...
                        }
                    }
                }
            }
        }
    }
    parallel nodes
Most of the time, this works. If a spot-based node gets interrupted, it retries. But occasionally, the retry just doesn't happen. I don't see anything in the logs (or anywhere else) about why it didn't retry. Here's an example of such a run:
One thing I've noticed is that an "Agent was removed" message always appears on the build page exactly as many times as there were successful retries.
In other words, if 20 nodes were interrupted and 19 of them were retried, I will see the "Agent was removed" message 19 times. It seems that, for some reason, Jenkins does not always detect that the agent disappeared.
Another clue is that the end of each node's log differs between nodes that retried and nodes that didn't. On the ones that retried, the log ends like this:
Cannot contact EC2 (ec2-spot) - Jenkins Agent Image (sir-688pdhsm): hudson.remoting.ChannelClosedException: Channel "hudson.remoting.Channel@4b2fd30b:EC2 (ec2-spot) - Jenkins Agent Image (sir-688pdhsm)": Remote call on EC2 (ec2-spot) - Jenkins Agent Image (sir-688pdhsm) failed. The channel is closing down or has closed down
Could not connect to EC2 (ec2-spot) - Jenkins Agent Image (sir-688pdhsm) to send interrupt signal to process
For nodes that didn't retry, the end of the log looks like this:
Cannot contact EC2 (ec2-spot) - Jenkins Agent Image (sir-24h6etnm): hudson.remoting.ChannelClosedException: Channel "hudson.remoting.Channel@63450caa:EC2 (ec2-spot) - Jenkins Agent Image (sir-24h6etnm)": Remote call on EC2 (ec2-spot) - Jenkins Agent Image (sir-24h6etnm) failed. The channel is closing down or has closed down
Note that the final line from the first log ("Could not connect to ... to send interrupt signal to process") does not appear in the second. I'm not sure what this means, but I'm hoping someone else might have a clue.
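For what it's worth, the spot-then-on-demand fallback itself does not need `env`. Below is a minimal, untested sketch of the same logic, assuming a hypothetical helper named `spotThenOnDemandBranch` (my own name, not something from a plugin) that keeps the attempt counter in a variable local to each branch. `retry`, `timeout`, `node`, `stage`, and `parallel` are the same steps used in the script above, and `numNodes` is assumed to be defined as before. I don't expect this to change whether Jenkins notices the lost agent, but it would rule out state shared between the parallel branches as a factor.

```groovy
// Hypothetical helper (not part of the original pipeline): builds one parallel
// branch whose attempt counter is a closure-local variable instead of an env entry.
def spotThenOnDemandBranch() {
    int attempt = 0
    return {
        retry(2) {
            timeout(time: 90, unit: 'MINUTES') {
                // First attempt: spot instance. Any retry: on-demand instance.
                String label = (attempt == 0) ? 'ec2-spot' : 'ec2-on-demand'
                attempt = attempt + 1
                node(label) {
                    stage('Run tests') {
                        // ...
                    }
                }
            }
        }
    }
}

def nodes = [:]
for (int i = 0; i < numNodes; i++) {
    // Same display names as in the script above: node_00, node_01, ...
    String name = "node_${i.toString().padLeft(2, '0')}"
    nodes[name] = spotThenOnDemandBranch()
}
parallel nodes
```

Because `label` is declared with a type inside the closure, each branch gets its own copy, whereas the undeclared `nodeLabel` in the script above ends up in the script binding shared by all branches.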