Nomad v1.0.4, Consul v1.7.3
We have a Nomad job spec with several task groups. Each task group has a single task. Each task has the same template stanza that references several Consul KV paths like so:
{{ if keyExists "services/mysql/database" }}
MYSQL_DB = "{{ key "services/mysql/database" }}"
{{ end }}
The Nomad job spec is programmatically generated in JSON format and submitted to the Nomad cluster via POST /jobs. All tasks in this job are constrained to run on the same host machine.
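For context, a trimmed-down version of the template block in the generated JSON looks roughly like this (the destination path, change mode, and Envvars flag here are illustrative; the embedded template content is exactly the stanza shown above):

    "Templates": [
      {
        "DestPath": "local/env.txt",
        "Envvars": true,
        "ChangeMode": "restart",
        "EmbeddedTmpl": "{{ if keyExists \"services/mysql/database\" }}\nMYSQL_DB = \"{{ key \"services/mysql/database\" }}\"\n{{ end }}\n"
      }
    ]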
We are seeing that some (not all) of the tasks become stuck in a pending state with allocation errors such as:
[1] Template failed: kv.block(services/mysql/database): Get "http://127.0.0.1:8500/v1/kv/services/mysql/database?index=1328&stale=&wait=60000ms": EOF
or
[2] Missing: kv.block(services/mysql/database)
Note that the Consul KV path named in the allocation error is non-deterministic. As mentioned above, every task uses the same template stanza, and that stanza references several Consul KV paths; the path reported in the error differs from one failed allocation to another.
We've verified that the Consul cluster is alive and that all the KV paths referenced in the template stanza exist.
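For example, a check along these lines succeeds for every referenced path (one of several shown):

    consul kv get services/mysql/database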
In theory, all the tasks should have met the same fate (i.e. failed) if either the Consul HTTP request were bad or a Consul KV path did not exist. As mentioned, only some of the tasks failed, while the others successfully entered a running state. From this we know the template stanza itself is valid, since at least some of the tasks run successfully.
We verified that the Consul HTTP request was working by running it directly via cURL.
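For example, replaying the blocking query from error [1] against the local agent completes successfully instead of returning EOF:

    curl -v 'http://127.0.0.1:8500/v1/kv/services/mysql/database?index=1328&stale=&wait=60000ms'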
Interestingly, some of the failed tasks recover automatically when they are eventually rescheduled. However, others simply remain in a pending state indefinitely.
Any insights about this behavior or possible solutions to explore are greatly appreciated.