Nomad v1.0.4, Consul v1.7.3
We have a Nomad job spec with several task groups. Each task group has a single task. Each task has the same template stanza that references several Consul KV paths like so:
{{ if keyExists "services/mysql/database" }}
MYSQL_DB = "{{ key "services/mysql/database" }}"
{{ end }}
The Nomad job spec is programmatically generated in JSON format and submitted to the Nomad cluster via POST /jobs. All tasks in this job are constrained to run on the same host machine.
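For context, a trimmed-down version of the template block in the generated JSON looks roughly like this (the destination path, change mode, and Envvars flag here are illustrative; the embedded template content is exactly the stanza shown above):

    "Templates": [
      {
        "DestPath": "local/env.txt",
        "Envvars": true,
        "ChangeMode": "restart",
        "EmbeddedTmpl": "{{ if keyExists \"services/mysql/database\" }}\nMYSQL_DB = \"{{ key \"services/mysql/database\" }}\"\n{{ end }}\n"
      }
    ]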
We are seeing that some (not all) of the tasks become stuck in a pending state with allocation errors such as:
[1] Template failed: kv.block(services/mysql/database): Get "http://127.0.0.1:8500/v1/kv/services/mysql/database?index=1328&stale=&wait=60000ms": EOF
or
[2] Missing: kv.block(services/mysql/database)
Note that the Consul KV path named in the allocation error is non-deterministic. As mentioned above, every task uses the same template stanza, and that stanza references several Consul KV paths; the path reported in the error differs from one failed allocation to another.
We've verified that the Consul cluster is alive and that all the KV paths referenced in the template stanza exist.
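For example, a check along these lines succeeds for every referenced path (one of several shown):

    consul kv get services/mysql/database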
In theory, all the tasks should have met the same fate (i.e. failed) if either the Consul HTTP request were bad or a Consul KV path did not exist. As mentioned, only some of the tasks failed, while the others successfully entered a running state. From this we know the template stanza itself is valid, since at least some of the tasks run successfully.
We verified that the Consul HTTP request was working by running it directly via cURL.
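For example, replaying the blocking query from error [1] against the local agent completes successfully instead of returning EOF:

    curl -v 'http://127.0.0.1:8500/v1/kv/services/mysql/database?index=1328&stale=&wait=60000ms'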
Interestingly, some of the failed tasks recover automatically when they are eventually rescheduled. However, others simply remain in a pending state indefinitely.
Any insights about this behavior or possible solutions to explore are greatly appreciated.