I'm having an issue where my jobs end up in an eternal pending state because the docker pull of the container image I want is hitting i/o timeouts. I've read several times about changing the DNS in order to fix this, but it seems kind of hokey; I don't need a public Google address on a private network...
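For reference, the suggested fix seems to amount to pointing the Docker daemon on each client at public resolvers, something like the following in /etc/docker/daemon.json (I haven't applied this, and the addresses are exactly the kind of thing I'd rather not hard-code on a private network):

{
  "dns": ["8.8.8.8", "8.8.4.4"]
}

followed by a restart of the Docker daemon on every node (e.g. sudo systemctl restart docker).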
Here's the nomad job ping-services.nomad.
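Trimmed down, it looks roughly like this (reconstructed from the status output below rather than pasted verbatim, so treat the count, the "http" port label, and the container port as best guesses):

job "ping_service" {
  type        = "service"
  datacenters = ["public-services", "private-services", "content-connector", "backoffice"]

  group "ping_service_group" {
    count = 4  # guess: four allocations show up in the status output

    task "ping_service_task" {
      driver = "docker"

      config {
        image = "thobe/ping_service:0.0.9"
        port_map {
          http = 8080  # container port is a guess
        }
      }

      resources {
        cpu    = 100  # MHz, matches the alloc status below
        memory = 20   # MiB, matches the alloc status below

        network {
          port "http" {}  # dynamic port, shows up as :31215 on one node
        }
      }
    }
  }
}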
And here it is after a run:
○ → nomad job status ping_service
ID = ping_service
Name = ping_service
Submit Date = 2019-04-25T13:29:04-07:00
Type = service
Priority = 50
Datacenters = public-services,private-services,content-connector,backoffice
Status = running
Periodic = false
Parameterized = false
Summary
Task Group Queued Starting Running Failed Complete Lost
ping_service_group 0 3 1 0 4 0
Allocations
ID Node ID Task Group Version Desired Status Created Modified
05468ff2 23b79904 ping_service_group 2 run pending 18h28m ago 19s ago <- here
5ce4c9ba 1601d6b1 ping_service_group 2 run pending 18h28m ago 20s ago <- here
9eced817 2260997a ping_service_group 2 run running 18h28m ago 18h28m ago
aefab4c3 032217e1 ping_service_group 2 run pending 18h28m ago 42s ago <- and here
You can see that three of the four allocations are stuck in pending. After running nomad alloc status 05468ff2 on one of them:
○ → nomad alloc status 05468ff2
ID = 05468ff2
Eval ID = 10b76231
Name = ping_service.ping_service_group[1]
Node ID = 23b79904
Job ID = ping_service
Job Version = 2
Client Status = pending
Client Description = <none>
Desired Status = run
Desired Description = <none>
Created = 18h35m ago
Modified = 15s ago
Task "ping_service_task" is "pending"
Task Resources
CPU Memory Disk IOPS Addresses
100 MHz 20 MiB 50 MiB 0 http: xx.xxx.xxx.xxx:31215
Task Events:
Started At = N/A
Finished At = N/A
Total Restarts = 982
Last Restart = 2019-04-26T15:04:01Z
Recent Events:
Time Type Description
2019-04-26T08:04:28-07:00 Driver Downloading image thobe/ping_service:0.0.9
2019-04-26T08:04:01-07:00 Restarting Task restarting in 27.061915977s
2019-04-26T08:04:01-07:00 Driver Failure failed to initialize task "ping_service_task" for alloc "05468ff2-f5a0-7a67-3dd7-947d4b30ec45": Failed to pull `thobe/ping_service:0.0.9`: error pulling image configuration: Get https://production.cloudflare.docker.com/registry-v2/docker/registry/v2/blobs/sha256/cf/cfaa80d7f11f028474f755c007960a0b219c90e1edc45d94039a987c46d7ca32/data?verify=1556294011-ftjrcDBBZK4hiQV99v5QZXxvp34%3D: dial tcp 104.18.122.25:443: i/o timeout
2019-04-26T08:03:19-07:00 Driver Downloading image thobe/ping_service:0.0.9
2019-04-26T08:02:51-07:00 Restarting Task restarting in 27.302069343s
2019-04-26T08:02:51-07:00 Driver Failure failed to initialize task "ping_service_task" for alloc "05468ff2-f5a0-7a67-3dd7-947d4b30ec45": Failed to pull `thobe/ping_service:0.0.9`: error pulling image configuration: Get https://production.cloudflare.docker.com/registry-v2/docker/registry/v2/blobs/sha256/cf/cfaa80d7f11f028474f755c007960a0b219c90e1edc45d94039a987c46d7ca32/data?verify=1556293941-ZUevnKxoKohkLDGDkv5E4A79aZ8%3D: dial tcp 104.18.122.25:443: i/o timeout
2019-04-26T08:02:12-07:00 Driver Downloading image thobe/ping_service:0.0.9
2019-04-26T08:01:46-07:00 Restarting Task restarting in 25.629825445s
2019-04-26T08:01:46-07:00 Driver Failure failed to initialize task "ping_service_task" for alloc "05468ff2-f5a0-7a67-3dd7-947d4b30ec45": Failed to pull `thobe/ping_service:0.0.9`: error pulling image configuration: Get https://production.cloudflare.docker.com/registry-v2/docker/registry/v2/blobs/sha256/cf/cfaa80d7f11f028474f755c007960a0b219c90e1edc45d94039a987c46d7ca32/data?verify=1556293876-lE4pvy9Jsruduu76LeMoQxL0gxk%3D: dial tcp 104.18.123.25:443: i/o timeout
2019-04-26T08:01:07-07:00 Driver Downloading image thobe/ping_service:0.0.9
You can clearly see that the issue is an I/O timeout preventing us from pulling our layers, so, jumping onto the node, let's try this manually...
## Make sure we're really logged into ECR/Docker
[ec2-user@ip-xx-xxx-xxx-xxx ~]$ docker login
Authenticating with existing credentials...
WARNING! Your password will be stored unencrypted in /home/ec2-user/.docker/config.json.
Configure a credential helper to remove this warning. See
https://docs.docker.com/engine/reference/commandline/login/#credentials-store
## Attempt a manual pull...
[ec2-user@ip-xx-xxx-xxx-xxx ~]$ docker pull thobe/ping_service:0.0.9
0.0.9: Pulling from thobe/ping_service
ff3a5c916c92: Pulling fs layer
3c5613eb8e39: Pulling fs layer
error pulling image configuration: Get https://production.cloudflare.docker.com/registry-v2/docker/registry/v2/blobs/sha256/cf/cfaa80d7f11f028474f755c007960a0b219c90e1edc45d94039a987c46d7ca32/data?verify=1556293601-mrJGlZisGPDvwapT7cAbax7UWig%3D: dial tcp 104.18.125.25:443: i/o timeout
## Are you there God?
[ec2-user@ip-xx-xxx-xxx-xxx ~]$ ping -c1 production.cloudflare.docker.com
PING production.cloudflare.docker.com (104.18.123.25) 56(84) bytes of data.
--- production.cloudflare.docker.com ping statistics ---
1 packets transmitted, 0 received, 100% packet loss, time 0ms
## NS of Google Pub DNS
[ec2-user@ip-xx-xxx-xxx-xxx ~]$ nslookup production.cloudflare.docker.com 8.8.8.8
;; connection timed out; no servers could be reached
## NS of Primary nameserver
[ec2-user@ip-xx-xxx-xxx-xxx ~]$ nslookup production.cloudflare.docker.com 10.128.8.8
;; connection timed out; no servers could be reached
## NS of Secondary nameserver
[ec2-user@ip-xx-xxx-xxx-xxx ~]$ nslookup production.cloudflare.docker.com 10.128.0.2
Server: 10.128.0.2
Address: 10.128.0.2#53
Non-authoritative answer:
Name: production.cloudflare.docker.com
Address: 104.18.122.25
Name: production.cloudflare.docker.com
Address: 104.18.123.25
Name: production.cloudflare.docker.com
Address: 104.18.124.25
Name: production.cloudflare.docker.com
Address: 104.18.125.25
Name: production.cloudflare.docker.com
Address: 104.18.121.25
## What are our current DNS settings?
[ec2-user@ip-xx-xxx-xxx-xxx ~]$ cat /etc/resolv.conf
options timeout:2 attempts:5
; generated by /usr/sbin/dhclient-script
search nomad-eu-west-1 eu-west-1.compute.internal
nameserver 10.128.8.8
nameserver 10.128.0.2
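The name does resolve (via the secondary resolver at least), yet the dial to 104.18.x.x:443 times out, so the next thing to rule out is plain outbound reachability on 443 from the bad nodes. A quick check along these lines (just the commands, not output from the node):

## Is the primary resolver dead, or just slow?
dig +time=2 +tries=1 production.cloudflare.docker.com @10.128.8.8

## Can we open TCP/443 to the registry CDN at all, independent of Docker?
curl -sv --connect-timeout 5 -o /dev/null https://production.cloudflare.docker.com/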
There seems to be something going on with the bad nodes (i.e. the ones that are not able to pull). Notice the issue with the Docker driver not being detected? I just noticed this on a bad node; check out the node events below....
○ → nomad node status 23b79904
ID = 23b79904
Name = i-xxxxxxx
Class = <none>
DC = public-services
Drain = false
Eligibility = eligible
Status = ready
Uptime = 21h43m20s
Driver Status = docker,exec
Node Events
Time Subsystem Message
2019-04-25T20:39:48Z Driver: docker Driver is available and responsive
2019-04-25T20:39:03Z Driver: docker Driver docker is not detected
2019-04-25T18:06:53Z Cluster Node registered
Allocated Resources
CPU Memory Disk IOPS
500/2399 MHz 128 MiB/983 MiB 300 MiB/48 GiB 0/0
Allocation Resource Utilization
CPU Memory
5/2399 MHz 14 MiB/983 MiB
Host Resource Utilization
CPU Memory Disk
24/2399 MHz 410 MiB/984 MiB 1.8 GiB/50 GiB
Allocations
ID Node ID Task Group Version Desired Status Created Modified
05468ff2 23b79904 ping_service_group 2 run pending 19h19m ago 33s ago
9f9ecba6 23b79904 fabio 0 run running 21h33m ago 21h32m ago
Good Node below....
○ → nomad node status 2260997a
ID = 2260997a
Name = i-xxxxxxxxx
Class = <none>
DC = content-connector
Drain = false
Eligibility = eligible
Status = ready
Uptime = 21h43m28s
Driver Status = docker,exec
Node Events
Time Subsystem Message
2019-04-25T18:07:04Z Cluster Node registered
Allocated Resources
CPU Memory Disk IOPS
100/2400 MHz 20 MiB/983 MiB 50 MiB/48 GiB 0/0
Allocation Resource Utilization
CPU Memory
0/2400 MHz 6.1 MiB/983 MiB
Host Resource Utilization
CPU Memory Disk
23/2400 MHz 361 MiB/984 MiB 1.8 GiB/50 GiB
Allocations
ID Node ID Task Group Version Desired Status Created Modified
9eced817 2260997a ping_service_group 2 run running 19h19m ago 19h19m ago
Nomad version below
[ec2-user@ip-xx-xxx-xxx-xxx ~]$ nomad -v
Nomad v0.8.6 (ab54ebcfcde062e9482558b7c052702d4cb8aa1b+CHANGES)