Nomad job pending due to Docker i/o timeout on 3/4 nodes?

Question

I'm having an issue where my jobs end up stuck in a pending state indefinitely because the docker pull of the container image I want hits i/o timeouts. I've read several times that changing the DNS fixes this, but that seems kinda hokey; I shouldn't need a public Google address on a private network... Here's the status of the Nomad job ping-services.nomad after a run.

○ → nomad job status ping_service
ID            = ping_service
Name          = ping_service
Submit Date   = 2019-04-25T13:29:04-07:00
Type          = service
Priority      = 50
Datacenters   = public-services,private-services,content-connector,backoffice
Status        = running
Periodic      = false
Parameterized = false

Summary
Task Group          Queued  Starting  Running  Failed  Complete  Lost
ping_service_group  0       3         1        0       4         0

Allocations
ID        Node ID   Task Group          Version  Desired  Status   Created     Modified
05468ff2  23b79904  ping_service_group  2        run      pending  18h28m ago  19s ago      <- here
5ce4c9ba  1601d6b1  ping_service_group  2        run      pending  18h28m ago  20s ago      <- here
9eced817  2260997a  ping_service_group  2        run      running  18h28m ago  18h28m ago
aefab4c3  032217e1  ping_service_group  2        run      pending  18h28m ago  42s ago      <- and here
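
To list only the stuck allocations and the nodes they landed on, the table above can be filtered. This is a minimal sketch, assuming the column layout shown here (Status is the sixth whitespace-separated field, so task group names must not contain spaces); `pending_allocs` is a hypothetical helper name, not part of Nomad.

```shell
#!/usr/bin/env bash
# pending_allocs: read `nomad job status <job>` output on stdin and print
# "<alloc-id> <node-id>" for each allocation whose Status column is "pending".
# Assumes the column order ID, Node ID, Task Group, Version, Desired, Status.
pending_allocs() {
  awk '
    /^Allocations/          { in_allocs = 1; next }   # start of the table
    in_allocs && $1 == "ID" { next }                  # skip the header row
    in_allocs && NF == 0    { in_allocs = 0 }         # blank line ends the table
    in_allocs && $6 == "pending" { print $1, $2 }
  '
}
```

Usage would look like `nomad job status ping_service | pending_allocs`, which gives one line per pending allocation to feed into `nomad alloc status` or `nomad node status`.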

You can see that 3 of the 4 allocations are stuck in pending and only 1 is running. Here is the output of nomad alloc status 05468ff2 for one of the pending ones:

○ → nomad alloc status 05468ff2
ID                  = 05468ff2
Eval ID             = 10b76231
Name                = ping_service.ping_service_group[1]
Node ID             = 23b79904
Job ID              = ping_service
Job Version         = 2
Client Status       = pending
Client Description  = <none>
Desired Status      = run
Desired Description = <none>
Created             = 18h35m ago
Modified            = 15s ago

Task "ping_service_task" is "pending"
Task Resources
CPU      Memory  Disk    IOPS  Addresses
100 MHz  20 MiB  50 MiB  0     http: xx.xxx.xxx.xxx:31215

Task Events:
Started At     = N/A
Finished At    = N/A
Total Restarts = 982
Last Restart   = 2019-04-26T15:04:01Z

Recent Events:
Time                       Type            Description
2019-04-26T08:04:28-07:00  Driver          Downloading image thobe/ping_service:0.0.9
2019-04-26T08:04:01-07:00  Restarting      Task restarting in 27.061915977s
2019-04-26T08:04:01-07:00  Driver Failure  failed to initialize task "ping_service_task" for alloc "05468ff2-f5a0-7a67-3dd7-947d4b30ec45": Failed to pull `thobe/ping_service:0.0.9`: error pulling image configuration: Get https://production.cloudflare.docker.com/registry-v2/docker/registry/v2/blobs/sha256/cf/cfaa80d7f11f028474f755c007960a0b219c90e1edc45d94039a987c46d7ca32/data?verify=1556294011-ftjrcDBBZK4hiQV99v5QZXxvp34%3D: dial tcp 104.18.122.25:443: i/o timeout
2019-04-26T08:03:19-07:00  Driver          Downloading image thobe/ping_service:0.0.9
2019-04-26T08:02:51-07:00  Restarting      Task restarting in 27.302069343s
2019-04-26T08:02:51-07:00  Driver Failure  failed to initialize task "ping_service_task" for alloc "05468ff2-f5a0-7a67-3dd7-947d4b30ec45": Failed to pull `thobe/ping_service:0.0.9`: error pulling image configuration: Get https://production.cloudflare.docker.com/registry-v2/docker/registry/v2/blobs/sha256/cf/cfaa80d7f11f028474f755c007960a0b219c90e1edc45d94039a987c46d7ca32/data?verify=1556293941-ZUevnKxoKohkLDGDkv5E4A79aZ8%3D: dial tcp 104.18.122.25:443: i/o timeout
2019-04-26T08:02:12-07:00  Driver          Downloading image thobe/ping_service:0.0.9
2019-04-26T08:01:46-07:00  Restarting      Task restarting in 25.629825445s
2019-04-26T08:01:46-07:00  Driver Failure  failed to initialize task "ping_service_task" for alloc "05468ff2-f5a0-7a67-3dd7-947d4b30ec45": Failed to pull `thobe/ping_service:0.0.9`: error pulling image configuration: Get https://production.cloudflare.docker.com/registry-v2/docker/registry/v2/blobs/sha256/cf/cfaa80d7f11f028474f755c007960a0b219c90e1edc45d94039a987c46d7ca32/data?verify=1556293876-lE4pvy9Jsruduu76LeMoQxL0gxk%3D: dial tcp 104.18.123.25:443: i/o timeout
2019-04-26T08:01:07-07:00  Driver          Downloading image thobe/ping_service:0.0.9

You can clearly see that an i/o timeout is preventing us from pulling our layers, so, jumping on the node, let's try this manually...

## Make sure we're really logged into ECR/Docker
[ec2-user@ip-xx-xxx-xxx-xxx ~]$ docker login
Authenticating with existing credentials...
WARNING! Your password will be stored unencrypted in /home/ec2-user/.docker/config.json.
Configure a credential helper to remove this warning. See
https://docs.docker.com/engine/reference/commandline/login/#credentials-store

## Attempt a manual pull... 
[ec2-user@ip-xx-xxx-xxx-xxx ~]$ docker pull thobe/ping_service:0.0.9
0.0.9: Pulling from thobe/ping_service
ff3a5c916c92: Pulling fs layer
3c5613eb8e39: Pulling fs layer
error pulling image configuration: Get https://production.cloudflare.docker.com/registry-v2/docker/registry/v2/blobs/sha256/cf/cfaa80d7f11f028474f755c007960a0b219c90e1edc45d94039a987c46d7ca32/data?verify=1556293601-mrJGlZisGPDvwapT7cAbax7UWig%3D: dial tcp 104.18.125.25:443: i/o timeout

## Are you there God?
[ec2-user@ip-xx-xxx-xxx-xxx ~]$ ping -c1 production.cloudflare.docker.com
PING production.cloudflare.docker.com (104.18.123.25) 56(84) bytes of data.

--- production.cloudflare.docker.com ping statistics ---
1 packets transmitted, 0 received, 100% packet loss, time 0ms
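
The pull fails on `dial tcp ...:443`, and ICMP can be filtered separately from TCP 443 (for example by security group rules), so a TCP-level probe may be more telling than ping. A minimal sketch, assuming bash with /dev/tcp support and coreutils `timeout` are available on the node; `check_tcp` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# check_tcp HOST PORT [TIMEOUT_SECONDS]
# Returns 0 if a TCP connection to HOST:PORT can be opened within the
# timeout (default 5 seconds), non-zero otherwise.
check_tcp() {
  local host="$1" port="$2" secs="${3:-5}"
  # bash opens a socket when redirecting to /dev/tcp/<host>/<port>;
  # host and port are passed as $0 and $1 to avoid quoting problems.
  timeout "$secs" bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$host" "$port" 2>/dev/null
}
```

For example, `check_tcp production.cloudflare.docker.com 443 && echo reachable || echo blocked`, run on both a good node and a bad node, would show whether outbound 443 is what differs between them.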


## NS lookup against Google public DNS
[ec2-user@ip-xx-xxx-xxx-xxx ~]$ nslookup production.cloudflare.docker.com 8.8.8.8
;; connection timed out; no servers could be reached

## NS of Primary nameserver
[ec2-user@ip-xx-xxx-xxx-xxx ~]$ nslookup production.cloudflare.docker.com 10.128.8.8
;; connection timed out; no servers could be reached

## NS of Secondary nameserver
[ec2-user@ip-xx-xxx-xxx-xxx ~]$ nslookup production.cloudflare.docker.com 10.128.0.2
Server:     10.128.0.2
Address:    10.128.0.2#53

Non-authoritative answer:
Name:   production.cloudflare.docker.com
Address: 104.18.122.25
Name:   production.cloudflare.docker.com
Address: 104.18.123.25
Name:   production.cloudflare.docker.com
Address: 104.18.124.25
Name:   production.cloudflare.docker.com
Address: 104.18.125.25
Name:   production.cloudflare.docker.com
Address: 104.18.121.25

## Resolver
[ec2-user@ip-xx-xxx-xxx-xxx ~]$ cat /etc/resolv.conf
options timeout:2 attempts:5
; generated by /usr/sbin/dhclient-script
search nomad-eu-west-1 eu-west-1.compute.internal
nameserver 10.128.8.8
nameserver 10.128.0.2
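
The lookups above suggest the primary nameserver (10.128.8.8) times out while the secondary answers. With `timeout:2 attempts:5`, the resolver may spend a while on the dead primary before falling back. To compare nodes quickly, each configured nameserver can be probed in turn. A minimal sketch, assuming `nslookup` exits non-zero when a server cannot be reached; `list_nameservers` and `probe_nameservers` are hypothetical helper names.

```shell
#!/usr/bin/env bash
# list_nameservers [FILE]: print the nameserver addresses from a
# resolv.conf-style file (defaults to /etc/resolv.conf), one per line.
list_nameservers() {
  awk '$1 == "nameserver" { print $2 }' "${1:-/etc/resolv.conf}"
}

# probe_nameservers NAME [FILE]: query NAME against each configured
# nameserver and report which ones answered.
probe_nameservers() {
  local name="$1" conf="${2:-/etc/resolv.conf}" ns
  for ns in $(list_nameservers "$conf"); do
    if nslookup "$name" "$ns" >/dev/null 2>&1; then
      echo "$ns ok"
    else
      echo "$ns FAILED"
    fi
  done
}
```

Running `probe_nameservers production.cloudflare.docker.com` on a good node and a bad node would show whether both see the same dead primary. That would suggest DNS is slow everywhere and is not what separates the failing nodes.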

There seems to be something going on with the bad nodes (i.e. the ones that are not able to pull). Notice that the Docker driver was reported as not detected at one point? I just noticed this on a bad node; check out the node events...

○ → nomad node status 23b79904
ID            = 23b79904
Name          = i-xxxxxxx
Class         = <none>
DC            = public-services
Drain         = false
Eligibility   = eligible
Status        = ready
Uptime        = 21h43m20s
Driver Status = docker,exec

Node Events
Time                  Subsystem       Message
2019-04-25T20:39:48Z  Driver: docker  Driver is available and responsive
2019-04-25T20:39:03Z  Driver: docker  Driver docker is not detected
2019-04-25T18:06:53Z  Cluster         Node registered

Allocated Resources
CPU           Memory           Disk            IOPS
500/2399 MHz  128 MiB/983 MiB  300 MiB/48 GiB  0/0

Allocation Resource Utilization
CPU         Memory
5/2399 MHz  14 MiB/983 MiB

Host Resource Utilization
CPU          Memory           Disk
24/2399 MHz  410 MiB/984 MiB  1.8 GiB/50 GiB

Allocations
ID        Node ID   Task Group          Version  Desired  Status   Created     Modified
05468ff2  23b79904  ping_service_group  2        run      pending  19h19m ago  33s ago
9f9ecba6  23b79904  fabio               0        run      running  21h33m ago  21h32m ago

Good Node below....

○ → nomad node status 2260997a
ID            = 2260997a
Name          = i-xxxxxxxxx
Class         = <none>
DC            = content-connector
Drain         = false
Eligibility   = eligible
Status        = ready
Uptime        = 21h43m28s
Driver Status = docker,exec

Node Events
Time                  Subsystem  Message
2019-04-25T18:07:04Z  Cluster    Node registered

Allocated Resources
CPU           Memory          Disk           IOPS
100/2400 MHz  20 MiB/983 MiB  50 MiB/48 GiB  0/0

Allocation Resource Utilization
CPU         Memory
0/2400 MHz  6.1 MiB/983 MiB

Host Resource Utilization
CPU          Memory           Disk
23/2400 MHz  361 MiB/984 MiB  1.8 GiB/50 GiB

Allocations
ID        Node ID   Task Group          Version  Desired  Status   Created     Modified
9eced817  2260997a  ping_service_group  2        run      running  19h19m ago  19h19m ago

Nomad version below

[ec2-user@ip-xx-xxx-xxx-xxx ~]$ nomad -v
Nomad v0.8.6 (ab54ebcfcde062e9482558b7c052702d4cb8aa1b+CHANGES)

The ping command is not correct, should be: `ping production.cloudflare.docker.com` — JamesJJ, Apr 26 '19 at 15:18
Could you run some commands to check your DNS is ok or not? : `nslookup production.cloudflare.docker.com 8.8.8.8` and `nslookup production.cloudflare.docker.com 10.128.8.8` and `nslookup production.cloudflare.docker.com 10.128.0.2` — JamesJJ, Apr 26 '19 at 15:20
@JamesJJ https://gist.github.com/ehime/94a530cb2430b30aa7b7518e5303342a Also give me a second, I'm about to update my question I just saw an anomaly in Nomad/Docker due to the docker driver. — ehime, Apr 26 '19 at 15:51
