So I am still struggling with this.
As a recap, I have one Docker stack with 13 services running in swarm mode on two nodes (manager + worker). Each node has 4 cores and 8GB of RAM (Ubuntu 18.04, Docker 19.03.12).
If a node dies, or I drain a node, all the services running on that node die and are marked as Removed. If I simply run docker service update front_end --force, the service also dies and is marked as Removed.
Another important detail: if I sum up all the reserved memory and cores from the 13 services, I end up with 1.9 cores and 4GB of RAM, well below each node's resources.
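To make that arithmetic explicit, here is a minimal Python sketch that compares the reservation totals with the capacity of a single node. The numbers are the ones quoted above; the helper name `headroom` is only illustrative and is not part of any Docker tooling.

```python
# Minimal sketch: compare total reservations with one node's capacity.
# The totals come from the post; the helper name is illustrative only.

def headroom(reserved: float, capacity: float) -> float:
    """Return the fraction of capacity left once reservations are subtracted."""
    return 1.0 - reserved / capacity

NODE_CORES = 4        # cores per node
NODE_MEM_GB = 8       # RAM per node, in GB
RESERVED_CORES = 1.9  # sum of CPU reservations across the 13 services
RESERVED_MEM_GB = 4   # sum of memory reservations across the 13 services

cpu_free = headroom(RESERVED_CORES, NODE_CORES)
mem_free = headroom(RESERVED_MEM_GB, NODE_MEM_GB)

print(f"CPU headroom on one node: {cpu_free:.1%}")
print(f"Memory headroom on one node: {mem_free:.1%}")
```

Even if every service were scheduled on the same node, roughly half of its CPU and memory would remain unreserved, so the reservations alone should not prevent rescheduling.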
I don't see any out-of-memory errors in the container, service, or stack logs. Also, htop shows memory usage at 647MB/7.79GB on the manager node and 2GB/7.79GB on the worker node.
This is what I have tried so far:
- separated the 13 services into two different stacks. No luck.
- removed all the reservations tags from the compose files. No luck.
- tried running with 3 nodes. No luck.
- I was seeing the warning WARNING: No swap limit support, so I followed the suggestions in this document on both nodes. No luck.
- Increased both nodes' resources to 8 cores and 16GB of RAM. No luck.
- tried starting the services one at a time, and I noticed things start behaving badly with 10 or more services. That is to say, everything works fine with up to 9 services running; beyond that I see the behaviour described above.
Also, I enabled docker's debug mode to see what was happening. Here are the outputs.
If I run docker service update front_end --force and the front_end service dies, this is the output from docker events:
service update k6a7go4uhexb4b1u1fp98dtke (name=frontend)
service update k6a7go4uhexb4b1u1fp98dtke (name=frontend, updatestate.new=updating)
service remove k6a7go4uhexb4b1u1fp98dtke (name=frontend)
Logs from journalctl -fu docker.service:
level=debug msg="Calling GET /v1.40/services/frontend?insertDefaults=false"
level=debug msg="error handling rpc" error="rpc error: code = NotFound desc = service frontend not found" rpc=/docker.swarmkit.v1.Control/GetService
level=debug msg="form data: {\"EndpointSpec\":{\"Mode\":\"vip\"},\"Labels\":{\"com.docker.stack.image\":\"registry.gitlab.com/devteam/.frontend:1.0.5\",\"com.docker.stack.namespace\":\"\",\"traefik.docker.lbswarm\":\"true\",\"traefik.docker.network\":\"net\",\"traefik.enable\":\"true\",\"traefik.http.routers.frontend.entrypoints\":\"websecure\",\"traefik.http.routers.frontend.rule\":\"Host(`www.frontend.website`)\",\"traefik.http.routers.frontend.tls.certresolver\":\"myhttpchallenge\",\"traefik.http.services.frontend.loadbalancer.server.port\":\"80\"},\"Mode\":{\"Replicated\":{\"Replicas\":1}},\"Name\":\"frontend\",\"TaskTemplate\":{\"ContainerSpec\":{\"Image\":\"registry.gitlab.com/fdevteam/frontend:1.0.5@sha256:e9a0d88bc14848c3b40c3d2905842313bbc648c1bbf09305f8935f9eb23f289a\",\"Isolation\":\"default\",\"Labels\":{\"com.docker.stack.namespace\":\"f\"},\"Privileges\":{\"CredentialSpec\":null,\"SELinuxContext\":null}},\"ForceUpdate\":1,\"Networks\":[{\"Aliases\":[\"frontend\"],\"Target\":\"w7aqg3stebnmk5c5pbhgslh2d\"}],\"Placement\":{\"Platforms\":[{\"Architecture\":\"amd64\",\"OS\":\"linux\"}]},\"Resources\":{},\"RestartPolicy\":{\"Condition\":\"any\",\"MaxAttempts\":0},\"Runtime\":\"container\"}}"
level=debug msg="Calling GET /v1.40/services/frontend?insertDefaults=false"
level=debug msg="error handling rpc" error="rpc error: code = NotFound desc = service frontend not found" rpc=/docker.swarmkit.v1.Control/GetService
...
level=debug msg="Calling GET /v1.40/tasks?filters=%7B%22_up-to-date%22%3A%7B%22true%22%3Atrue%7D%2C%22service%22%3A%7B%22frontend%22%3Atrue%7D%7D"
level=debug msg="error handling rpc" error="rpc error: code = NotFound desc = service frontend not found" rpc=/docker.swarmkit.v1.Control/GetService
level=debug msg="handleEpTableEvent UPD 6b20c2924ec1eafa20c27d572019207551819b10a2c4f8d0574f2e142274c745 R:{frontend.1.lv7tjjaev45pvn0f7qtppb21r frontend nnlg81dsspnj6oxip4iqwwjc3 10.0.1.73 10.0.1.74 [] [frontend] [e661c9f39097] true}"
level=debug msg="rmServiceBinding from handleEpTableEvent START for frontend 6b20c2924ec1eafa20c27d572019207551819b10a2c4f8d0574f2e142274c745 p:0xc004a1f880 nid:w7aqg3stebnmk5c5pbhgslh2d sKey:{nnlg81dsspnj6oxip4iqwwjc3 } deleteSvc:true"
level=debug msg="deleteEndpointNameResolution 6b20c2924ec1eafa20c27d572019207551819b10a2c4f8d0574f2e142274c745 frontend rm_service:false suppress:false sAliases:[frontend] tAliases:[e661c9f39097]"
level=debug msg="delContainerNameResolution 6b20c2924ec1eafa20c27d572019207551819b10a2c4f8d0574f2e142274c745 frontend.1.lv7tjjaev45pvn0f7qtppb21r"
level=debug msg="6b20c2924ec1eafa20c27d572019207551819b10a2c4f8d0574f2e142274c745 (w7aqg3s).deleteSvcRecords(frontend.1.lv7tjjaev45pvn0f7qtppb21r, 10.0.1.74, <nil>, true) rmServiceBinding sid:6b20c2924ec1eafa20c27d572019207551819b10a2c4f8d0574f2e142274c745 "
level=debug msg="6b20c2924ec1eafa20c27d572019207551819b10a2c4f8d0574f2e142274c745 (w7aqg3s).deleteSvcRecords(tasks.frontend, 10.0.1.74, <nil>, false) rmServiceBinding sid:nnlg81dsspnj6oxip4iqwwjc3 "
level=debug msg="rmServiceBinding from handleEpTableEvent END for frontend 6b20c2924ec1eafa20c27d572019207551819b10a2c4f8d0574f2e142274c745"
level=debug msg="Calling GET /v1.40/services/frontend?insertDefaults=false"
level=debug msg="error handling rpc" error="rpc error: code = NotFound desc = service frontend not found" rpc=/docker.swarmkit.v1.Control/GetService
level=error msg="Error getting service frontend: service frontend not found"
level=debug msg="handleEpTableEvent DEL 6b20c2924ec1eafa20c27d572019207551819b10a2c4f8d0574f2e142274c745 R:{frontend.1.lv7tjjaev45pvn0f7qtppb21r frontend nnlg81dsspnj6oxip4iqwwjc3 10.0.1.73 10.0.1.74 [] [frontend] [e661c9f39097] true}"
level=debug msg="rmServiceBinding from handleEpTableEvent START for frontend 6b20c2924ec1eafa20c27d572019207551819b10a2c4f8d0574f2e142274c745 p:0xc004a1f880 nid:w7aqg3stebnmk5c5pbhgslh2d sKey:{nnlg81dsspnj6oxip4iqwwjc3 } deleteSvc:true"
level=debug msg="deleteEndpointNameResolution 6b20c2924ec1eafa20c27d572019207551819b10a2c4f8d0574f2e142274c745 frontend rm_service:true suppress:false sAliases:[frontend] tAliases:[e661c9f39097]"
level=debug msg="delContainerNameResolution 6b20c2924ec1eafa20c27d572019207551819b10a2c4f8d0574f2e142274c745 frontend.1.lv7tjjaev45pvn0f7qtppb21r"
level=debug msg="6b20c2924ec1eafa20c27d572019207551819b10a2c4f8d0574f2e142274c745 (w7aqg3s).deleteSvcRecords(frontend.1.lv7tjjaev45pvn0f7qtppb21r, 10.0.1.74, <nil>, true) rmServiceBinding sid:6b20c2924ec1eafa20c27d572019207551819b10a2c4f8d0574f2e142274c745 "
level=debug msg="6b20c2924ec1eafa20c27d572019207551819b10a2c4f8d0574f2e142274c745 (w7aqg3s).deleteSvcRecords(tasks.frontend, 10.0.1.74, <nil>, false) rmServiceBinding sid:nnlg81dsspnj6oxip4iqwwjc3 "
level=debug msg="6b20c2924ec1eafa20c27d572019207551819b10a2c4f8d0574f2e142274c745 (w7aqg3s).deleteSvcRecords(frontend, 10.0.1.73, <nil>, false) rmServiceBinding sid:nnlg81dsspnj6oxip4iqwwjc3 "
level=debug msg="rmServiceBinding from handleEpTableEvent END for frontend 6b20c2924ec1eafa20c27d572019207551819b10a2c4f8d0574f2e142274c745"
If the service does not die (which is the case with 9 or fewer services), this is the output from docker events:
service update n1wh16ru879699cpv3topcanc (name=frontend)
service update n1wh16ru879699cpv3topcanc (name=frontend, updatestate.new=updating)
service update n1wh16ru879699cpv3topcanc (name=frontend, updatestate.new=completed, updatestate.old=updating)
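The difference between the two event sequences can be checked mechanically. The following is a minimal Python sketch, assuming the events are in the abbreviated form pasted here (no timestamp prefix); the names `parse_event` and `outcome` are hypothetical helpers written for this post.

```python
import re
from typing import Dict, List

# Matches lines such as:
#   service update <id> (name=frontend, updatestate.new=updating)
EVENT_RE = re.compile(r"^service (?P<action>\w+) (?P<id>\w+) \((?P<attrs>.*)\)$")

def parse_event(line: str) -> Dict[str, str]:
    """Split one abbreviated event line into its action, id, and attributes."""
    m = EVENT_RE.match(line.strip())
    if m is None:
        raise ValueError(f"unrecognised event line: {line!r}")
    attrs = dict(pair.split("=", 1) for pair in m.group("attrs").split(", "))
    return {"action": m.group("action"), "id": m.group("id"), **attrs}

def outcome(lines: List[str]) -> str:
    """Classify an event sequence as 'removed', 'completed', or 'in progress'."""
    events = [parse_event(line) for line in lines]
    if any(e["action"] == "remove" for e in events):
        return "removed"
    if any(e.get("updatestate.new") == "completed" for e in events):
        return "completed"
    return "in progress"

# Event lines copied from the two runs in this post.
failing_events = [
    "service update k6a7go4uhexb4b1u1fp98dtke (name=frontend)",
    "service update k6a7go4uhexb4b1u1fp98dtke (name=frontend, updatestate.new=updating)",
    "service remove k6a7go4uhexb4b1u1fp98dtke (name=frontend)",
]
working_events = [
    "service update n1wh16ru879699cpv3topcanc (name=frontend)",
    "service update n1wh16ru879699cpv3topcanc (name=frontend, updatestate.new=updating)",
    "service update n1wh16ru879699cpv3topcanc (name=frontend, updatestate.new=completed, updatestate.old=updating)",
]

print(outcome(failing_events))
print(outcome(working_events))
```

In the failing run the update never reaches updatestate.new=completed; a service remove event arrives in its place.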
Logs from journalctl -fu docker.service:
level=debug msg="Calling GET /v1.40/services/frontend?insertDefaults=false"
level=debug msg="error handling rpc" error="rpc error: code = NotFound desc = service frontend not found" rpc=/docker.swarmkit.v1.Control/GetService
level=debug msg="form data: {\"EndpointSpec\":{\"Mode\":\"vip\"},\"Labels\":{\"com.docker.stack.image\":\"registry.gitlab.com/devteam/.frontend:1.0.5\",\"com.docker.stack.namespace\":\"\",\"traefik.docker.lbswarm\":\"true\",\"traefik.docker.network\":\"net\",\"traefik.enable\":\"true\",\"traefik.http.routers.frontend.entrypoints\":\"websecure\",\"traefik.http.routers.frontend.rule\":\"Host(`www.frontend.website`)\",\"traefik.http.routers.frontend.tls.certresolver\":\"myhttpchallenge\",\"traefik.http.services.frontend.loadbalancer.server.port\":\"80\"},\"Mode\":{\"Replicated\":{\"Replicas\":1}},\"Name\":\"frontend\",\"TaskTemplate\":{\"ContainerSpec\":{\"Image\":\"registry.gitlab.com/devteam/.frontend:1.0.5@sha256:e9a0d88bc14848c3b40c3d2905842313bbc648c1bbf09305f8935f9eb23f289a\",\"Isolation\":\"default\",\"Labels\":{\"com.docker.stack.namespace\":\"\"},\"Privileges\":{\"CredentialSpec\":null,\"SELinuxContext\":null}},\"ForceUpdate\":3,\"Networks\":[{\"Aliases\":[\"frontend\"],\"Target\":\"w7aqg3stebnmk5c5pbhgslh2d\"}],\"Placement\":{\"Platforms\":[{\"Architecture\":\"amd64\",\"OS\":\"linux\"}]},\"Resources\":{},\"RestartPolicy\":{\"Condition\":\"any\",\"MaxAttempts\":0},\"Runtime\":\"container\"}}"
level=debug msg="Calling GET /v1.40/services/frontend?insertDefaults=false"
level=debug msg="error handling rpc" error="rpc error: code = NotFound desc = service frontend not found" rpc=/docker.swarmkit.v1.Control/GetService
...
level=debug msg="Calling GET /v1.40/tasks?filters=%7B%22_up-to-date%22%3A%7B%22true%22%3Atrue%7D%2C%22service%22%3A%7B%22frontend%22%3Atrue%7D%7D"
level=debug msg="error handling rpc" error="rpc error: code = NotFound desc = service frontend not found" rpc=/docker.swarmkit.v1.Control/GetService
level=debug msg="handleEpTableEvent UPD e21b861c447ffd78bd2014744c13a146accd4600412c12b8cccfe3f3af4f0b98 R:{frontend.1.zeq4jz8kzle4c7vtzx5ofbrqo frontend n1wh16ru879699cpv3topcanc 10.0.1.32 10.0.1.46 [] [frontend] [f986fe859440] true}"
level=debug msg="rmServiceBinding from handleEpTableEvent START for frontend e21b861c447ffd78bd2014744c13a146accd4600412c12b8cccfe3f3af4f0b98 p:0xc005e1fa00 nid:w7aqg3stebnmk5c5pbhgslh2d sKey:{n1wh16ru879699cpv3topcanc } deleteSvc:true"
level=debug msg="deleteEndpointNameResolution e21b861c447ffd78bd2014744c13a146accd4600412c12b8cccfe3f3af4f0b98 frontend rm_service:false suppress:false sAliases:[frontend] tAliases:[f986fe859440]"
level=debug msg="delContainerNameResolution e21b861c447ffd78bd2014744c13a146accd4600412c12b8cccfe3f3af4f0b98 frontend.1.zeq4jz8kzle4c7vtzx5ofbrqo"
level=debug msg="e21b861c447ffd78bd2014744c13a146accd4600412c12b8cccfe3f3af4f0b98 (w7aqg3s).deleteSvcRecords(frontend.1.zeq4jz8kzle4c7vtzx5ofbrqo, 10.0.1.46, <nil>, true) rmServiceBinding sid:e21b861c447ffd78bd2014744c13a146accd4600412c12b8cccfe3f3af4f0b98 "
level=debug msg="e21b861c447ffd78bd2014744c13a146accd4600412c12b8cccfe3f3af4f0b98 (w7aqg3s).deleteSvcRecords(tasks.frontend, 10.0.1.46, <nil>, false) rmServiceBinding sid:n1wh16ru879699cpv3topcanc "
level=debug msg="rmServiceBinding from handleEpTableEvent END for frontend e21b861c447ffd78bd2014744c13a146accd4600412c12b8cccfe3f3af4f0b98"
level=debug msg="Calling GET /v1.40/services/frontend?insertDefaults=false"
level=debug msg="error handling rpc" error="rpc error: code = NotFound desc = service frontend not found" rpc=/docker.swarmkit.v1.Control/GetService
...
level=debug msg="Calling GET /v1.40/tasks?filters=%7B%22_up-to-date%22%3A%7B%22true%22%3Atrue%7D%2C%22service%22%3A%7B%22frontend%22%3Atrue%7D%7D"
level=debug msg="error handling rpc" error="rpc error: code = NotFound desc = service frontend not found" rpc=/docker.swarmkit.v1.Control/GetService
level=debug msg="handleEpTableEvent DEL e21b861c447ffd78bd2014744c13a146accd4600412c12b8cccfe3f3af4f0b98 R:{frontend.1.zeq4jz8kzle4c7vtzx5ofbrqo frontend n1wh16ru879699cpv3topcanc 10.0.1.32 10.0.1.46 [] [frontend] [f986fe859440] true}"
level=debug msg="rmServiceBinding from handleEpTableEvent START for frontend e21b861c447ffd78bd2014744c13a146accd4600412c12b8cccfe3f3af4f0b98 p:0xc005e1fa00 nid:w7aqg3stebnmk5c5pbhgslh2d sKey:{n1wh16ru879699cpv3topcanc } deleteSvc:true"
level=debug msg="deleteEndpointNameResolution e21b861c447ffd78bd2014744c13a146accd4600412c12b8cccfe3f3af4f0b98 frontend rm_service:true suppress:false sAliases:[frontend] tAliases:[f986fe859440]"
level=debug msg="delContainerNameResolution e21b861c447ffd78bd2014744c13a146accd4600412c12b8cccfe3f3af4f0b98 frontend.1.zeq4jz8kzle4c7vtzx5ofbrqo"
level=debug msg="e21b861c447ffd78bd2014744c13a146accd4600412c12b8cccfe3f3af4f0b98 (w7aqg3s).deleteSvcRecords(frontend.1.zeq4jz8kzle4c7vtzx5ofbrqo, 10.0.1.46, <nil>, true) rmServiceBinding sid:e21b861c447ffd78bd2014744c13a146accd4600412c12b8cccfe3f3af4f0b98 "
level=debug msg="e21b861c447ffd78bd2014744c13a146accd4600412c12b8cccfe3f3af4f0b98 (w7aqg3s).deleteSvcRecords(tasks.frontend, 10.0.1.46, <nil>, false) rmServiceBinding sid:n1wh16ru879699cpv3topcanc "
level=debug msg="e21b861c447ffd78bd2014744c13a146accd4600412c12b8cccfe3f3af4f0b98 (w7aqg3s).deleteSvcRecords(frontend, 10.0.1.32, <nil>, false) rmServiceBinding sid:n1wh16ru879699cpv3topcanc "
level=debug msg="rmServiceBinding from handleEpTableEvent END for frontend e21b861c447ffd78bd2014744c13a146accd4600412c12b8cccfe3f3af4f0b98"
level=debug msg="Calling GET /v1.40/services/frontend?insertDefaults=false"
level=debug msg="error handling rpc" error="rpc error: code = NotFound desc = service frontend not found" rpc=/docker.swarmkit.v1.Control/GetService
...
level=debug msg="Calling GET /v1.40/tasks?filters=%7B%22_up-to-date%22%3A%7B%22true%22%3Atrue%7D%2C%22service%22%3A%7B%22frontend%22%3Atrue%7D%7D"
level=debug msg="error handling rpc" error="rpc error: code = NotFound desc = service frontend not found" rpc=/docker.swarmkit.v1.Control/GetService
level=debug msg="handleEpTableEvent ADD 521ffeee31efe056900fb5a1fe73007c179594e964f703625cf3272eb14983c0 R:{frontend.1.1v9ggahd87x2ydlkna0qx7jmz frontend n1wh16ru879699cpv3topcanc 10.0.1.32 10.0.1.47 [] [frontend] [3671840709bb] false}"
level=debug msg="addServiceBinding from handleEpTableEvent START for frontend 521ffeee31efe056900fb5a1fe73007c179594e964f703625cf3272eb14983c0 p:0xc004a1ed80 nid:w7aqg3stebnmk5c5pbhgslh2d skey:{n1wh16ru879699cpv3topcanc }"
level=debug msg="addEndpointNameResolution 521ffeee31efe056900fb5a1fe73007c179594e964f703625cf3272eb14983c0 frontend add_service:true sAliases:[frontend] tAliases:[3671840709bb]"
level=debug msg="addContainerNameResolution 521ffeee31efe056900fb5a1fe73007c179594e964f703625cf3272eb14983c0 frontend.1.1v9ggahd87x2ydlkna0qx7jmz"
level=debug msg="521ffeee31efe056900fb5a1fe73007c179594e964f703625cf3272eb14983c0 (w7aqg3s).addSvcRecords(frontend.1.1v9ggahd87x2ydlkna0qx7jmz, 10.0.1.47, <nil>, true) addServiceBinding sid:521ffeee31efe056900fb5a1fe73007c179594e964f703625cf3272eb14983c0"
level=debug msg="521ffeee31efe056900fb5a1fe73007c179594e964f703625cf3272eb14983c0 (w7aqg3s).addSvcRecords(tasks.frontend, 10.0.1.47, <nil>, false) addServiceBinding sid:n1wh16ru879699cpv3topcanc"
level=debug msg="521ffeee31efe056900fb5a1fe73007c179594e964f703625cf3272eb14983c0 (w7aqg3s).addSvcRecords(frontend, 10.0.1.32, <nil>, false) addServiceBinding sid:n1wh16ru879699cpv3topcanc"
level=debug msg="addServiceBinding from handleEpTableEvent END for frontend 521ffeee31efe056900fb5a1fe73007c179594e964f703625cf3272eb14983c0"
level=debug msg="Calling GET /v1.40/services/frontend?insertDefaults=false"
level=debug msg="error handling rpc" error="rpc error: code = NotFound desc = service frontend not found" rpc=/docker.swarmkit.v1.Control/GetService
...
level=debug msg="Calling GET /v1.40/services/frontend?insertDefaults=false"
level=debug msg="error handling rpc" error="rpc error: code = NotFound desc = service frontend not found" rpc=/docker.swarmkit.v1.Control/GetService
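One visible difference between the two log excerpts is that the failing run contains handleEpTableEvent UPD and DEL entries but no ADD, whereas the working run ends with an ADD for the new task. A minimal Python sketch for counting these entries in a saved log is below; the function names are hypothetical, the sample lines are shortened copies of the ones above, and the excerpts themselves are elided in places, so the count is only a hint rather than a diagnosis.

```python
import re
from collections import Counter

# Matches the endpoint-table events emitted in the daemon's debug log.
EP_EVENT_RE = re.compile(r'msg="handleEpTableEvent (ADD|UPD|DEL) ')

def endpoint_event_counts(log_lines):
    """Count handleEpTableEvent entries by kind (ADD, UPD, DEL)."""
    counts = Counter()
    for line in log_lines:
        m = EP_EVENT_RE.search(line)
        if m:
            counts[m.group(1)] += 1
    return counts

def was_readded(log_lines):
    """True if at least one endpoint ADD event appears in the log."""
    return endpoint_event_counts(log_lines)["ADD"] > 0

# Shortened sample lines (endpoint IDs truncated for readability).
failing_log = [
    'level=debug msg="handleEpTableEvent UPD 6b20c292 R:{frontend.1.lv7tjjaev45pvn0f7qtppb21r frontend}"',
    'level=error msg="Error getting service frontend: service frontend not found"',
    'level=debug msg="handleEpTableEvent DEL 6b20c292 R:{frontend.1.lv7tjjaev45pvn0f7qtppb21r frontend}"',
]
working_log = [
    'level=debug msg="handleEpTableEvent UPD e21b861c R:{frontend.1.zeq4jz8kzle4c7vtzx5ofbrqo frontend}"',
    'level=debug msg="handleEpTableEvent DEL e21b861c R:{frontend.1.zeq4jz8kzle4c7vtzx5ofbrqo frontend}"',
    'level=debug msg="handleEpTableEvent ADD 521ffeee R:{frontend.1.1v9ggahd87x2ydlkna0qx7jmz frontend}"',
]

print(dict(endpoint_event_counts(failing_log)))
print(dict(endpoint_event_counts(working_log)))
```

To run it against a real capture, the output of journalctl could be saved to a file and its lines passed to `endpoint_event_counts`.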