
I'm trying my luck here to solve a problem I'm having on Google Kubernetes Engine.

Problem in short: When I upload a 15-20 MB file via my PHP application, the nginx ingress controller crashes, disk IO spikes rapidly, then CPU goes up, and it takes about 5-30 minutes until IO and CPU come back down and everything successfully restarts.

Here are the logs from the nginx-ingress-controller containers showing everything that happens, with my comments:



Upload successfully received by the app:

INFO 2020-02-14 14:30:55.481 CET 10.102.1.1 - [10.102.1.1] - - [14/Feb/2020:13:30:55 +0000] "POST /api/v1/contracts/38141/file-system/upload HTTP/2.0" 499 0


NGINX starts to produce tons of log lines like this:

INFO 2020-02-14 14:30:55.819 CET *�I�g�*��\u001AnK67�@?+�(%u052f��O�yqq$+u$,�b�<*�9#\t��\u0003d\u0006+����I�]A�%u0110jv��hAp\"�63�9\u0019Q�{�x|K�\u000BE\u001C��\"-P%u0079�\u001Ed�Tv


After many such lines, there are logs saying the ingress endpoints are not available:

WARN 2020-02-14T13:31:05.505984Z Service "gitlab-managed-apps/ingress-nginx-ingress-default-backend" does not have any active Endpoint 
WARN 2020-02-14 14:31:05.526 CET Service "my-app/my-app" does not have any active Endpoint.
WARN 2020-02-14 14:31:05.526 CET Service "my-app/app-staging" does not have any active Endpoint.

... skipped access logs ...

WARN 2020-02-14 14:32:34.419 CET failed to renew lease gitlab-managed-apps/ingress-controller-leader-nginx: failed to tryAcquireOrRenew context deadline exceeded
2020-02-14 14:32:42.227 CET attempting to acquire leader lease gitlab-managed-apps/ingress-controller-leader-nginx...
ERROR 2020-02-14 14:32:43.464 CET Failed to update lock: Operation cannot be fulfilled on configmaps "ingress-controller-leader-nginx": the object has been modified; please apply your changes to the latest version and try again

Now another file upload happens from a client, again producing tons of garbled symbol logs... and after those, the following is logged:

INFO 2020-02-14T13:33:37.525466Z Received SIGTERM, shutting down 
INFO 2020-02-14T13:33:55.513100Z Received SIGTERM, shutting down 
INFO 2020-02-14T13:33:55.513155Z Shutting down controller queues 
INFO 2020-02-14T13:33:55.516017Z updating status of Ingress rules (remove) 
ERROR 2020-02-14T13:33:55.570340Z healthcheck error: Get http+unix://nginx-status/healthz: read unix @->/tmp/nginx-status-server.sock: i/o timeout 
INFO 2020-02-14T13:33:55.574690Z Shutting down controller queues 
INFO 2020-02-14T13:33:55.576049Z updating status of Ingress rules (remove) 
ERROR 2020-02-14T13:33:55.610722Z healthcheck error: Get http+unix://nginx-status/healthz: read unix @->/tmp/nginx-status-server.sock: i/o timeout 
ERROR 2020-02-14T13:33:55.774881Z healthcheck error: Get http+unix://nginx-status/healthz: read unix @->/tmp/nginx-status-server.sock: i/o timeout 
INFO 2020-02-14T13:33:55.776321Z failed to renew lease gitlab-managed-apps/ingress-controller-leader-nginx: failed to tryAcquireOrRenew context deadline exceeded 
INFO 2020-02-14T13:33:55.781376Z attempting to acquire leader lease  gitlab-managed-apps/ingress-controller-leader-nginx... 
INFO 2020-02-14T13:33:56.826124Z successfully acquired lease gitlab-managed-apps/ingress-controller-leader-nginx 
INFO 2020-02-14T13:33:56.833827Z new leader elected: ingress-nginx-ingress-controller-756f8d9cbb-86xnh 
ERROR 2020-02-14T13:33:56.933107Z queue has been shutdown, failed to enqueue: &ObjectMeta{Name:sync status,GenerateName:,Namespace:,SelfLink:,UID:,ResourceVersion:,Generation:0,CreationTimestamp:0001-01-01 00:00:00 +0000 UTC,DeletionTimestamp:<nil>,DeletionGracePeriodSeconds:nil,Labels:map[string]string{},Annotations:map[string]string{},OwnerReferences:[],Finalizers:[],ClusterName:,ManagedFields:[],} 
INFO 2020-02-14T13:33:58.027600Z new leader elected: ingress-nginx-ingress-controller-756f8d9cbb-86xnh 
ERROR 2020-02-14T13:33:58.117920Z Failed to update lock: Operation cannot be fulfilled on configmaps "ingress-controller-leader-nginx": the object has been modified; please apply your changes to the latest version and try again 
INFO 2020-02-14T13:33:59.709458Z Stopping NGINX process 
INFO 2020-02-14T13:33:59.718181Z Stopping NGINX process 
ERROR 2020-02-14T13:34:03.010148Z healthcheck error: Get http+unix://nginx-status/is-dynamic-lb-initialized: dial unix /tmp/nginx-status-server.sock: i/o timeout 
ERROR 2020-02-14T13:34:12.627155Z healthcheck error: Get http+unix://nginx-status/is-dynamic-lb-initialized: read unix @->/tmp/nginx-status-server.sock: i/o timeout 
ERROR 2020-02-14T13:34:12.832624Z healthcheck error: Get http+unix://nginx-status/is-dynamic-lb-initialized: read unix @->/tmp/nginx-status-server.sock: i/o timeout 
ERROR 2020-02-14T13:34:13.693853Z healthcheck error: Get http+unix://nginx-status/healthz: read unix @->/tmp/nginx-status-server.sock: i/o timeout 
ERROR 2020-02-14T13:34:13.693930Z healthcheck error: Get http+unix://nginx-status/is-dynamic-lb-initialized: read unix @->/tmp/nginx-status-server.sock: i/o timeout 
INFO 2020-02-14T13:34:41.620594055Z -------------------------------------------------------------------------------
INFO 2020-02-14T13:34:41.620664183Z NGINX Ingress controller
INFO 2020-02-14T13:34:41.620671154Z   Release:       0.25.1
INFO 2020-02-14T13:34:41.620675964Z   Build:         git-5179893a9
INFO 2020-02-14T13:34:41.620681055Z   Repository:    https://github.com/kubernetes/ingress-nginx/
INFO 2020-02-14T13:34:41.620686042Z   nginx version:     openresty/1.15.8.1
INFO 2020-02-14T13:34:41.620691348Z 
INFO 2020-02-14T13:34:41.620695778Z -------------------------------------------------------------------------------
INFO 2020-02-14T13:34:41.620701128Z 
INFO 2020-02-14T13:34:41.622564Z Watching for Ingress class: nginx 
WARN 2020-02-14T13:34:41.622863Z SSL certificate chain completion is disabled (--enable-ssl-chain-completion=false) 
INFO 2020-02-14T13:34:41.623360607Z -------------------------------------------------------------------------------
INFO 2020-02-14T13:34:41.623418446Z NGINX Ingress controller
INFO 2020-02-14T13:34:41.623425256Z   Release:       0.25.1
INFO 2020-02-14T13:34:41.623426Z Watching for Ingress class: nginx 
INFO 2020-02-14T13:34:41.623430244Z   Build:         git-5179893a9
INFO 2020-02-14T13:34:41.623435128Z   Repository:    https://github.com/kubernetes/ingress-nginx/
INFO 2020-02-14T13:34:41.623441533Z   nginx version: openresty/1.15.8.1
INFO 2020-02-14T13:34:41.623447006Z 
INFO 2020-02-14T13:34:41.623451329Z -------------------------------------------------------------------------------
INFO 2020-02-14T13:34:41.623456382Z 
WARN 2020-02-14T13:34:41.623731Z SSL certificate chain completion is disabled (--enable-ssl-chain-completion=false) 
ERROR 2020-02-14T13:34:41.629507140Z nginx version: openresty/1.15.8.1
WARN 2020-02-14T13:34:41.633116Z Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work. 
INFO 2020-02-14T13:34:41.633644Z Creating API client for https://10.103.0.1:443 
ERROR 2020-02-14T13:34:41.640959117Z nginx version: openresty/1.15.8.1
WARN 2020-02-14T13:34:41.642065Z Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work. 
INFO 2020-02-14T13:34:41.642376Z Creating API client for https://10.103.0.1:443 
INFO 2020-02-14T13:34:41.682018Z Running in Kubernetes cluster version v1.13+ (v1.13.12-gke.25) - git (clean) commit 654de8cac69f1fc5db6f2de0b88d6d027bc15828 - platform linux/amd64 
INFO 2020-02-14T13:34:41.700374Z Running in Kubernetes cluster version v1.13+ (v1.13.12-gke.25) - git (clean) commit 654de8cac69f1fc5db6f2de0b88d6d027bc15828 - platform linux/amd64 

Here you can see that nginx crashed (I don't know why) and restarted.

My question is:
What could cause nginx's healthcheck to fail and the pod to be terminated? Can I configure nginx-ingress buffering somehow to avoid this? Does it happen because of the huge amount of logging and the disk failing to keep up? Or is it because nginx is buffering the uploaded file and takes too long to respond to the healthcheck? How can I avoid it?

Here are the nginx-ingress annotations I have already tried; the problem occurs both with and without them:

nginx.ingress.kubernetes.io/client-body-buffer-size: 5m
nginx.ingress.kubernetes.io/proxy-body-size: 15m
nginx.ingress.kubernetes.io/proxy-buffering: "on"
nginx.org/client-max-body-size: 15m
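
For reference, this is roughly how those annotations sit on my Ingress resource (a sketch only; the host and port below are placeholders, not my real values, and the API version matches Kubernetes 1.13):

apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: my-app
  namespace: my-app
  annotations:
    kubernetes.io/ingress.class: nginx
    nginx.ingress.kubernetes.io/client-body-buffer-size: 5m
    nginx.ingress.kubernetes.io/proxy-body-size: 15m
    nginx.ingress.kubernetes.io/proxy-buffering: "on"
    nginx.org/client-max-body-size: 15m
spec:
  rules:
    - host: app.example.com        # placeholder host
      http:
        paths:
          - path: /
            backend:
              serviceName: my-app  # matches the Service "my-app/my-app" from the logs
              servicePort: 80      # placeholder port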

Technologies and versions:
Kubernetes master version 1.13.12-gke.25
Nodes 1.13.11-gke.14
Nginx-ingress-controller 0.25.1

Thank you for your help; I have no idea what else to try.

2 Answers


In order to mitigate the failed health checks, I would recommend increasing the timeout value of your health checks, or updating your nginx version to 0.26.0, as it looks like a fix for this was put in place in that version. I would also suggest utilizing these optimizations for nginx to reduce buffering time. Keep in mind that these optimizations are not supported by Google.
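
As a sketch of the first suggestion, the probe timeouts can be raised directly on the controller Deployment (the deployment name and namespace below are taken from the pod name in your logs; the exact default probe values depend on the chart version, so treat these numbers as an example rather than the defaults):

# kubectl -n gitlab-managed-apps edit deployment ingress-nginx-ingress-controller
livenessProbe:
  httpGet:
    path: /healthz
    port: 10254          # the controller's default health port
    scheme: HTTP
  timeoutSeconds: 5      # give the controller more time to answer under IO pressure
  periodSeconds: 10
  failureThreshold: 5    # tolerate several slow checks before the pod is restarted
readinessProbe:
  httpGet:
    path: /healthz
    port: 10254
    scheme: HTTP
  timeoutSeconds: 5
  periodSeconds: 10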

Gustavo
  • I can confirm that the error message "read unix @->/tmp/nginx-status-server.sock: i/o timeout" is gone; however, it was replaced with `W 2020-02-16T11:22:59.678039Z Dynamic reconfiguration failed: Post http://127.0.0.1:10246/configuration/backends: dial tcp 127.0.0.1:10246: connect: connection refused` `E 2020-02-16T11:22:59.680374Z Unexpected failure reconfiguring NGINX:` `E 2020-02-16T11:22:59.680519634Z Post http://127.0.0.1:10246/configuration/backends: dial tcp 127.0.0.1:10246: connect: connection refused` – Jan Dominik Feb 17 '20 at 06:53
  • However, when I upload certain specific files (I found that PDF/A files cause the huge logging), huge logs are written and CPU/disk IO goes extremely high. I don't know how to disable those logs; I think this is the root of the problem. – Jan Dominik Feb 17 '20 at 06:53
  • Could you let me know whether you updated your cluster, and to which version? Could you also provide further details on the specific PDF/A file, and try increasing the number of nginx workers you currently have? – Gustavo Feb 17 '20 at 16:25
  • Try using the annotation nginx.ingress.kubernetes.io/proxy-request-buffering: off – c4f4t0r Feb 18 '20 at 00:16
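
That last suggested annotation would sit alongside the ones already listed in the question; a minimal sketch of the Ingress metadata (turning request buffering off makes nginx stream the upload straight to the backend instead of spooling it to disk first):

metadata:
  annotations:
    nginx.ingress.kubernetes.io/proxy-body-size: 15m
    nginx.ingress.kubernetes.io/proxy-request-buffering: "off"   # stream uploads instead of buffering them in nginx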

Looks like I've solved the problem. Nginx-ingress also includes ModSecurity (a WAF), which had many rules enabled. After disabling ModSecurity, the huge logs are gone and so far everything seems to work. I can now successfully upload twenty 30 MB files at the same time without any issue in the logs and without any disk I/O spike. I will update this answer at the end of the week if it keeps working without any issue in the long term.
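
For anyone else hitting this: ModSecurity is toggled through the controller's ConfigMap (with a GitLab-managed chart the same keys go under controller.config in the Helm values). A sketch of the change, assuming the ConfigMap is named after the controller deployment seen in the logs:

apiVersion: v1
kind: ConfigMap
metadata:
  name: ingress-nginx-ingress-controller   # assumed name; check with: kubectl -n gitlab-managed-apps get configmaps
  namespace: gitlab-managed-apps
data:
  enable-modsecurity: "false"              # turn off the ModSecurity WAF module
  enable-owasp-modsecurity-crs: "false"    # and the OWASP core rule set whose rules produced the huge logs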