OpenShift 3.11 - after upgrading from Spring Boot 1.4.5 to 2.6.1 we are observing intermittent liveness probe timeouts with the warning below:
Liveness probe failed: Get http://172.40.23.99:8090/monitoring/health: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
Traffic is very low and memory/CPU/thread usage is well within the configured limit thresholds. The issue reproduces on different cluster compute nodes.
The deployment configuration, hardware and resources were not changed as part of the upgrade.
Liveness probe configuration from the deployment:
Liveness: http-get http://:8090/monitoring/health delay=90s timeout=3s period=50s #success=1 #failure=5
Docker base image:
"name": "redhat-openjdk-18/openjdk18-openshift","version": "1.12"
According to the access logs the health checks complete within milliseconds, well under the 3-second liveness timeout:
- 10.131.4.1 - - [11/Sep/2022:14:22:07 +0000] "GET /monitoring/health HTTP/1.1" 200 907 13
- 10.131.4.1 - - [11/Sep/2022:14:22:57 +0000] "GET /monitoring/health HTTP/1.1" 200 907 21
- 10.131.4.1 - - [11/Sep/2022:14:23:47 +0000] "GET /monitoring/health HTTP/1.1" 200 907 9
- 10.131.4.1 - - [11/Sep/2022:14:24:37 +0000] "GET /monitoring/health HTTP/1.1" 200 907 19
- 10.131.4.1 - - [11/Sep/2022:14:25:27 +0000] "GET /monitoring/health HTTP/1.1" 200 907 8
I tried disabling all the components checked by the actuator health endpoint (db, redis, diskspace, ping, refresh, ...) - same behaviour (see the configuration sketch below).
One important observation: when scaling up (adding more instances) the warning disappears, and when all incoming traffic is blocked the warning also stops. It seems the issue is somehow resource related and something is being choked periodically, yet all available metrics look fine. Any suggestions?