I'm running out of ideas about what's causing the trouble here.
My setup:
- A Kubernetes (v1.26) cluster with one master node and one worker, self-deployed on VMs
- An Nginx ingress controller acting as the reverse proxy (currently running on the master)
- A basic FastAPI pod, with the Deployment, Service and Ingress YAML below
I have the exact same environment deployed at another cloud provider, with no trouble at all.
Here, everything works fine for a while: the API is reachable through the browser, then it starts failing with a 504 Gateway Timeout error. Restarting the Nginx pod fixes the issue again for an undetermined period. I have watched the connection fail and come back a few minutes apart; at the time of writing it has been working for an hour without interruption.
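For clarity, "restarting the Nginx pod" just means deleting the ingress controller pod so its Deployment recreates it, roughly like this (namespace, label and Deployment name are the ingress-nginx defaults and may differ depending on how the controller was installed):

kubectl -n ingress-nginx delete pod -l app.kubernetes.io/component=controller
# or restart the whole controller Deployment instead:
kubectl -n ingress-nginx rollout restart deployment ingress-nginx-controller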
Here are the Nginx logs between a successful request and a timeout:
X.X.X.X - - [09/Feb/2023:12:30:18 +0000] "GET /docs HTTP/1.1" 200 952 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:108.0) Gecko/20100101 Firefox/108.0" 373 0.019 [my-app-8005] [] 172.16.180.6:8005 952 0.019 200 22cd1b13ef2dcbf4b1be2983649f658c
X.X.X.X - - [09/Feb/2023:12:30:19 +0000] "GET /openapi.json HTTP/1.1" 200 5868 "http://xxxx/docs" "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:108.0) Gecko/20100101 Firefox/108.0" 323 0.003 [my-app-8005] [] 172.16.180.6:8005 5868 0.003 200 46551c8481d446ec69de2399f49b7f86
I0209 12:31:13.983933 7 queue.go:87] "queuing" item="&ObjectMeta{Name:sync status,GenerateName:,Namespace:,SelfLink:,UID:,ResourceVersion:,Generation:0,CreationTimestamp:0001-01-01 00:00:00 +0000 UTC,DeletionTimestamp:<nil>,DeletionGracePeriodSeconds:nil,Labels:map[string]string{},Annotations:map[string]string{},OwnerReferences:[]OwnerReference{},Finalizers:[],ManagedFields:[]ManagedFieldsEntry{},}"
I0209 12:31:13.984018 7 queue.go:128] "syncing" key="&ObjectMeta{Name:sync status,GenerateName:,Namespace:,SelfLink:,UID:,ResourceVersion:,Generation:0,CreationTimestamp:0001-01-01 00:00:00 +0000 UTC,DeletionTimestamp:<nil>,DeletionGracePeriodSeconds:nil,Labels:map[string]string{},Annotations:map[string]string{},OwnerReferences:[]OwnerReference{},Finalizers:[],ManagedFields:[]ManagedFieldsEntry{},}"
I0209 12:31:13.990418 7 status.go:275] "skipping update of Ingress (no change)" namespace="namespace" ingress="app-ingress-xxxx"
I0209 12:32:13.983857 7 queue.go:87] "queuing" item="&ObjectMeta{Name:sync status,GenerateName:,Namespace:,SelfLink:,UID:,ResourceVersion:,Generation:0,CreationTimestamp:0001-01-01 00:00:00 +0000 UTC,DeletionTimestamp:<nil>,DeletionGracePeriodSeconds:nil,Labels:map[string]string{},Annotations:map[string]string{},OwnerReferences:[]OwnerReference{},Finalizers:[],ManagedFields:[]ManagedFieldsEntry{},}"
I0209 12:32:13.983939 7 queue.go:128] "syncing" key="&ObjectMeta{Name:sync status,GenerateName:,Namespace:,SelfLink:,UID:,ResourceVersion:,Generation:0,CreationTimestamp:0001-01-01 00:00:00 +0000 UTC,DeletionTimestamp:<nil>,DeletionGracePeriodSeconds:nil,Labels:map[string]string{},Annotations:map[string]string{},OwnerReferences:[]OwnerReference{},Finalizers:[],ManagedFields:[]ManagedFieldsEntry{},}"
I0209 12:32:13.990895 7 status.go:275] "skipping update of Ingress (no change)" namespace="namespace" ingress="app-ingress-xxxx"
2023/02/09 12:32:59 [error] 30#30: *4409 upstream timed out (110: Operation timed out) while connecting to upstream, client: X.X.X.X , server: xxxx, request: "GET /docs HTTP/1.1", upstream: "http://172.16.180.6:8005/docs", host: "xxxx"
2023/02/09 12:33:04 [error] 30#30: *4409 upstream timed out (110: Operation timed out) while connecting to upstream, client: X.X.X.X , server: xxxx, request: "GET /docs HTTP/1.1", upstream: "http://172.16.180.6:8005/docs", host: "xxxx"
2023/02/09 12:33:09 [error] 30#30: *4409 upstream timed out (110: Operation timed out) while connecting to upstream, client: X.X.X.X , server: xxxx, request: "GET /docs HTTP/1.1", upstream: "http://172.16.180.6:8005/docs", host: "xxxx"
X.X.X.X - - [09/Feb/2023:12:33:09 +0000] "GET /docs HTTP/1.1" 504 160 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:108.0) Gecko/20100101 Firefox/108.0" 373 15.004 [my-app-8005] [] 172.16.180.6:8005, 172.16.180.6:8005, 172.16.180.6:8005 0, 0, 0 5.001, 5.001, 5.001 504, 504, 504 56fb622d8d89d8d7b3cdbc4a094215c3
YAML config files:
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: app-ingress-xxxx
spec:
  ingressClassName: nginx
  rules:
  - host: xxxx
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: my-app
            port:
              number: 8005
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
  namespace: namespace
spec:
  progressDeadlineSeconds: 3600
  replicas: 1
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
      - name: backend
        image: xxxx
        imagePullPolicy: Always
        ports:
        - containerPort: 8005
      imagePullSecrets:
      - name: xxxx
---
apiVersion: v1
kind: Service
metadata:
  name: my-app
  namespace: namespace
  labels:
    app: my-app
spec:
  type: NodePort
  ports:
  - nodePort: 30008
    port: 8005
    protocol: TCP
  selector:
    app: my-app
I changed the app names and IPs for publishing here.
I noticed that during the timeouts, while requests through Nginx failed, I could still reach the app at worker-ip:nodePort, and I could SSH onto the master and curl the FastAPI pod via its ClusterIP.
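For reference, these are roughly the checks I ran during a timeout (the placeholders stand for the real worker IP and the Service ClusterIP reported by kubectl get svc; the ports come from the Service above):

# from my workstation, through the NodePort on the worker:
curl http://<worker-ip>:30008/docs
# from an SSH session on the master, through the Service ClusterIP:
curl http://<cluster-ip>:8005/docs
# both kept answering 200 while requests through Nginx returned 504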
My first guess would be memory issues, even though nothing else is running on the server right now. I just installed the Kubernetes metrics API and I'm currently waiting to catch the downtime again; no problem so far.
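In the meantime, this is what I'm watching now that the metrics API is installed (standard metrics-server commands; the ingress-nginx namespace is the default and may differ):

# node-level CPU/memory:
kubectl top nodes
# pod-level usage for the app and for the ingress controller:
kubectl top pods -n namespace
kubectl top pods -n ingress-nginx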
What could be the cause of this behavior? Thanks for any suggestions on what to check further!