
I'm running out of ideas about what's causing trouble here.

My setup:

  • A Kubernetes (v1.26) cluster with one master node and one worker, self-deployed on VMs
  • An Nginx reverse proxy (currently on the master)
  • A basic FastAPI pod, with the Deployment, Service and Ingress YAML below

I have the exact same environment deployed on another cloud provider, with no trouble at all.

Here, everything works fine for a while: the API is accessible through the browser, then it fails with a 504 Gateway Timeout error. Restarting the Nginx pod fixes the issue for another undetermined period. I have witnessed the connection failing and then working again a few minutes apart; at the time of writing, it has been working fine for an hour without interruption.

Here are the nginx logs between a successful request and a timeout:

X.X.X.X - - [09/Feb/2023:12:30:18 +0000] "GET /docs HTTP/1.1" 200 952 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:108.0) Gecko/20100101 Firefox/108.0" 373 0.019 [my-app-8005] [] 172.16.180.6:8005 952 0.019 200 22cd1b13ef2dcbf4b1be2983649f658c
X.X.X.X  - - [09/Feb/2023:12:30:19 +0000] "GET /openapi.json HTTP/1.1" 200 5868 "http://xxxx/docs" "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:108.0) Gecko/20100101 Firefox/108.0" 323 0.003 [my-app-8005] [] 172.16.180.6:8005 5868 0.003 200 46551c8481d446ec69de2399f49b7f86
I0209 12:31:13.983933       7 queue.go:87] "queuing" item="&ObjectMeta{Name:sync status,GenerateName:,Namespace:,SelfLink:,UID:,ResourceVersion:,Generation:0,CreationTimestamp:0001-01-01 00:00:00 +0000 UTC,DeletionTimestamp:<nil>,DeletionGracePeriodSeconds:nil,Labels:map[string]string{},Annotations:map[string]string{},OwnerReferences:[]OwnerReference{},Finalizers:[],ManagedFields:[]ManagedFieldsEntry{},}"
I0209 12:31:13.984018       7 queue.go:128] "syncing" key="&ObjectMeta{Name:sync status,GenerateName:,Namespace:,SelfLink:,UID:,ResourceVersion:,Generation:0,CreationTimestamp:0001-01-01 00:00:00 +0000 UTC,DeletionTimestamp:<nil>,DeletionGracePeriodSeconds:nil,Labels:map[string]string{},Annotations:map[string]string{},OwnerReferences:[]OwnerReference{},Finalizers:[],ManagedFields:[]ManagedFieldsEntry{},}"
I0209 12:31:13.990418       7 status.go:275] "skipping update of Ingress (no change)" namespace="namespace" ingress="app-ingress-xxxx"
I0209 12:32:13.983857       7 queue.go:87] "queuing" item="&ObjectMeta{Name:sync status,GenerateName:,Namespace:,SelfLink:,UID:,ResourceVersion:,Generation:0,CreationTimestamp:0001-01-01 00:00:00 +0000 UTC,DeletionTimestamp:<nil>,DeletionGracePeriodSeconds:nil,Labels:map[string]string{},Annotations:map[string]string{},OwnerReferences:[]OwnerReference{},Finalizers:[],ManagedFields:[]ManagedFieldsEntry{},}"
I0209 12:32:13.983939       7 queue.go:128] "syncing" key="&ObjectMeta{Name:sync status,GenerateName:,Namespace:,SelfLink:,UID:,ResourceVersion:,Generation:0,CreationTimestamp:0001-01-01 00:00:00 +0000 UTC,DeletionTimestamp:<nil>,DeletionGracePeriodSeconds:nil,Labels:map[string]string{},Annotations:map[string]string{},OwnerReferences:[]OwnerReference{},Finalizers:[],ManagedFields:[]ManagedFieldsEntry{},}"
I0209 12:32:13.990895       7 status.go:275] "skipping update of Ingress (no change)" namespace="namespace" ingress="app-ingress-xxxx"
2023/02/09 12:32:59 [error] 30#30: *4409 upstream timed out (110: Operation timed out) while connecting to upstream, client: X.X.X.X , server: xxxx, request: "GET /docs HTTP/1.1", upstream: "http://172.16.180.6:8005/docs", host: "xxxx"
2023/02/09 12:33:04 [error] 30#30: *4409 upstream timed out (110: Operation timed out) while connecting to upstream, client: X.X.X.X , server: xxxx, request: "GET /docs HTTP/1.1", upstream: "http://172.16.180.6:8005/docs", host: "xxxx"
2023/02/09 12:33:09 [error] 30#30: *4409 upstream timed out (110: Operation timed out) while connecting to upstream, client: X.X.X.X , server: xxxx, request: "GET /docs HTTP/1.1", upstream: "http://172.16.180.6:8005/docs", host: "xxxx"
X.X.X.X  - - [09/Feb/2023:12:33:09 +0000] "GET /docs HTTP/1.1" 504 160 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:108.0) Gecko/20100101 Firefox/108.0" 373 15.004 [my-app-8005] [] 172.16.180.6:8005, 172.16.180.6:8005, 172.16.180.6:8005 0, 0, 0 5.001, 5.001, 5.001 504, 504, 504 56fb622d8d89d8d7b3cdbc4a094215c3

YAML config files:

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: app-ingress-xxxx
spec:
  ingressClassName: nginx
  rules:
  - host: xxxx
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: my-app
            port: 
              number: 8005
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
  namespace: namespace
spec:
  progressDeadlineSeconds: 3600
  replicas: 1
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
      - name: backend
        image: xxxx
        imagePullPolicy: Always
        ports:
        - containerPort: 8005
      imagePullSecrets:
      - name: xxxx

---
apiVersion: v1
kind: Service
metadata:
  name: my-app
  namespace: namespace
  labels:
    app: my-app
spec:
  type: NodePort
  ports:
  - nodePort: 30008
    port: 8005
    protocol: TCP
  selector:
    app: my-app

I changed the app names and IPs for publishing here.

I noticed that during the timeouts, when querying through nginx, I could still access the API using the worker-ip:nodePort address, and I could SSH onto the master and curl the FastAPI pod using the ClusterIP.

My first guess would be memory issues, even though nothing else is running on the server right now. I just installed the Kubernetes metrics API and I'm currently waiting for the downtime to happen again; no problem so far.

What could be the cause of such behavior? Thanks for any suggestions on what to check further!

peppie
  • The timeouts stopped after this post, without any configuration change. I can only suspect RAM usage issues, but I have not been able to reproduce the issue so far. – peppie Mar 27 '23 at 11:50

1 Answer


If you are getting 504 Gateway Timeout errors, your system might be low on resources; increasing the resources of your environment can resolve this. A 504 error means nginx has waited too long for a response from the upstream and has timed out. Additionally, you may need to add ingress annotations to the YAML config file to raise the proxy timeouts. By default, proxy_read_timeout is 60s:

Syntax: proxy_read_timeout time;
Default: proxy_read_timeout 60s;
Context: http, server, location

Defines a timeout for reading a response from the proxied server. The timeout is set only between two successive read operations, not for the transmission of the whole response. If the proxied server does not transmit anything within this time, the connection is closed. For more information, refer to the nginx documentation.
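
For the ingress-nginx controller (which the controller log lines above appear to come from), these timeouts can be set per Ingress through annotations. Below is only a sketch of the Ingress from the question with example values to adjust; note that the errors in your log are raised while connecting to the upstream after roughly 5 seconds, so the connect timeout may be the relevant one here:

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: app-ingress-xxxx
  annotations:
    # Values are in seconds; the controller defaults are 5s for connect and 60s for read/send.
    nginx.ingress.kubernetes.io/proxy-connect-timeout: "30"
    nginx.ingress.kubernetes.io/proxy-read-timeout: "300"
    nginx.ingress.kubernetes.io/proxy-send-timeout: "300"
spec:
  ingressClassName: nginx
  rules:
  - host: xxxx
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: my-app
            port:
              number: 8005

The same values can also be applied cluster-wide through the ingress-nginx ConfigMap keys proxy-connect-timeout, proxy-read-timeout and proxy-send-timeout.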

Mayur Kamble