We run a Kubernetes cluster provisioned with Kubespray and discovered that whenever a node goes down (which happened recently due to a hardware issue), the pods running on that node get stuck in the Terminating state indefinitely. Even after many hours the pods are not rescheduled onto healthy nodes, so our entire application malfunctions and users are affected for a prolonged period of time.
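To illustrate the symptom, this is roughly what we observe after a node failure (the READY, RESTARTS, and AGE values here are only illustrative; the pod name follows from the StatefulSet manifest below):

$ kubectl get pods -n project-stock
NAME      READY   STATUS        RESTARTS   AGE
ps-ra-0   1/1     Terminating   0          2d

The pod stays like this indefinitely, and no replacement pod is created on a healthy node.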
How can we configure Kubernetes to fail over automatically in situations like this?
Below is our StatefulSet manifest:
apiVersion: apps/v1
kind: StatefulSet
metadata:
  namespace: project-stock
  name: ps-ra
spec:
  selector:
    matchLabels:
      infrastructure: ps
      application: report-api
      environment: staging
  serviceName: hl-ps-report-api
  replicas: 1
  template:
    metadata:
      namespace: project-stock
      labels:
        infrastructure: ps
        application: report-api
        environment: staging
    spec:
      terminationGracePeriodSeconds: 10
      containers:
        - name: ps-report-api
          image: localhost:5000/ps/nodejs-chrome-application:latest
          ports:
            - containerPort: 3000
              protocol: TCP
              name: nodejs-rest-api
          volumeMounts:
          resources:
            limits:
              cpu: 1000m
              memory: 8192Mi
            requests:
              cpu: 333m
              memory: 8192Mi
          livenessProbe:
            httpGet:
              path: /health/
              port: 3000
            initialDelaySeconds: 180
            periodSeconds: 10
            failureThreshold: 12
            timeoutSeconds: 10