We have written a custom predictor for KServe. The model is loaded directly onto the GPU, and because of its size it usually takes 3–4 minutes to load.
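For context, the predictor follows the standard kserve.Model pattern, roughly like the sketch below. This is a simplified stand-in, not our actual code: the class name is a placeholder and time.sleep simulates the slow GPU load. The point is that self.ready only becomes True once load() finishes, so the container's health endpoint reports ready only after the model is loaded.

import time
from kserve import Model, ModelServer


class CustomModel(Model):
    """Placeholder for our real predictor, which loads a large model onto the GPU."""

    def __init__(self, name: str):
        super().__init__(name)
        self.model = None
        self.ready = False

    def load(self):
        # In the real predictor this moves the model weights onto the GPU
        # and takes roughly 3-4 minutes; time.sleep stands in for that here.
        time.sleep(200)
        self.model = lambda instances: instances  # dummy stand-in model
        self.ready = True  # health/readiness endpoint reports ready only after this

    def predict(self, payload, headers=None):
        return {"predictions": self.model(payload["instances"])}


if __name__ == "__main__":
    model = CustomModel("custom-model")
    model.load()
    ModelServer().start([model])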
Here is how our InferenceService looks:
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: custom-model
spec:
  predictor:
    containers:
      - name: kserve-container
        image: private-registry/kserve:1.0.3
        readinessProbe:
          httpGet:
            path: /
            port: 8080
          failureThreshold: 100
          initialDelaySeconds: 300
          periodSeconds: 300
        resources:
          requests:
            nvidia.com/gpu: 1
            cpu: 6000m
            memory: 16Gi
          limits:
            nvidia.com/gpu: 1
            cpu: 6000m
            memory: 16Gi
However, the readiness probe is not taken into account by the queue-proxy container, which starts calling the model container as soon as it comes up.
Is there a way to make the queue-proxy wait until the model has finished loading?