
We have written a custom predictor for KServe. The model is loaded directly onto the GPU, and because of its size it usually takes 3–4 minutes to load.

Here is what our InferenceService looks like:

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: custom-model
spec:
  predictor:
    containers:
      - name: kserve-container
        image: private-registry/kserve:1.0.3
        readinessProbe:
          httpGet:
            path: /
            port: 8080
          failureThreshold: 100
          initialDelaySeconds: 300
          periodSeconds: 300
        resources:
          requests:
            nvidia.com/gpu: 1
            cpu: 6000m
            memory: 16Gi
          limits:
            nvidia.com/gpu: 1
            cpu: 6000m
            memory: 16Gi

But the readiness probe is not taken into account by the queue-proxy container, which starts probing the model container as soon as it starts.


Is there a way to pause the queue-proxy to wait for the model to load?

aesher9o1

1 Answer


queue-proxy will eagerly (read: aggressively) probe the user container during startup (particularly when cold-starting a container), on the general notion that there is a caller on the other end of an HTTP request who is waiting for the pod to become ready.

Once the initial startup probing is complete, I expect that queue-proxy will probe less aggressively.

Is the concern about the probes, or about the log messages indicating that the probes are failing (and filling up your logs)?

E. Anderson
  • We are using a large model that takes about 15min to load. Queue proxy marks the container unhealthy till that time and restarts the container. We are never able to load the model because of this as it always restarts. – aesher9o1 Nov 06 '22 at 04:23
  • You might need to add a startup probe or extend the duration of your readiness probe to allow for the 15 minute startup. – E. Anderson Nov 07 '22 at 14:39
  • It's possible that a 15 minute startup wasn't considered in the current code, so if neither of those adjustments work, I'd file a bug at https://github.com/knative/serving/issues/new – E. Anderson Nov 07 '22 at 14:41
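The startup-probe suggestion above could look something like the following. This is only a sketch: it assumes your Kubernetes and Knative/KServe versions honor a `startupProbe` on the serving container (older Knative releases did not pass it through), and the thresholds shown are illustrative, not values from the question. While the startup probe is running, the kubelet suspends the readiness probe, and the container is not restarted until the startup probe's budget (failureThreshold × periodSeconds) is exhausted.

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: custom-model
spec:
  predictor:
    containers:
      - name: kserve-container
        image: private-registry/kserve:1.0.3
        # Startup probe: allows up to 20 minutes (120 x 10s)
        # for the model to load before any restart is considered.
        startupProbe:
          httpGet:
            path: /
            port: 8080
          failureThreshold: 120
          periodSeconds: 10
        # Readiness probe takes over only after the startup
        # probe succeeds, so it can be much tighter.
        readinessProbe:
          httpGet:
            path: /
            port: 8080
          failureThreshold: 3
          periodSeconds: 30
```

If `startupProbe` is not honored by your version, the equivalent workaround is to stretch the readiness probe itself so that failureThreshold × periodSeconds comfortably exceeds the 15 minute load time.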