
I've been running into an issue where, every once in a while, Knative becomes unable to create new Deployments, then spontaneously recovers within a few hours and creates them. Until then, the following errors keep playing out in the serving components. It feels to me like the requests to the Kubernetes API service are timing out, but I cannot tell why.

Expected Behavior

On making updates to a Service, I expect the deployment of the new revision to work.

Actual Behavior

Occasionally, while making valid changes (e.g. changing the value of an annotation), Knative becomes unable to deploy a new revision and gets stuck constantly trying to reconcile it for hours before spontaneously recovering.

$ kn revision list -A
NAMESPACE   NAME            SERVICE   TRAFFIC   TAGS      GENERATION   AGE         CONDITIONS   READY     REASON
knative     service-00033   service                       33           <invalid>   0 OK / 3     Unknown   Deploying
knative     service-00032   service   100%      primary   32           <invalid>   4 OK / 4     True

In the controller logs I see the following "context deadline exceeded" error while the controller tries to POST to the Kubernetes API service IP:

{
  "insertId": "plhs429mzmf9nh5f",
  "jsonPayload": {
    "logger": "controller.event-broadcaster",
    "caller": "record/event.go:285",
    "knative.dev/pod": "controller-8c6b99cb7-7zg6n",
    "commit": "484e848",
    "message": "Event(v1.ObjectReference{Kind:\"Revision\", Namespace:\"knative\", Name:\"service-00033\", UID:\"8a09a3ff-655e-4e5f-b8d4-1a4886ab0678\", APIVersion:\"serving.knative.dev/v1\", ResourceVersion:\"1844291799\", FieldPath:\"\"}): type: 'Warning' reason: 'InternalError' failed to create deployment \"service-api-00033-deployment\": Post \"https://10.123.20.1:443/apis/apps/v1/namespaces/knative/deployments\": context deadline exceeded",
    "timestamp": "2023-06-30T09:57:08.7332053Z"
  }
}

and right before it, the following in the webhook logs:

{
  "insertId": "k078pd2dmx16qrr7",
  "jsonPayload": {
    "knative.dev/pod": "webhook-d44b476b8-89gbx",
    "message": "Failed the resource specific validation",
    "knative.dev/operation": "UPDATE",
    "logger": "webhook",
    "knative.dev/name": "service",
    "knative.dev/subresource": "",
    "knative.dev/namespace": "knative",
    "knative.dev/kind": "serving.knative.dev/v1, Kind=Service",
    "knative.dev/resource": "serving.knative.dev/v1, Resource=services",
    "commit": "484e848",
    "knative.dev/userinfo": "system:serviceaccount:service:default",
    "timestamp": "2023-06-30T09:56:38.327880939Z",
    "caller": "validation/validation_admit.go:183",
    "stacktrace": "knative.dev/pkg/webhook/resourcesemantics/validation.validate\n\tknative.dev/pkg@v0.0.0-20230117181655-247510c00e9d/webhook/resourcesemantics/validation/validation_admit.go:183\nknative.dev/pkg/webhook/resourcesemantics/validation.(*reconciler).Admit\n\tknative.dev/pkg@v0.0.0-20230117181655-247510c00e9d/webhook/resourcesemantics/validation/validation_admit.go:79\nknative.dev/pkg/webhook.admissionHandler.func1\n\tknative.dev/pkg@v0.0.0-20230117181655-247510c00e9d/webhook/admission.go:123\nnet/http.HandlerFunc.ServeHTTP\n\tnet/http/server.go:2109\nnet/http.(*ServeMux).ServeHTTP\n\tnet/http/server.go:2487\nknative.dev/pkg/webhook.(*Webhook).ServeHTTP\n\tknative.dev/pkg@v0.0.0-20230117181655-247510c00e9d/webhook/webhook.go:263\nknative.dev/pkg/network/handlers.(*Drainer).ServeHTTP\n\tknative.dev/pkg@v0.0.0-20230117181655-247510c00e9d/network/handlers/drain.go:113\nnet/http.serverHandler.ServeHTTP\n\tnet/http/server.go:2947\nnet/http.(*conn).serve\n\tnet/http/server.go:1991"
  }
}

I'm at a complete loss at this point.

Steps to Reproduce the Problem

Unknown

Desolar1um

1 Answer


I haven't looked at your Service YAML, but I have a hypothesis that this might be related to slow tag-to-digest resolution. You can try the following:

  1. Monitor latency for registry operations, particularly GET operations.

  2. Use image digests when referencing images. Digests look like @sha256:... rather than :latest, and they ensure that the image does not change after deployment (see the sketch after this list).

  3. Disable tag-to-digest resolution (also sketched below). Note that this can lead to unpredictable behavior if a referenced tag is moved: some instances may pick up the new image while other instances keep using an earlier one.
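
To make (2) and (3) concrete, here is a minimal sketch. The image reference and digest are placeholders, and the registries-skipping-tag-resolving key lives in the config-deployment ConfigMap in the knative-serving namespace in the Knative Serving releases I'm familiar with; check the ConfigMap shipped with your version and merge the key into its existing data rather than replacing the whole ConfigMap.

# (2) Reference the image by digest instead of a tag (placeholder image/digest):
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: service
  namespace: knative
spec:
  template:
    spec:
      containers:
        - image: gcr.io/<project>/<image>@sha256:<digest>
---
# (3) Skip tag-to-digest resolution for selected registries so the controller
# never makes the registry round trip during reconciliation:
apiVersion: v1
kind: ConfigMap
metadata:
  name: config-deployment
  namespace: knative-serving
data:
  # Comma-separated list of registries to leave unresolved.
  registries-skipping-tag-resolving: "gcr.io"

With either of these in place the controller no longer has to wait on the registry before it can create the Deployment, which is why I suspect tag resolution in your timeout.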

If this is tag-to-digest resolution and you're using public Docker Hub images, adding pull credentials to the service account that runs the Knative Service might give you higher rate limits (sketch below).
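
For the pull-credentials route, something along these lines should work. The secret name and registry values are placeholders, Docker Hub is only shown as an example, and the default service account applies only if your Service spec doesn't set spec.template.spec.serviceAccountName:

# Create a registry credential secret in the namespace of the Knative Service:
kubectl create secret docker-registry registry-creds \
  --docker-server=https://index.docker.io/v1/ \
  --docker-username=<user> \
  --docker-password=<token> \
  --namespace knative

# Attach it to the service account the Knative Service runs under, so both the
# kubelet and the controller's tag resolution can authenticate to the registry:
kubectl patch serviceaccount default --namespace knative \
  -p '{"imagePullSecrets": [{"name": "registry-creds"}]}'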

E. Anderson
  • Thank you for your reply. Could you please elaborate on why you think it could be this? I am using a custom GCR image and not using digests, so I will look into this as a possibility; however, I don't understand how you developed this hypothesis from the logs I've provided, since to me they weren't signaling anything. – Desolar1um Jul 03 '23 at 07:08
  • For the sake of clarity I'd like to emphasize that Knative never gets as far as deploying a single pod for the new revision. The problem is that it cannot even create the Deployment Kubernetes object @E. Anderson – Desolar1um Jul 03 '23 at 07:19
  • I think I understand where you are coming from now. The first "Failed the resource specific validation" error occurs within seconds of me updating my Service, however. That feels like too quick a failure for tag resolution. – Desolar1um Jul 03 '23 at 08:59