I've been running into an issue where, every once in a while, Knative becomes unable to create new Deployments, then spontaneously recovers within a few hours and creates them. Until it recovers, the errors below keep recurring in the serving components. It looks to me like requests to the Kubernetes service are timing out, but I cannot tell why.
Expected Behavior
After updating a service, the new revision should deploy successfully.
Actual Behavior
Occasionally, after a valid change (for example, changing the value of an annotation), Knative becomes unable to deploy the new revision. It gets stuck constantly trying to reconcile the revision for hours before spontaneously recovering.
$ kn revision list -A
NAMESPACE   NAME            SERVICE   TRAFFIC   TAGS      GENERATION   AGE         CONDITIONS   READY     REASON
knative     service-00033   service                       33           <invalid>   0 OK / 3     Unknown   Deploying
knative     service-00032   service   100%      primary   32           <invalid>   4 OK / 4     True
In the controller logs I see the following context deadline exceeded error while trying to post to the Kubernetes service IP:
{
"insertId": "plhs429mzmf9nh5f",
"jsonPayload": {
"logger": "controller.event-broadcaster",
"caller": "record/event.go:285",
"knative.dev/pod": "controller-8c6b99cb7-7zg6n",
"commit": "484e848",
"message": "Event(v1.ObjectReference{Kind:\"Revision\", Namespace:\"knative\", Name:\"service-00033\", UID:\"8a09a3ff-655e-4e5f-b8d4-1a4886ab0678\", APIVersion:\"serving.knative.dev/v1\", ResourceVersion:\"1844291799\", FieldPath:\"\"}): type: 'Warning' reason: 'InternalError' failed to create deployment \"service-api-00033-deployment\": Post \"https://10.123.20.1:443/apis/apps/v1/namespaces/knative/deployments\": context deadline exceeded",
"timestamp": "2023-06-30T09:57:08.7332053Z"
}
Right before it, the webhook logs show the following:
{
"insertId": "k078pd2dmx16qrr7",
"jsonPayload": {
"knative.dev/pod": "webhook-d44b476b8-89gbx",
"message": "Failed the resource specific validation",
"knative.dev/operation": "UPDATE",
"logger": "webhook",
"knative.dev/name": "service",
"knative.dev/subresource": "",
"knative.dev/namespace": "knative",
"knative.dev/kind": "serving.knative.dev/v1, Kind=Service",
"knative.dev/resource": "serving.knative.dev/v1, Resource=services",
"commit": "484e848",
"knative.dev/userinfo": "system:serviceaccount:service:default",
"timestamp": "2023-06-30T09:56:38.327880939Z",
"caller": "validation/validation_admit.go:183",
"stacktrace": "knative.dev/pkg/webhook/resourcesemantics/validation.validate\n\tknative.dev/pkg@v0.0.0-20230117181655-247510c00e9d/webhook/resourcesemantics/validation/validation_admit.go:183\nknative.dev/pkg/webhook/resourcesemantics/validation.(*reconciler).Admit\n\tknative.dev/pkg@v0.0.0-20230117181655-247510c00e9d/webhook/resourcesemantics/validation/validation_admit.go:79\nknative.dev/pkg/webhook.admissionHandler.func1\n\tknative.dev/pkg@v0.0.0-20230117181655-247510c00e9d/webhook/admission.go:123\nnet/http.HandlerFunc.ServeHTTP\n\tnet/http/server.go:2109\nnet/http.(*ServeMux).ServeHTTP\n\tnet/http/server.go:2487\nknative.dev/pkg/webhook.(*Webhook).ServeHTTP\n\tknative.dev/pkg@v0.0.0-20230117181655-247510c00e9d/webhook/webhook.go:263\nknative.dev/pkg/network/handlers.(*Drainer).ServeHTTP\n\tknative.dev/pkg@v0.0.0-20230117181655-247510c00e9d/network/handlers/drain.go:113\nnet/http.serverHandler.ServeHTTP\n\tnet/http/server.go:2947\nnet/http.(*conn).serve\n\tnet/http/server.go:1991"
}
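The gap between the webhook entry and the controller entry can be worked out from their `timestamp` fields. The snippet below is a minimal Python sketch: `parse_log_timestamp` is a hypothetical helper written only for this illustration, and it truncates the timestamps to microseconds because `datetime` cannot hold nanoseconds.

```python
from datetime import datetime, timezone


def parse_log_timestamp(ts: str) -> datetime:
    """Parse an RFC 3339 UTC timestamp, truncating sub-microsecond digits."""
    if not ts.endswith("Z"):
        raise ValueError(f"expected a UTC timestamp ending in 'Z': {ts!r}")
    whole, _, frac = ts[:-1].partition(".")
    # datetime only stores microseconds, so keep at most 6 fractional digits.
    micros = int(frac[:6].ljust(6, "0")) if frac else 0
    base = datetime.strptime(whole, "%Y-%m-%dT%H:%M:%S")
    return base.replace(microsecond=micros, tzinfo=timezone.utc)


# Timestamps copied from the two log entries in this report.
webhook_ts = parse_log_timestamp("2023-06-30T09:56:38.327880939Z")
controller_ts = parse_log_timestamp("2023-06-30T09:57:08.7332053Z")

gap_seconds = (controller_ts - webhook_ts).total_seconds()
print(f"{gap_seconds:.3f}s")  # prints "30.405s"
```

The two entries are about 30.4 seconds apart, which would be consistent with a request hitting a timeout, although which timeout (if any) is involved is not confirmed.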
I am at a complete loss at this point.
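One way to check whether requests to the Kubernetes service IP really are slow or timing out is to time a plain HTTPS request from a pod in the cluster while a revision is stuck. The following is a rough Python sketch, not a verified diagnosis: `probe` is a hypothetical helper, the `/readyz` path in the usage note assumes the API server exposes its standard health endpoint to the caller, and TLS verification is made optional only because the sketch does not load the cluster CA.

```python
import ssl
import time
import urllib.error
import urllib.request


def probe(url: str, timeout_s: float = 5.0, verify_tls: bool = True):
    """Time one GET request; return (outcome, elapsed_seconds)."""
    ctx = ssl.create_default_context()
    if not verify_tls:
        # Assumption: acceptable for a one-off latency check only.
        ctx.check_hostname = False
        ctx.verify_mode = ssl.CERT_NONE

    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout_s, context=ctx) as resp:
            outcome = f"HTTP {resp.status}"
    except urllib.error.HTTPError as err:
        # The server answered, just not with a 2xx status.
        outcome = f"HTTP {err.code}"
    except OSError as err:
        # Covers URLError, socket timeouts, TLS and connection errors.
        outcome = f"error: {err}"
    return outcome, time.monotonic() - start
```

For example, calling `probe("https://10.123.20.1:443/readyz", timeout_s=5.0, verify_tls=False)` repeatedly during an incident would show whether the connection itself stalls. An HTTP 401 or 403 still counts as a fast answer; only the elapsed time matters here.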
Steps to Reproduce the Problem
Unknown