
We are using Linkerd 2.11.1 on Azure AKS Kubernetes. Amongst others, there is a Deployment using an Alpine Linux image containing Apache/mod_php/PHP 8 serving an API. HTTPS is terminated by Traefik v2 with cert-manager, so incoming traffic to the APIs is on port 80. The Linkerd proxy container is injected as a sidecar.
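
For completeness, the sidecar injection itself is enabled the usual way via the linkerd.io/inject annotation, roughly like this in the same Terraform syntax (shown on the namespace here as a sketch only; it can just as well go on the pod template):

resource "kubernetes_namespace" "my_api" {
  metadata {
    name = "my-api"
    annotations = {
      # Tells the Linkerd proxy injector to add the sidecar to all pods in this namespace
      "linkerd.io/inject" = "enabled"
    }
  }
}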

Recently I noticed that the API containers return 504 errors for a short period of time during a rolling deployment. In the sidecar's log, I found the following:

[ 0.000590s] INFO ThreadId(01) linkerd2_proxy::rt: Using single-threaded proxy runtime
[ 0.001062s] INFO ThreadId(01) linkerd2_proxy: Admin interface on 0.0.0.0:4191
[ 0.001078s] INFO ThreadId(01) linkerd2_proxy: Inbound interface on 0.0.0.0:4143
[ 0.001081s] INFO ThreadId(01) linkerd2_proxy: Outbound interface on 127.0.0.1:4140
[ 0.001083s] INFO ThreadId(01) linkerd2_proxy: Tap interface on 0.0.0.0:4190
[ 0.001085s] INFO ThreadId(01) linkerd2_proxy: Local identity is default.my-api.serviceaccount.identity.linkerd.cluster.local
[ 0.001088s] INFO ThreadId(01) linkerd2_proxy: Identity verified via linkerd-identity-headless.linkerd.svc.cluster.local:8080 (linkerd-identity.linkerd.serviceaccount.identity.linkerd.cluster.local)
[ 0.001090s] INFO ThreadId(01) linkerd2_proxy: Destinations resolved via linkerd-dst-headless.linkerd.svc.cluster.local:8086 (linkerd-destination.linkerd.serviceaccount.identity.linkerd.cluster.local)
[ 0.014676s] INFO ThreadId(02) daemon:identity: linkerd_app: Certified identity: default.my-api.serviceaccount.identity.linkerd.cluster.local
[ 3674.769855s] INFO ThreadId(01) inbound:server{port=80}: linkerd_app_inbound::detect: Handling connection as opaque timeout=linkerd_proxy_http::version::Version protocol detection timed out after 10s

My guess is that this protocol detection somehow leads to the 504 errors. However, if I add the Linkerd inbound-port annotation to the pod template (Terraform syntax):

resource "kubernetes_deployment" "my_api" {
  metadata {
    name = "my-api"
    namespace = "my-api"
    labels = {
      app = "my-api"
    }
  }

  spec {
    replicas = 20
    selector {
      match_labels = {
        app = "my-api"
      }
    }
    template {
      metadata {
        labels = {
          app = "my-api"
        }
        annotations = {
          "config.linkerd.io/inbound-port" = "80"
        }
      }
      # ... container spec unchanged ...
    }
  }
}

I get the following:

time="2022-03-01T14:56:44Z" level=info msg="Found pre-existing key: /var/run/linkerd/identity/end-entity/key.p8"
time="2022-03-01T14:56:44Z" level=info msg="Found pre-existing CSR: /var/run/linkerd/identity/end-entity/csr.der"
[ 0.000547s] INFO ThreadId(01) linkerd2_proxy::rt: Using single-threaded proxy runtime
thread 'main' panicked at 'Failed to bind inbound listener: Os { code: 13, kind: PermissionDenied, message: "Permission denied" }', /github/workspace/linkerd/app/src/lib.rs:195:14
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

Can somebody tell me why it fails to bind the inbound listener?

Any help is much appreciated,

thanks,

Pascal


2 Answers


Found it: Kubernetes asynchronously sends requests to shut down the pods and to stop routing traffic to them. If a pod shuts down faster than it is removed from the endpoint lists, it can still receive requests while it is already dead.

To fix this, I added a preStop lifecycle hook to the application container:

lifecycle {
    pre_stop {
        exec {
            # Keep the container alive briefly so the endpoint removal can propagate
            command = ["/bin/sh", "-c", "sleep 5"]
        }
    }
}

and the following annotation to the pod template:

annotations = {
    "config.alpha.linkerd.io/proxy-wait-before-exit-seconds" = "10"
}
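
Put together, the relevant part of the pod template then looks roughly like this (a sketch only: the image is a placeholder, and the termination grace period just has to stay above the preStop sleep plus the proxy wait; the Kubernetes default is 30 seconds):

    template {
      metadata {
        labels = {
          app = "my-api"
        }
        annotations = {
          # Keep the Linkerd proxy running slightly longer than the app's preStop sleep
          "config.alpha.linkerd.io/proxy-wait-before-exit-seconds" = "10"
        }
      }
      spec {
        # Must exceed the preStop sleep plus the proxy wait (Kubernetes default: 30s)
        termination_grace_period_seconds = 30

        container {
          name  = "my-api"
          image = "my-registry/my-api:latest" # placeholder

          lifecycle {
            pre_stop {
              exec {
                # Give the endpoint controllers time to drop the pod before it stops serving
                command = ["/bin/sh", "-c", "sleep 5"]
              }
            }
          }
        }
      }
    }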

Documented here:

https://linkerd.io/2.11/tasks/graceful-shutdown/

and here:

https://blog.gruntwork.io/delaying-shutdown-to-wait-for-pod-deletion-propagation-445f779a8304

Pascal Paulis

        annotations = {
          "config.linkerd.io/inbound-port" = "80"
        }

I don't think you want this setting. Linkerd will transparently proxy connections without you setting anything.

This setting configures Linkerd's proxy to try to listen on port 80. This would likely conflict with your web server's port configuration; but the specific error you're hitting is that the Linkerd proxy does not run as root and so it does not have permission to bind port 80.
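
For illustration, the injected proxy container ends up with a non-root security context roughly along these lines (a sketch in the same Terraform style; 2102 is the usual default proxy UID, your installation may differ), and an unprivileged process cannot bind ports below 1024:

# Roughly what the injector adds for the proxy container (illustrative only, not something you write yourself)
container {
  name  = "linkerd-proxy"
  image = "cr.l5d.io/linkerd/proxy:stable-2.11.1"

  security_context {
    # Non-root UID: without CAP_NET_BIND_SERVICE the proxy cannot bind ports below 1024,
    # hence "Permission denied" when asked to listen on port 80
    run_as_user = 2102
  }
}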

I'd expect it all to work if you removed that annotation :)

  • Thanks! The error message is gone indeed. I obviously misunderstood that annotation :-) The 504 errors remain, though. I added a preStop lifecycle hook with a "sleep 60" command to give the Kubernetes service enough time to remove the pod. The problem persists, but it allowed me to see that the 504 happens more or less exactly at the moment when the Linkerd proxy receives the shutdown signal after the sleep: "INFO ThreadId(01) linkerd_proxy::signal: received SIGTERM, starting shutdown INFO ThreadId(01) linkerd2_proxy: Received shutdown signal". Any ideas on this? :-) Many thanks! – Pascal Paulis Mar 07 '22 at 10:00
  • Found this: https://linkerd.io/2.10/tasks/graceful-shutdown/ will let you know if it works – Pascal Paulis Mar 07 '22 at 10:06
  • Yep, that did the trick: a sleep on the application container and a longer sleep on the Linkerd proxy. No more 504 errors :-) – Pascal Paulis Mar 07 '22 at 10:41