
When our pods get exhausted and would need to be scaled up, Kubernetes instead confuses the exhaustion with death and restarts them. That of course has the opposite effect: the remaining pods get even more load... So here comes my question: can you serve the Kubernetes and LB liveness and readiness endpoints over dedicated, non-exhausted connections?

We have an older system running in Kubernetes, with one Apache httpd and one Tomcat bundled in each pod. Load balancing is done by Kubernetes between pods, not in httpd. Httpd runs mpm_event + mod_jk with an AJP 1.3 connection to the Tomcat, and it also serves some static resources straight from disk without involving Tomcat. When something fails, we quickly run out of AJP threads and httpd workers.

Basically what we see is this:

  1. The application fails to connect to some resource: the network, Memcached, the DB or some other service starts to time out. Waiting on timeouts makes threads very long-lived, and we run out of them quickly.
  2. Readiness/liveness probes do not respond in time, so Kubernetes restarts the pod (or, after we removed the liveness probe, the LB that uses readiness removes it from load balancing, which has basically the same effect). A probe-tuning sketch follows this list.
  3. The root-cause problem gets solved (somehow), but by now there are too few (or no) pods left in load balancing. When a pod comes back, it is hit by all the traffic, gets exhausted, and is removed from the LB again because it is too slow on the readiness probe.
  4. We now find it very difficult to get out of this state... (So far it has happened twice, and we basically had to cut off all traffic at the Cloudflare WAF until enough pods were restarted and back in load balancing...)
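
For reference, step 2 is governed by the probe timing on the Deployment. A minimal sketch of a more forgiving readiness probe (the path, port and numbers are illustrative assumptions, not our production settings):

    # Hypothetical probe tuning; path, port and values are assumptions.
    readinessProbe:
      httpGet:
        path: /api/v2/health/readiness
        port: 80
      periodSeconds: 10      # probe every 10 seconds
      timeoutSeconds: 5      # a single probe may take up to 5 seconds before it counts as failed
      failureThreshold: 6    # pull the pod from endpoints only after 6 consecutive failures

Loosening these values only buys time, though; it does not fix the underlying worker exhaustion.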

My idea of a solution:

I think I can open a prioritised fast lane from httpd to Tomcat for the liveness and readiness endpoints, see below. But can I somehow dedicate workers in httpd (mpm_event) to these endpoints? Otherwise, when I run out of httpd workers, my fast lane will not offer any help, I guess. Or does anyone have other ideas for how to ensure that we can always serve liveness/readiness as long as the Tomcat is alive, even when it is exhausted?

This is my current httpd worker setup:

<IfModule mpm_event_module>
    StartServers             3
    ServerLimit             36
    MinSpareThreads         75
    MaxSpareThreads        250
    ThreadsPerChild         25
    MaxRequestWorkers      900
    MaxConnectionsPerChild   0
</IfModule>
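
(With ThreadsPerChild 25 and ServerLimit 36, the MaxRequestWorkers cap of 900 corresponds exactly to 36 × 25 threads, so the entire worker budget can be consumed by stalled application requests.)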

Maybe it takes a worker just to analyze the request and figure out the URI... :-/ Or can I somehow dedicate a specific pool of workers to liveness and readiness???

My httpd->Tomcat fast lane:

I was playing around with a second AJP connection to the Tomcat, dedicated to the readiness and liveness endpoints. At a glance, it seems to work.

In server.xml I added a second connector on port 8008, alongside the existing one on 8009:

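    <!-- Existing connector used by ordinary application traffic. -->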
    <Connector
        port="8009"
        protocol="AJP/1.3"
        redirectPort="8443"
        connectionTimeout="60000"
        minSpareThreads="2"
        maxThreads="20"
        acceptorThreadCount="2"
        URIEncoding="UTF-8"
        address="127.0.0.1"
        secretRequired="false" />

    <!--
      This is the prioritized connector used for health checks.
    -->
    <Connector
        port="8008"
        protocol="AJP/1.3"
        redirectPort="8443"
        connectionTimeout="-1"
        keepAliveTimeout="-1"
        acceptorThreadPriority="6"
        minSpareThreads="2"
        maxThreads="5"
        acceptorThreadCount="1"
        URIEncoding="UTF-8"
        address="127.0.0.1"
        secretRequired="false" />

In my workers.properties (the JkWorkersFile) I added a worker for the new connector and named it ajp13prio:

worker.list=ajp13,ajp13prio
worker.ajp13.type=ajp13
worker.ajp13.port=8009
worker.ajp13.host=127.0.0.1
worker.ajp13.lbfactor=1
worker.ajp13prio.type=ajp13
worker.ajp13prio.port=8008
worker.ajp13prio.host=127.0.0.1
worker.ajp13prio.lbfactor=1

In my httpd conf I configured the probes to use the new connector:

<VirtualHost *:80>
...
    # health checks (readiness and liveness probes) are prioritized
    JkMount /api/v2/health/* ajp13prio

    # All other requests go to the default ajp13 worker
    JkMount /* ajp13
...
</VirtualHost>
  • You have accidentally discovered why liveness probes are bad. Usually better to just not use them, but Tomcat is one of the few places I have routinely seen actual deadlocks that only a liveness probe would catch, so you might be mildly doomed. One option I played with back then was to instead check "either access logs show a request processed in the last X minutes _or_ an HTTP probe works", only trying the HTTP probe if the logs were empty (a rough sketch of that idea follows these comments). Never finished working on it but you could try that tack. – coderanger Jun 14 '21 at 21:14
  • @coderanger I agree... 100%. But I'm not sure you can turn off the check that the Loadbalancer does (BackendConfig::healthcheck), at least we need to know when a POD is ready to accept incoming requests. This is our legacy Java application, it will spend a good 5-10 minutes starting up. Maybe by setting the BackendConfig::healthCheck::checkIntervalSec large enough... – Andreas L Jun 15 '21 at 05:57
  • Also true but the loadbalancer health checks won't get you into the cascading failure situation. They might pull an overloaded pod out of rotation, as will the readiness probes, but then it can finish whatever requests it was processing, come back up to speed, and go back in the rotation to get hammered again. This isn't good but it's better than Tomcat chain-restarting and zero requests getting through :) – coderanger Jun 15 '21 at 06:35
  • For us it's the same outcome: if a downstream resource, like the DB, loses its connection and starts to time out for a second, all pods get removed from the LB as soon as their thread pools are exhausted (within seconds). Then they come back one by one, no single pod can handle the load alone, and they quickly get removed from the LB again. – Andreas L Jun 15 '21 at 07:52
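
A rough sketch of that "log or HTTP" idea as an exec liveness probe (the log path, URL and the five-minute window are assumptions, not something from the setup described here):

    #!/bin/sh
    # Hypothetical exec liveness probe: treat the pod as live if the access log
    # shows recent traffic, and only fall back to an HTTP probe when it does not.
    LOG="/usr/local/tomcat/logs/localhost_access_log.$(date +%Y-%m-%d).txt"

    # If the access log was written to in the last 5 minutes, requests are
    # being processed, so consider the pod live without probing it.
    if [ -n "$(find "$LOG" -mmin -5 2>/dev/null)" ]; then
        exit 0
    fi

    # No recent traffic: fall back to the HTTP health endpoint via httpd.
    exec curl -sf -m 5 http://localhost/api/v2/health/liveness >/dev/null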

1 Answer


Just to wrap up, in case someone else ends up here running an old Apache setup in Kubernetes.

In the end, I added a second connector to Tomcat that speaks HTTP instead of AJP. The port of this connector is exposed from the container and used by the LB and by Kubernetes, so httpd is completely bypassed for the health checks. The port is (by default) blocked in the LB for external access.

server.xml

    <Connector
        port="8080"
        address="0.0.0.0"
        protocol="HTTP/1.1"
        enableLookups="false"
        maxThreads="5"
        minSpareThreads="2"
        acceptorThreadCount="1"
        acceptorThreadPriority="9"
        connectionTimeout="2000"
        keepAliveTimeout="-1"
        disableUploadTimeout="false"
        connectionUploadTimeout="3600000"
        redirectPort="8443"
        URIEncoding="UTF-8"
        maxPostSize="1" />
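
With maxThreads="5", a two-second connectionTimeout and maxPostSize="1", this connector keeps a small, dedicated thread pool for the health-check requests, separate from the AJP pool that application traffic exhausts.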

deployment.yaml


tomcat container:
          readinessProbe:
            httpGet:
              path: /health/readiness
              port: 8080
            initialDelaySeconds: 120
          livenessProbe:
            httpGet:
              path: /health/liveness
              port: 8080
          ports:
            - name: ajp
              containerPort: 8009
              protocol: TCP
            - name: http-prio
              containerPort: 8080
              protocol: TCP

service.yaml:

apiVersion: v1
kind: Service
metadata:
...
  annotations:
    cloud.google.com/backend-config: '{"default": "app-backendconfig"}'
spec:
  ports:
  - name: http
    port: 80
  - name: http-prio
    port: 8080

...

backendconfig.yaml

apiVersion: cloud.google.com/v1
kind: BackendConfig
metadata:
...
spec:
...
  healthCheck:
    type: HTTP
    requestPath: /health/readiness
    port: 8080
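
With this in place, both the kubelet probes and the GCE load balancer's BackendConfig health check hit the Tomcat HTTP connector on port 8080 directly, so they no longer depend on a free httpd worker.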
  • Curious if this ended up solving your problem. We're experiencing this I think. We notice that when we deploy a code change, sometimes CPU usage spikes to 100% on the new pods and Kubernetes then decides they're unhealthy a few seconds later and kills them (presumably before HPA can decide to scale up the deployment). We don't have monitoring in place to be able to tell if it's connection pools being exhausted, but we do know that we share connection pools between app code and health check code right now. – Matt Welke Jan 13 '22 at 21:45
  • Looking at what's running in prod now, I can see that the `liveness` probe is commented out in the deployment file, but the rest is as I wrote above. And the webpage has been stable since the move. We are trying to get more and more traffic out of the monolith, but it's currently at 20 pods serving about 70k rpm. We never entered this state again after the change described above. – Andreas L Jan 15 '22 at 09:29
  • Not much of a production-ready monitoring setup, but if you try to trigger this in a test env, you can port-forward to your httpd using `kubectl port-forward deployment/[YOUR DEPLOYMENT] 8080:80` and then curl httpd's server status with `curl http://localhost:8080/server-status` (this assumes mod_status is enabled; a sketch of that config follows). – Andreas L Jan 15 '22 at 09:33
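
For the `/server-status` page mentioned above to exist, mod_status has to be configured in httpd. A minimal sketch, assuming the status module is loaded (the access restriction is an assumption as well):

    <IfModule status_module>
        ExtendedStatus On
        <Location "/server-status">
            SetHandler server-status
            # kubectl port-forward traffic arrives via loopback inside the pod.
            Require local
        </Location>
    </IfModule>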