We are using Istio 1.8.1 and have started using a headless service to get direct pod-to-pod communication working with Istio mTLS. This all works fine, but we have recently noticed that sometimes, after killing one of our pods, we get 503 "no healthy upstream" errors for a very long time afterwards (many minutes). If we go back to a 'normal' (ClusterIP) service we get a few 503 errors and the problem resolves very quickly, but then we can't direct requests to a specific pod, which we need to do.
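For context, this is the shape of service we mean: a Service with `clusterIP: None`, so DNS resolves to the individual pod IPs rather than a virtual IP. This is a minimal sketch; the service name, selector, and port are placeholders, not our actual manifest:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: my-headless-svc        # placeholder name
spec:
  clusterIP: None              # headless: DNS returns individual pod IPs
  selector:
    app: my-app                # placeholder selector
  ports:
  - name: http
    port: 8080
```

Clients then address a specific pod directly, e.g. `pod-0.my-headless-svc.<namespace>.svc.cluster.local`, which is the per-pod routing we rely on.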
We have traced the traffic of the Envoy sidecar container using `kubectl sniff` and can see that existing connections to the killed pod's IP are kept open for a long period after the pod is gone, and that new connections are even attempted to that IP.
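In case it helps anyone reproduce this, the capture was along these lines (using the ksniff kubectl plugin; pod and namespace names below are placeholders):

```sh
# Capture traffic from the istio-proxy sidecar of one of our pods
# and write it to a pcap for inspection in Wireshark
kubectl sniff my-pod-abc123 -n my-namespace -c istio-proxy -o capture.pcap
```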
We have circuit-breaker configuration on a DestinationRule for the service in question, but that doesn't seem to have helped either. We have also tried setting `PILOT_ENABLE_EDS_FOR_HEADLESS_SERVICES`, which seemed to reduce the 503 errors, but strangely interfered with direct pod-to-pod IP addressing.
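Our DestinationRule is roughly like the sketch below (host and thresholds are illustrative placeholders, not our exact values); the idea was that outlier detection would eject the dead endpoint quickly:

```yaml
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: my-headless-svc        # placeholder name
spec:
  host: my-headless-svc.my-namespace.svc.cluster.local
  trafficPolicy:
    connectionPool:
      tcp:
        maxConnections: 100    # illustrative limit
    outlierDetection:
      consecutive5xxErrors: 3  # eject after 3 consecutive 5xx responses
      interval: 10s
      baseEjectionTime: 30s
      maxEjectionPercent: 100
```

For the pilot flag, we set it on the istiod deployment with `kubectl -n istio-system set env deployment/istiod PILOT_ENABLE_EDS_FOR_HEADLESS_SERVICES=true`.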
Does anyone have any suggestions on why we are receiving the 503 errors, or how to avoid them?