Load balancing problems with Spring Cloud Kubernetes

Question

We have Spring Boot services running in Kubernetes and use the Spring Cloud Kubernetes Load Balancer with RestTemplate to call other Spring Boot services. The main reason is historical: we previously ran our services on EC2 with Eureka for service discovery, and after the migration we kept the Spring discovery client and client-side load balancing in place (updating dependencies so that they work with the Spring Cloud Kubernetes project).

The problem is that when one of the target pods goes down, requests fail repeatedly for a period of time with java.net.NoRouteToHostException, i.e. the Spring load balancer is still trying to send to that pod.

So I have a few questions on this:

1. Shouldn't the target instance get removed automatically when this happens? So it might fail once, but after that the target pod list will be repaired?
2. If not, is there some other configuration we need to add to handle this, e.g. retry or a circuit breaker?
3. A more general question: what benefit does Spring's client-side load balancing bring on Kubernetes? Without it, our service would still be able to call other services using the built-in Kubernetes Service load balancing, which should handle pods going down automatically. The Spring documentation also talks about being able to switch from POD mode to SERVICE mode (https://docs.spring.io/spring-cloud-kubernetes/docs/current/reference/html/index.html#loadbalancer-for-kubernetes). But isn't SERVICE mode just what Kubernetes does automatically? I'm wondering whether the simplest solution is to remove the Spring load balancer altogether. What would we lose then?

If you remove it, you will need to code your own way of selecting the instance using service discovery. The issue might be related to caching: have you tried shortening the cache TTL? As an alternative, you could also try using Instance HealthCheck and disabling the main caching mechanism (health checks have their own caching): https://docs.spring.io/spring-cloud-commons/docs/current/reference/html/#instance-health-check-for-loadbalancer — OlgaMaciaszek, Dec 16 '21 at 12:53
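To illustrate why caching can produce the repeated failures described in the question, here is a minimal, framework-free sketch of an instance list cached with a TTL. It is a toy model, not Spring Cloud LoadBalancer's actual implementation: the class name CachedInstanceList, the injected clock, and the evict() method are all hypothetical and exist only for this example.

```java
import java.util.List;
import java.util.function.LongSupplier;
import java.util.function.Supplier;

/** Toy model of a TTL-cached instance list (hypothetical, not a Spring class). */
final class CachedInstanceList {
    private final Supplier<List<String>> discovery; // stands in for a discovery lookup
    private final long ttlMillis;
    private final LongSupplier clock;               // injectable so no sleeping is needed
    private List<String> cached;
    private long loadedAt;

    CachedInstanceList(Supplier<List<String>> discovery, long ttlMillis, LongSupplier clock) {
        this.discovery = discovery;
        this.ttlMillis = ttlMillis;
        this.clock = clock;
    }

    /** Returns the cached list until the TTL has elapsed, then asks discovery again. */
    List<String> get() {
        long now = clock.getAsLong();
        if (cached == null || now - loadedAt >= ttlMillis) {
            cached = List.copyOf(discovery.get());
            loadedAt = now;
        }
        return cached;
    }

    /** Forces the next get() to query discovery, similar to evicting a cache entry. */
    void evict() {
        cached = null;
    }
}
```

Until the TTL elapses (or the entry is evicted), get() keeps returning a pod that discovery has already dropped, so every request routed to it fails. A shorter TTL narrows that window at the cost of more discovery lookups.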

score 1 · Answer 1 · answered Dec 21 '21 at 11:51

An update on this: we had the spring-retry dependency in place, but the retry was not working because by default it only applies to GET requests, and most of our calls are POSTs (which in our case are safe to repeat). Adding the configuration spring.cloud.loadbalancer.retry.retryOnAllOperations: true fixed this, so most of these failures should now be avoided by the retry using an alternative instance on the second attempt.
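The idea of retrying on an alternative instance can be sketched without any framework. The following is only an illustration of the technique, not how Spring Cloud LoadBalancer implements it; the class NextInstanceRetry and its parameters are hypothetical names introduced for this example.

```java
import java.util.List;
import java.util.function.Function;

/** Hypothetical helper: retry a call against the next instance when one fails. */
final class NextInstanceRetry {

    /**
     * Tries instances in order, starting at index 'start' and wrapping around,
     * until one call succeeds or maxAttempts attempts have been made.
     */
    static <T> T call(List<String> instances, int start, int maxAttempts,
                      Function<String, T> request) {
        if (instances.isEmpty()) {
            throw new IllegalArgumentException("no instances available");
        }
        RuntimeException last = null;
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            String instance = instances.get((start + attempt) % instances.size());
            try {
                return request.apply(instance);
            } catch (RuntimeException e) {
                last = e; // e.g. a wrapped connection failure; move on to the next instance
            }
        }
        throw new IllegalStateException("all " + maxAttempts + " attempts failed", last);
    }
}
```

With two instances where the first is unreachable, a call with maxAttempts of 2 fails once and then succeeds against the second instance. Note that this is only safe when the request can be repeated without side effects, which is why retrying POSTs has to be enabled explicitly.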

We have also added a RetryListener that clears the load balancer cache for the service on certain connection exceptions:

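import java.net.NoRouteToHostException;

// Assumption: ConnectTimeoutException is Apache HttpClient's; adjust to the HTTP client you use
import org.apache.http.conn.ConnectTimeoutException;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.beans.factory.BeanFactory;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.cache.Cache;
import org.springframework.cloud.client.ServiceInstance;
import org.springframework.cloud.client.loadbalancer.reactive.ReactiveLoadBalancer;
import org.springframework.cloud.loadbalancer.blocking.retry.BlockingLoadBalancedRetryFactory;
import org.springframework.cloud.loadbalancer.cache.LoadBalancerCacheManager;
import org.springframework.cloud.loadbalancer.core.CachingServiceInstanceListSupplier;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.retry.RetryCallback;
import org.springframework.retry.RetryContext;
import org.springframework.retry.RetryListener;

@Configuration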
public class RetryConfig {

    private static final Logger logger = LoggerFactory.getLogger(RetryConfig.class);
    
    // Look up LoadBalancerCacheManager through the bean factory because it cannot be autowired:
    // it is declared with 'autowireCandidate = false' in LoadBalancerCacheAutoConfiguration
    @Autowired
    private BeanFactory beanFactory;
    
    @Bean 
    public CacheClearingLoadBalancedRetryFactory cacheClearingLoadBalancedRetryFactory(ReactiveLoadBalancer.Factory<ServiceInstance> loadBalancerFactory) {
        return new CacheClearingLoadBalancedRetryFactory(loadBalancerFactory);
    }
    
    // Extension of the default bean that defines a retry listener
    public class CacheClearingLoadBalancedRetryFactory extends BlockingLoadBalancedRetryFactory {

        public CacheClearingLoadBalancedRetryFactory(ReactiveLoadBalancer.Factory<ServiceInstance> loadBalancerFactory) {
            super(loadBalancerFactory);
        }

        @Override
        public RetryListener[] createRetryListeners(String service) {
            
            RetryListener cacheClearingRetryListener = new RetryListener() {
                
                @Override
                public <T, E extends Throwable> boolean open(RetryContext context, RetryCallback<T, E> callback) { return true; }
                
                @Override
                public <T, E extends Throwable> void close(RetryContext context, RetryCallback<T, E> callback, Throwable throwable) {}

                @Override
                public <T, E extends Throwable> void onError(RetryContext context, RetryCallback<T, E> callback, Throwable throwable) {
                    
                    logger.warn("Retry for service {} picked up exception: context {}, throwable class {}", service, context, throwable.getClass());
                    
                    if (throwable instanceof ConnectTimeoutException || throwable instanceof NoRouteToHostException) {
                        try {
                            LoadBalancerCacheManager loadBalancerCacheManager = beanFactory.getBean(LoadBalancerCacheManager.class);
                            Cache loadBalancerCache = loadBalancerCacheManager.getCache(CachingServiceInstanceListSupplier.SERVICE_INSTANCE_CACHE_NAME);
                            if (loadBalancerCache != null) {
                                boolean result = loadBalancerCache.evictIfPresent(service);
                                logger.warn("Load Balancer Cache evictIfPresent result for service {} is {}", service, result);
                            }
                        } catch (Exception e) {
                            logger.error("Failed to clear load balancer cache", e);
                        }
                    }
                }                               
            };
                
            return new RetryListener[] { cacheClearingRetryListener };              
        }
    }
}

Are there any issues with this approach? Could something like this be added to the built-in functionality?

I had a similar issue (in a simpler scenario) and adding the Spring Retry dependency solved it. — Gabriel Aramburu, Sep 20 '22 at 13:01

Harsh Manvar · Answer 2 · 2021-12-15T13:48:37.323

Shouldn't the target instance get removed automatically when this happens? So it might happen once but after that the target pod list will be repaired?

To resolve this issue, you have to use readiness and liveness probes in Kubernetes.

A readiness probe checks a health endpoint of your application at a configured interval. If the check fails, Kubernetes marks the pod as not ready to accept traffic, so no traffic goes to that pod (replica).

A liveness probe restarts your container if the check fails, so the pod comes up again; once the application returns a 200 response again, Kubernetes marks the pod as ready to accept traffic.

You can create a simple endpoint in the application that returns a 200 or 204 response, as needed.
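As a rough illustration of such an endpoint, here is a framework-free sketch using the JDK's built-in HTTP server. The path /healthz, the class name HealthEndpoint, and the choice of 503 for the unhealthy case are assumptions made for this example; in a Spring Boot application you would more likely expose a controller or the Actuator health endpoints instead.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.net.InetSocketAddress;
import java.util.function.BooleanSupplier;

/** Hypothetical minimal health endpoint for readiness/liveness probes. */
final class HealthEndpoint {

    /** 204 (No Content) when healthy, 503 (Service Unavailable) otherwise. */
    static int statusFor(boolean healthy) {
        return healthy ? 204 : 503;
    }

    /** Starts an HTTP server that answers GET /healthz based on the given check. */
    static HttpServer start(int port, BooleanSupplier healthy) throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress(port), 0);
        server.createContext("/healthz", exchange -> {
            int status = statusFor(healthy.getAsBoolean());
            exchange.sendResponseHeaders(status, -1); // -1 means no response body
            exchange.close();
        });
        server.start();
        return server;
    }
}
```

Kubernetes treats any HTTP status from 200 up to but not including 400 as a successful probe, so 200 and 204 both count as healthy, while 503 marks the probe as failed.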

Read more at: https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/

Make sure your applications use the Kubernetes Service to talk to each other.

Application 1 > Kubernetes service of App 2 > Application 2 PODs

To enable load balancing based on the Kubernetes Service name, use the following property. The load balancer then calls the application using its service address, for example service-a.default.svc.cluster.local

spring.cloud.kubernetes.loadbalancer.mode=SERVICE
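The service address in the example above follows the pattern service name, namespace, "svc", cluster domain. A tiny sketch of how such a host name is assembled is shown below; the ServiceAddress helper is hypothetical and is not part of Spring Cloud Kubernetes, and cluster.local is assumed to be the cluster domain.

```java
/** Hypothetical helper that builds the in-cluster DNS name of a Kubernetes Service. */
final class ServiceAddress {

    /** Joins the parts as <service>.<namespace>.svc.<clusterDomain>. */
    static String host(String service, String namespace, String clusterDomain) {
        return service + "." + namespace + ".svc." + clusterDomain;
    }
}
```

For service-a in the default namespace with the cluster domain cluster.local, this yields the address from the example. In SERVICE mode the client sends requests to that single name and Kubernetes decides which ready pod receives each one, whereas in POD mode the client picks an individual pod address itself.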

The most typical way to use Spring Cloud LoadBalancer on Kubernetes is with service discovery. If you have any DiscoveryClient on your classpath, the default Spring Cloud LoadBalancer configuration uses it to look up service instances. As a result, it only chooses from instances that are up and running. All that is needed is to annotate your Spring Boot application with @EnableDiscoveryClient to enable K8s-native service discovery.

References: https://stackoverflow.com/a/68536834/5525824
