Objective
I have a task to write an API Gateway and load balancer with the following objectives:
- The Gateway/LB should route requests to instances of a 3rd party service (no code change = client-side service discovery).
- Each service instance can process only a single request at a time; a concurrent request gets an immediate error response.
- Service response latency is 0-5 seconds. I can't cache the responses, so as I understand it a fallback is not an option for me. A timeout is not an option either, because latency is random and there is no guarantee of a better one on another instance.
My solution
Spring Boot with Spring Cloud Netflix: Zuul, Hystrix, Ribbon. I tried two approaches:
- Retry. Ribbon retry with a fixed interval or an exponentially increasing one. I failed to make it work; the best result I achieved is with MaxAutoRetriesNextServer: 1000, where Ribbon fires retries immediately and spams the downstream services.
- Circuit Breaker. Instead of adding an exponential wait period in Ribbon, I can open the circuit after a few failures and redirect requests to other instances. This is also not the best approach, for two reasons: a) having only a few instances, each with 0-5 sec latency, means all circuits open very quickly and the request fails to be served; b) my configuration doesn't work for some reason.
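To make the first approach concrete, here is a minimal sketch of the behavior I am after, in plain Java. It is not Ribbon's API: the class `BackoffRetry` and its methods are hypothetical names used only for illustration. The delay doubles after each failed attempt and is capped at a maximum.

```java
import java.util.concurrent.Callable;

/** Hypothetical helper, not part of Ribbon: retries a call with exponential backoff. */
final class BackoffRetry {

    /** Delay before the next try after failed attempt `attempt` (0-based): base * 2^attempt, capped. */
    static long delayMillis(int attempt, long baseMillis, long maxMillis) {
        // Cap the shift so the value cannot overflow for large attempt counts.
        int shift = Math.min(attempt, 20);
        return Math.min(maxMillis, baseMillis << shift);
    }

    /** Runs the action up to maxAttempts times, sleeping between failed attempts. */
    static <T> T call(Callable<T> action, int maxAttempts, long baseMillis, long maxMillis)
            throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        Exception last = null;
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            try {
                return action.call();
            } catch (Exception e) {
                last = e;
                // Wait before the next attempt, but not after the final one.
                if (attempt < maxAttempts - 1) {
                    Thread.sleep(delayMillis(attempt, baseMillis, maxMillis));
                }
            }
        }
        throw last;
    }
}
```

With a base of 100 ms and a cap of 2000 ms, the waits would be 100, 200, 400, 800, 1600, 2000, 2000, ... instead of the immediate retries I currently get.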
Question
How can I make Ribbon wait between retries? Or can I solve my problem with Circuit Breaker?
My configuration
The full config can be found on GitHub.
ribbon:
  eureka:
    enabled: false
  # Obsolete option (Apache HttpClient by default), but without this Ribbon doesn't retry against other instances
  restclient:
    enabled: true

hystrix:
  command:
    my-service:
      circuitBreaker:
        sleepWindowInMilliseconds: 3000
        errorThresholdPercentage: 50
        requestVolumeThreshold: 5
      execution:
        isolation:
          thread:
            timeoutInMilliseconds: 5500

my-service:
  ribbon:
    OkToRetryOnAllOperations: false
    NFLoadBalancerRuleClassName: com.netflix.loadbalancer.WeightedResponseTimeRule
    listOfServers: ${LIST_OF_SERVERS}
    ConnectTimeout: 500
    ReadTimeout: 4500
    MaxAutoRetries: 0
    MaxAutoRetriesNextServer: 1000
    retryableStatusCodes: 404,502,503,504
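
To sanity-check the numbers in this configuration, here is a small sketch that restates them as arithmetic. `ConfigMath` is a hypothetical helper, not a Hystrix or Ribbon class. The trip condition follows my understanding of Hystrix (both the request volume and the error percentage thresholds must be reached within the statistics window), and the worst-case formula is the one I believe Zuul uses to derive a Ribbon timeout; treat both as assumptions.

```java
/** Hypothetical helper that restates the configuration values as arithmetic. */
final class ConfigMath {

    /** True when both the volume threshold and the error percentage threshold are reached. */
    static boolean circuitOpens(int totalRequests, int failedRequests,
                                int requestVolumeThreshold, int errorThresholdPercentage) {
        if (totalRequests < requestVolumeThreshold) {
            return false; // too few requests in the window to judge
        }
        int errorPercentage = (int) ((double) failedRequests / totalRequests * 100);
        return errorPercentage >= errorThresholdPercentage;
    }

    /** Assumed worst-case time Ribbon may spend on one request, retries included. */
    static long ribbonWorstCaseMillis(long connectTimeout, long readTimeout,
                                      int maxAutoRetries, int maxAutoRetriesNextServer) {
        return (connectTimeout + readTimeout)
                * (maxAutoRetries + 1)
                * (maxAutoRetriesNextServer + 1);
    }
}
```

Under these assumptions, 3 failures out of 5 requests are enough to open the circuit for 3 seconds, and the Ribbon retry budget of (500 + 4500) * 1 * 1001 ms is far larger than the 5500 ms Hystrix timeout, so the Hystrix command would time out long before the retries are exhausted.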
Tests
To check your assumptions, you can play with the test on GitHub, which simulates single-threaded service instances with different latencies.
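
For readers who do not want to open the repository, a simulated instance can be sketched roughly as follows. This is not the code from the GitHub test: `SingleRequestInstance` is a hypothetical name, and the 503 status for a busy instance is an assumption (the real service only guarantees "an immediate error response").

```java
import java.util.concurrent.Semaphore;

/** Hypothetical stand-in for one instance: one request at a time, otherwise an immediate error. */
final class SingleRequestInstance {

    // A single permit models the single processing slot of the instance.
    private final Semaphore slot = new Semaphore(1);

    /** Returns 200 after running the work, or 503 at once if a request is already in flight. */
    int handle(Runnable work) {
        if (!slot.tryAcquire()) {
            return 503; // busy: fail fast instead of queueing (assumed status code)
        }
        try {
            work.run(); // in a real test this would sleep for a random 0-5 seconds
            return 200;
        } finally {
            slot.release();
        }
    }
}
```

A test can then start several such instances, send overlapping requests through the gateway, and count how many of them end with an error.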