
I'm trying to enable retries in a Zuul gateway. Everything works locally, but when I deploy the gateway to PCF with zuul.retryable=true, I get the following error:

{
  "timestamp": 1524669167094,
  "status": 500,
  "error": "Internal Server Error",
  "exception": "com.netflix.zuul.exception.ZuulException",
  "message": "COMMAND_EXCEPTION"
}

The related logs give me the following exception details:

com.netflix.zuul.exception.ZuulException: Forwarding error
Caused by: com.netflix.hystrix.exception.HystrixRuntimeException: spring-demo failed and no fallback available.
Caused by: org.apache.http.NoHttpResponseException: spring-demo.example.com:443 failed to respond

I've tested spring-demo.example.com directly and it responds correctly (200) within 200 ms. Zuul also gets a valid response when I remove the zuul.retryable property, although then it doesn't retry on error status codes or timeouts.

When I run locally, I can see the RibbonLoadBalancedRetryPolicy try the different instances on a timeout or a 500 response, so the error only occurs in PCF. I've verified that the instances show up in the PCF Eureka, and I've also tried increasing the connect, read, and Hystrix timeouts.
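
For readers unfamiliar with the timeouts mentioned above, here is a rough sketch of where they are usually set in application.yml. The property names are the standard Ribbon and Hystrix ones; the values are illustrative assumptions, not the ones tried in this deployment. The usual guidance, as I understand it, is to keep the Hystrix timeout above the worst-case total of all Ribbon attempts.

```yaml
# Illustrative values only (assumed for this sketch, not taken from the question).
ribbon:
  ConnectTimeout: 1000   # ms allowed to establish a connection, per attempt
  ReadTimeout: 3000      # ms allowed to wait for a response, per attempt

hystrix:
  command:
    default:
      execution:
        isolation:
          thread:
            # Should exceed the total time Ribbon may spend across all retries.
            # With MaxAutoRetries=1 and MaxAutoRetriesNextServer=5 that is, by my
            # reading, up to (1+1) * (5+1) = 12 attempts of up to 4000 ms each.
            timeoutInMilliseconds: 60000
```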


Here's the service layout:

  • 2 instances of "working" app connected to Eureka as "spring-demo"
  • 2 instances of "broken" app connected to Eureka as "spring-demo" (times out or returns 500)
  • Zuul connected to Eureka

Zuul application.yml:

zuul:
  ignoredServices: '*'
  ignoredPatterns: '/**/actuator/**'
  retryable: true
  routes:
    spring-demo: '/spring-demo/**'

ribbon:
  retryableStatusCodes: 404, 500
  MaxAutoRetries: 1
  MaxAutoRetriesNextServer: 5
  OkRetryOnConnectionErrors: true

Gradle dependency versions:

  • Spring Boot 1.5.12.RELEASE
  • Spring Cloud Edgware.SR3
  • Pivotal Services 1.6.3.RELEASE
  • spring-boot-starter-web
  • spring-boot-starter-actuator
  • spring-cloud-starter-netflix-zuul
  • spring-retry
  • spring-cloud-services-starter-service-registry
  • spring-cloud-services-starter-circuit-breaker
Jeff
  • `NoHttpResponseException` means Zuul sent an HTTP request to your application, but the response never came back. If you enable wire logging for your HTTP client, you can see this. Usually, I've seen this caused by a misconfigured load balancer in front of Cloud Foundry. Zuul pools connections, and load balancers sometimes drop those connections silently. When this happens, Zuul tries to reuse the bad connection and sends the request, which goes off into the ether and never gets a response. With retry enabled, you'd usually see the second try succeed as it gets a new connection. – Daniel Mikusa Apr 27 '18 at 12:41
  • If you have the option to use Cloud Foundry's container to container networking, I would suggest that. It simplifies things and often makes problems like what I described go away. – Daniel Mikusa Apr 27 '18 at 12:43
  • When I check the logs of the called app in PCF, I don't see anything. It never gets a request. We're hoping to migrate to container-to-container networking as soon as possible so we may just revisit the problem when we can use that. Thanks. – Jeff May 02 '18 at 11:31
  • If the request never gets to your app, you'd need to talk with your CF Operator & possibly your network team to trace the request and see where it's being dropped. That or use C2C networking, which is a superior solution in most cases anyway. – Daniel Mikusa May 02 '18 at 11:39
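
The wire logging suggested in the comments can be switched on from the gateway's application.yml. This is a minimal sketch, assuming the gateway uses Apache HttpClient 4.x (the `org.apache.http` package in the stack trace suggests so) and Spring Boot's standard `logging.level` properties:

```yaml
logging:
  level:
    # Logs the raw bytes written to and read from each connection (very verbose).
    org.apache.http.wire: DEBUG
    # Logs request and response headers only; a lighter alternative.
    org.apache.http.headers: DEBUG
```

With this enabled, a request that is written to a pooled connection but never followed by any response bytes should be visible in the gateway log just before the `NoHttpResponseException`. Wire logging can expose sensitive payloads, so it is best left on only while debugging.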

0 Answers