I am trying to configure Envoy as a load balancer and am currently stuck on fallbacks. In my playground cluster I have 3 backend servers with Envoy as a front proxy. I generate some traffic against Envoy using siege and watch the responses. While doing this, I stop one of the backends.

What I want: Envoy should resend requests that failed on the stopped backend to a healthy one, so that I never get any 5xx responses.

What I get: when I stop a backend I receive some 503 responses, and then everything returns to normal.

What am I doing wrong? I thought fallback_policy would provide this functionality, but it doesn't seem to work.

Here is my config file:

node:
  id: LoadBalancer_01
  cluster: HighloadCluster
admin:
  access_log_path: /var/log/envoy/admin_access.log
  address:
    socket_address: { address: 0.0.0.0, port_value: 9901 }
static_resources:
  listeners:
    - name: http_listener
      address:
        socket_address: { address: 0.0.0.0, port_value: 80 }
      filter_chains:
        - filters:
            - name: envoy.http_connection_manager
              typed_config:
                "@type": type.googleapis.com/envoy.config.filter.network.http_connection_manager.v2.HttpConnectionManager
                stat_prefix: ingress_http
                codec_type: AUTO
                route_config:
                  name: request_route
                  virtual_hosts:
                    - name: local_service
                      domains: ["*"]
                      require_tls: NONE
                      routes:
                        - match: { prefix: "/" }
                          route:
                            cluster: backend_service
                            timeout: 1.5s
                            retry_policy:
                              retry_on: 5xx
                              num_retries: 3
                              per_try_timeout: 3s
                http_filters:
                  - name: envoy.router
                    typed_config:
                      "@type": type.googleapis.com/envoy.config.filter.http.router.v2.Router
                access_log:
                  - name: envoy.file_access_log
                    typed_config:
                      "@type": type.googleapis.com/envoy.config.accesslog.v2.FileAccessLog
                      path: /var/log/envoy/access.log
  clusters:
    - name: backend_service
      connect_timeout: 0.25s
      type: STATIC
      lb_policy: ROUND_ROBIN
      lb_subset_config:
        fallback_policy: ANY_ENDPOINT
      outlier_detection:
        consecutive_5xx: 1
        interval: 10s
      load_assignment:
        cluster_name: backend_service
        endpoints:
          - lb_endpoints:
              - endpoint:
                  address:
                    socket_address:
                      address: 1.1.1.1
                      port_value: 10000
              - endpoint:
                  address:
                    socket_address:
                      address: 2.2.2.2
                      port_value: 10000
              - endpoint:
                  address:
                    socket_address:
                      address: 3.3.3.3
                      port_value: 10000
      health_checks:
        - http_health_check:
            path: /api/liveness-probe
          timeout: 1s
          interval: 30s
          unhealthy_interval: 10s
          unhealthy_threshold: 2
          healthy_threshold: 1
          always_log_health_check_failures: true
          event_log_path: /var/log/envoy/health_check.log

1 Answer

TL;DR

You can use a circuit breaker (see the config example below) alongside your retry_policy and outlier_detection.

Explanation

Context

I have successfully reproduced your issue with your config (minus the health_checks part, which was not needed to reproduce the problem).

I ran Envoy and my backend (2 apps, load-balanced), and generated some traffic with hey (50 concurrent workers making requests for 10 seconds):

hey -c 50 -z 10s http://envoy:8080

Then I stopped one backend app around 5 seconds after the command started.

Result

Digging into the Envoy admin /stats endpoint, I noticed some interesting counters:

cluster.backend_service.upstream_rq_200: 17899
cluster.backend_service.upstream_rq_503: 28

cluster.backend_service.upstream_rq_retry_overflow: 28
cluster.backend_service.upstream_rq_retry_success: 3
cluster.backend_service.upstream_rq_total: 17930

There were indeed 28 503 responses when I stopped the backend app. Retries did work to some extent: 3 retries succeeded (upstream_rq_retry_success), but 28 other requests were not retried at all (upstream_rq_retry_overflow) and came back as 503. Why?

From the cluster stats docs:

upstream_rq_retry_overflow : Total requests not retried due to circuit breaking or exceeding the retry budget
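
In other words, the 28 extra 503s are requests that were never retried because the retry was not allowed. When a cluster has no circuit_breakers block (and no retry budget), Envoy applies its default thresholds, and the relevant one here is max_retries, which defaults to 3: at most 3 retries may be in flight at once, and anything beyond that overflows. Written out purely for illustration, the implicit defaults are roughly equivalent to:

    circuit_breakers:
      thresholds:
      - priority: "DEFAULT"
        max_connections: 1024       # Envoy default
        max_pending_requests: 1024  # Envoy default
        max_requests: 1024          # Envoy default
        max_retries: 3              # Envoy default: caps concurrent retries at 3

During the test, more than 3 requests needed a retry at the same moment, so the extra ones overflowed and were returned as 503s.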

Fix

To solve this, we can add a circuit breaker to the cluster (I have been generous with the max_requests, max_pending_requests and max_retries parameters for this example). The interesting part is the retry_budget.budget_percent value:

  clusters:
  - name: backend_service
    connect_timeout: 0.25s
    type: STRICT_DNS
    lb_policy: ROUND_ROBIN
    outlier_detection:
      consecutive_5xx: 1
      interval: 10s
    circuit_breakers:
      thresholds:
      - priority: "DEFAULT"
        max_requests: 0xffffffff
        max_pending_requests: 0xffffffff
        max_retries: 0xffffffff
        retry_budget:
          budget_percent:
            value: 100.0

From the retry_budget docs:

budget_percent: Specifies the limit on concurrent retries as a percentage of the sum of active requests and active pending requests. For example, if there are 100 active requests and the budget_percent is set to 25, there may be 25 active retries.

This parameter is optional. Defaults to 20%.

I set it to 100.0 so that up to 100% of active requests can have a retry in flight.

Running the example again with this new config, there is no upstream_rq_retry_overflow anymore, and thus no more 503 errors:

cluster.backend_service.upstream_rq_200: 17051

cluster.backend_service.upstream_rq_retry_overflow: 0
cluster.backend_service.upstream_rq_retry_success: 5
cluster.backend_service.upstream_rq_total: 17056

Note that if you run into upstream_rq_retry_limit_exceeded, you can try setting and increasing retry_budget.min_retry_concurrency (the default when not set is 3):

        retry_budget:
          budget_percent:
            value: 100.0
          min_retry_concurrency: 10
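
For completeness, here is how the circuit breaker and retry budget can be slotted into the cluster from your original config. This is only a sketch combining the STATIC cluster from the question (endpoints and outlier detection unchanged, health_checks omitted for brevity) with the thresholds used above; the route-level retry_policy stays where it is:

  clusters:
    - name: backend_service
      connect_timeout: 0.25s
      type: STATIC
      lb_policy: ROUND_ROBIN
      outlier_detection:
        consecutive_5xx: 1
        interval: 10s
      circuit_breakers:
        thresholds:
          - priority: "DEFAULT"
            # effectively lift the fixed limits; the retry budget below is what matters
            max_requests: 0xffffffff
            max_pending_requests: 0xffffffff
            max_retries: 0xffffffff
            retry_budget:
              budget_percent:
                value: 100.0
              min_retry_concurrency: 10
      load_assignment:
        cluster_name: backend_service
        endpoints:
          - lb_endpoints:
              - endpoint:
                  address:
                    socket_address: { address: 1.1.1.1, port_value: 10000 }
              - endpoint:
                  address:
                    socket_address: { address: 2.2.2.2, port_value: 10000 }
              - endpoint:
                  address:
                    socket_address: { address: 3.3.3.3, port_value: 10000 }

Note that the lb_subset_config/fallback_policy from the question is left out: subset fallback only applies to metadata-based subset load balancing, so it does not do anything for failed requests here.
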
norbjd