
What is the request retry upstream selection algorithm?

apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: rtest
spec:
  hosts:
    - '*'

  http:
    - name: test
      match:
        - uri:
            prefix: /
      route:
        - destination:
            host: myapp
            port:
              number: 8000

      retries:
        attempts: 20
        retryOn: 404,retriable-status-codes,connect-failure,reset
        retryRemoteLocalities: true
---
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: myapp
spec:
  host: myapp
  trafficPolicy:
    outlierDetection:
      consecutive5xxErrors: 100
      baseEjectionTime: 1s

    loadBalancer:
      simple: ROUND_ROBIN

I assumed that retries would go to each pod in turn, in round-robin (RR) fashion. I was wrong: sometimes several consecutive retries hit the same pod, ignoring RR balancing. How can I force Istio to send each retry to a different pod?

Istio version 1.9.3


Changing the load balancing algorithm has no effect on the retry order. This is the pod sequence for 50x retries, with 7 pods running on separate nodes:

s44sk
pgkp5
qdg58
5xjtp
5xjtp
5xjtp
blrl2
blrl2
blrl2
blrl2
blrl2
l74q4
l74q4
l74q4
l74q4
l74q4
l74q4
l74q4
l74q4
l74q4
l74q4
l74q4
l74q4
l74q4
l74q4
l74q4
l74q4
l74q4
blrl2
blrl2
s44sk
7jwdm
7jwdm
7jwdm
7jwdm
7jwdm
7jwdm
7jwdm
7jwdm
7jwdm
7jwdm
5xjtp
5xjtp
5xjtp
5xjtp

I am not sure that the order is preserved, but the distribution of retries across pods is far from round-robin.

Locality load balancing is in use, but I assume it should not affect how retries are distributed across pods under the RR algorithm.
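One way to test that assumption is to switch locality load balancing off explicitly and repeat the test. The following is a minimal sketch based on the `myapp` DestinationRule shown earlier; the only addition is `localityLbSetting.enabled: false`, and everything else is copied from the original config.

```yaml
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: myapp
spec:
  host: myapp
  trafficPolicy:
    outlierDetection:
      consecutive5xxErrors: 100
      baseEjectionTime: 1s
    loadBalancer:
      simple: ROUND_ROBIN
      # Explicitly disable locality-aware load balancing for this host,
      # so that only ROUND_ROBIN decides which pod is picked.
      localityLbSetting:
        enabled: false
```

If the retry distribution looks the same with this applied, locality is unlikely to be the cause.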

Jonas
  • Hello @Jonas, your config looks good. Are you sure that your traffic is not routed round robin? How did you check it? Round robin is the [default policy](https://istio.io/latest/docs/reference/config/networking/destination-rule/#LoadBalancerSettings-SimpleLB). How many pods do you have? Maybe some are dead? Could you change the policy to another one (e.g. `RANDOM`) and test again? Maybe the problem is not related to Istio. – Mikołaj Głodziak Jun 01 '21 at 13:29
  • Thank you for your reply. I have added more details to my question. – Jonas Jun 02 '21 at 07:06
  • Thanks for adding more info. You mentioned that you are using another load balancer. What is it, and where does it run? If you changed the load balancing algorithm and it had no effect, your problem could be related to that load balancer. Could you turn it off and test once again? – Mikołaj Głodziak Jun 02 '21 at 10:56
  • `istio-ingressgateway` sits in front on a NodePort, and only Istio is used for traffic routing. I was talking about this: https://istio.io/latest/docs/tasks/traffic-management/locality-load-balancing/ and I guess it should not disrupt RR balancing. – Jonas Jun 02 '21 at 11:20
  • What type of locality load balancing do you want to use: [Locality failover](https://istio.io/latest/docs/tasks/traffic-management/locality-load-balancing/failover/) or [Locality weighted distribution](https://istio.io/latest/docs/tasks/traffic-management/locality-load-balancing/distribute/)? Failover, I suppose. Make sure that you configure only one of them. Are your nodes in different zones? Did you try changing `NodePort` to `LoadBalancer` or `ServiceEntry`? – Mikołaj Głodziak Jun 02 '21 at 15:17
  • Locality is used with the default settings, just by setting the `topology.kubernetes.io/region` and `topology.kubernetes.io/zone` labels on the nodes. I must use `NodePort` with the `istio-ingressgateway`. – Jonas Jun 03 '21 at 09:57
  • Even if I set `spec.trafficPolicy.loadBalancer.localityLbSetting.enabled: false` on the `DestinationRule`, the results are the same. So I guess locality has nothing to do with the retry pod selection algorithm. – Jonas Jun 03 '21 at 10:29
  • How did you set up your K8s cluster? Did you use a cloud provider, or is it bare metal? Did you test the retry ratio from inside the cluster or from outside it? You should test from inside the cluster. Additionally, there is work in progress on support for `retryRemoteLocalities`: https://github.com/istio/istio/pull/22071 – Mikołaj Głodziak Jun 04 '21 at 13:03
  • I am using a bare-metal cluster. I am testing using application logs, and I can see all the applications involved in the retry sequence. I am using `retryRemoteLocalities`; this feature is already released. I need to experiment with tuning Envoy parameters directly. This is my related question: https://stackoverflow.com/questions/67823367/istio-envoyfilter-http-route-example – Jonas Jun 04 '21 at 13:24
  • Do I understand correctly that this cluster and all nodes are on one host? – Mikołaj Głodziak Jun 07 '21 at 10:56
  • The cluster contains many hardware nodes. – Jonas Jun 07 '21 at 13:32
  • Could you test your load balancer traffic from your cluster? (You need to be inside the cluster.) Are your applications connected to each other? – Mikołaj Głodziak Jun 09 '21 at 13:10
  • I don't know how you tested, and I don't know exactly what your expected output is. I need a test scenario to try to recreate the situation. Please add the output you received, extended with response codes. You will also need to add your expected state. What was your test scenario? In what order did you send the test requests? By the way, if a host is unavailable it can be excluded from the pool of available hosts. If the connection fails, the request will be retried round robin; however, you may see different results. – Mikołaj Głodziak Jun 10 '21 at 11:29
  • It could also be a problem with `outlierDetection`. See the [documentation](https://istio.io/latest/docs/reference/config/networking/destination-rule/#OutlierDetection). In your config you have `consecutive5xxErrors: 100` and `baseEjectionTime: 1s`. If I understand correctly, you need to increase `baseEjectionTime`. – Mikołaj Głodziak Jun 10 '21 at 12:29
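
The `outlierDetection` suggestion in the last comment can be sketched as follows. The `30s` value is an illustrative assumption, not a tested recommendation; the rest of the DestinationRule is unchanged from the question.

```yaml
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: myapp
spec:
  host: myapp
  trafficPolicy:
    outlierDetection:
      # Unchanged: a pod is ejected only after 100 consecutive 5xx errors.
      consecutive5xxErrors: 100
      # Assumed value; the original config uses 1s. A longer base ejection
      # time keeps an ejected pod out of the load balancing pool for longer.
      baseEjectionTime: 30s
    loadBalancer:
      simple: ROUND_ROBIN
```

Note that `consecutive5xxErrors` counts only 5xx responses, so a pod that returns only `404` (one of the `retryOn` conditions in the VirtualService) would not be ejected by this setting.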

0 Answers
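
Regarding the idea of tuning Envoy directly: Envoy's retry policy has a `previous_hosts` retry host predicate that rejects hosts already attempted for the current request, plus `host_selection_retry_max_attempts`, which bounds how often host selection is repeated. The following is an untested sketch, not a confirmed fix. It assumes that the retries are performed by `istio-ingressgateway` in the `istio-system` namespace with the default `istio: ingressgateway` label, and that the generated Envoy route is named `test` after the VirtualService route. The resource name `retry-previous-hosts` is hypothetical.

```yaml
apiVersion: networking.istio.io/v1alpha3
kind: EnvoyFilter
metadata:
  name: retry-previous-hosts      # hypothetical name
  namespace: istio-system         # assumes the gateway runs in istio-system
spec:
  workloadSelector:
    labels:
      istio: ingressgateway       # assumes the default gateway label
  configPatches:
    - applyTo: HTTP_ROUTE
      match:
        context: GATEWAY
        routeConfiguration:
          vhost:
            route:
              name: test          # assumed to match the VirtualService route name
      patch:
        operation: MERGE
        value:
          route:
            retry_policy:
              # Ask Envoy to pick again when the load balancer returns a
              # host that was already tried for this request.
              retry_host_predicate:
                - name: envoy.retry_host_predicates.previous_hosts
                  typed_config:
                    "@type": type.googleapis.com/envoy.extensions.retry.host.previous_hosts.v3.PreviousHostsPredicate
              # Upper bound on repeated host selection before Envoy accepts
              # the last selected host anyway (value is an assumption).
              host_selection_retry_max_attempts: 10
```

Istio may already generate a similar predicate by default, so it is worth inspecting the actual route with `istioctl proxy-config routes <gateway-pod> -n istio-system -o json` before and after applying a patch like this. If the retries are issued by a sidecar rather than the gateway, the `context` would be `SIDECAR_OUTBOUND` instead.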