I have a local cluster of three CentOS servers with Keepalived installed on each one. I ran a benchmark with ab like this:
ab -c 1000 -n 100000 -r host

In the middle of the benchmark I power off the master server, and Keepalived moves the floating IP to one of the backup servers. This takes a little time, so some requests fail. How can I minimize this downtime? And is there any way to design a cluster that has no downtime at all when one node is taken down?

This is my Keepalived configuration:

! Configuration File for keepalived

global_defs {
   notification_email {
     user@localhost
   }
   notification_email_from root@localhost
   smtp_server 127.0.0.1
   smtp_connect_timeout 30
   router_id LVS_DEVEL
}

vrrp_script health_check {
  script       "curl host"
  interval 2   # check every 2 seconds
  fall 2       # require 2 failures for KO
  rise 2       # require 2 successes for OK
}

vrrp_instance VI_1 {
    state BACKUP
    interface enp0s3
    virtual_router_id 51
    priority 100
    advert_int 1
    authentication {
        auth_type PASS
        auth_pass password
    }
    virtual_ipaddress {
        <host ip>
    }

    track_script {
        health_check
    }
}    

And here is the result of my benchmark:

Concurrency Level:      1000
Time taken for tests:   25.502 seconds
Complete requests:      100000
Failed requests:        7618
   (Connect: 0, Receive: 2539, Length: 2539, Exceptions: 2540)
Write errors:           0
Total transferred:      13644540 bytes
HTML transferred:       2241603 bytes
Requests per second:    3921.28 [#/sec] (mean)
Time per request:       255.019 [ms] (mean)
Time per request:       0.255 [ms] (mean, across all concurrent requests)
Transfer rate:          522.50 [Kbytes/sec] received

This suggests that it takes nearly 2 seconds to move the virtual IP and resume handling requests. What can I do to minimize this time, and ideally have no downtime at all, if that is possible?
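The question does not say how the 2-second figure was derived. One plausible reading, sketched here as an assumption, is that the number of failed requests was divided by the mean request rate from the ab report above. This is only a rough estimate: the mean rate already includes the outage window, and ab may count a single failed request under more than one error category.

```python
# Figures copied from the ab output above.
failed_requests = 7618
requests_per_second = 3921.28  # mean rate over the whole run

# Rough outage estimate: how long it would take to issue that many
# requests at the mean rate. Assumes failures are concentrated in
# one contiguous window during failover.
estimated_outage_s = failed_requests / requests_per_second

print(f"estimated outage: {estimated_outage_s:.2f} s")
```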

Jason Martin
Mairon

1 Answer


It is hard to avoid downtime entirely, even with a hardware load balancer: it takes time to detect that the master is down and to migrate the VIP.

You can minimize the downtime by adjusting the Keepalived heartbeat frequency (advert_int, in seconds).

The failover from the MASTER to the BACKUP is triggered when the BACKUP server doesn't receive a VRRP advertisement from the MASTER for roughly 3x the period defined in the "advert_int" option.
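To make the timing concrete, here is a minimal sketch of the VRRPv2 timer from RFC 3768, which adds a small priority-dependent skew on top of the 3x rule. The function name is hypothetical, and the priority of 100 is taken from the configuration in the question on the assumption that the backups use the same value. In VRRPv3 the skew scales with the advertisement interval instead, so those numbers would differ.

```python
def master_down_interval(advert_int: float, priority: int) -> float:
    """Seconds a VRRPv2 backup waits without advertisements before taking over.

    RFC 3768: Master_Down_Interval = 3 * Advertisement_Interval + Skew_Time,
    where Skew_Time = (256 - Priority) / 256 seconds.
    """
    skew_time = (256 - priority) / 256
    return 3 * advert_int + skew_time


# Compare a few advertisement intervals for a backup with priority 100.
for advert_int in (1, 0.5, 0.2):
    wait = master_down_interval(advert_int, priority=100)
    print(f"advert_int={advert_int}: takeover after about {wait:.2f} s")
```

This is only the detection delay. The time for clients and switches to learn the new location of the VIP comes on top of it.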

Try setting a low advert_int (less than 1 second), but be careful not to trigger spurious failovers from ordinary network delays.

You can also set up session persistence/replication at the application layer so that users are not affected by the failover.

hnajib
  • This is spot on. There's no way for a network device failure to be error-free unless it has the cooperation of the upstream client, such that it retries requests that fail. – Jason Martin May 31 '17 at 13:42
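As an illustration of the client cooperation mentioned in the comment above, here is a minimal retry sketch with exponential backoff. The helper name `call_with_retries` and its defaults are hypothetical, not part of any library. Retrying blindly is only safe for idempotent requests.

```python
import time


def call_with_retries(request, attempts=5, base_delay=0.2, sleep=time.sleep):
    """Call request() until it succeeds or the attempts run out.

    Network-level failures (connection reset, refused, timed out) are
    subclasses of OSError in Python, so those are what gets retried.
    """
    last_error = None
    for attempt in range(attempts):
        try:
            return request()
        except OSError as exc:
            last_error = exc
            if attempt < attempts - 1:
                # Back off: 0.2 s, 0.4 s, 0.8 s, ... with the defaults.
                sleep(base_delay * 2 ** attempt)
    raise last_error
```

With these defaults the client waits up to 3 seconds in total across retries, which would cover a failover window of about 2 seconds like the one reported in the question. A caller might wrap an HTTP call as `call_with_retries(lambda: urllib.request.urlopen(url, timeout=1).read())`, where `url` points at the VIP.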