Informix HDR failover doesn't work when primary DB server and Connection Manager/ICM are terminated simultaneously then ICM is brought back online

Question

Automatic switchover is configured in a primary/secondary HDR server setup.

The HDR cluster would be connected and replication is on. If the service is stopped gracefully on the primary server (NODE-A) or the VM is terminated, failover happens and the secondary server (NODE-B) becomes primary.

The following works ok as long as Connection Manager/ICM is up:

service informix stop

or

onmade -yuk

or simply shutdown/terminate the primary DB node (NODE-A)

However, if in the case of a DR event where both the primary DB server (NODE-A) and Connection Manager/ICM went down at the same time, then the Connection Manager machine was brought back up, it does not failover to the secondary node (NODE-B)

On the Informix servers awwdst10a and awwdst10b I have the following replication parameters

DRAUTO          3
DRINTERVAL      0
DRTIMEOUT       30
HA_ALIAS
HA_FOC_ORDER    
DRLOSTFOUND     /logs/informix/t10/
DRIDXAUTO       1
LOG_INDEX_BUILDS 1
SDS_ENABLE      0
SDS_TIMEOUT     20
SDS_PAGING
UPDATABLE_SECONDARY 0
FAILOVER_CALLBACK
TEMPTAB_NOLOG   1

The sqlhosts file for both Informix servers is:

cluster_1 group - - i=10
awwdst10a_t10 onsoctcp awwdst10a informix_t10 g=cluster_1
awwdst10b_t10 onsoctcp awwdst10b informix_t10 g=cluster_1
sla_cluster_1 onsoctcp localhost informix_t10

Replication is setup and working, on the secondary awwdst10b I run:

onstat -g cluster 

IBM Informix Dynamic Server Version 12.10.FC10 -- Read-Only (Sec) -- Up 02:42:00 -- 1556768 Kbytes

Primary Server:awwdst10a_t10
Index page logging status: Enabled
Index page logging was enabled at: 2018/06/06 09:47:35


Server        ACKed Log    Supports     Status
              (log, page)  Updates
awwdst10b_t10 54494,1135   No           ASYNC(HDR),Connected,On

on the primary awwdst10a I run:

$ onstat -g cluster

IBM Informix Dynamic Server Version 12.10.FC10 -- On-Line (Prim) -- Up 02:37:16 -- 1556768 Kbytes

Primary Server:awwdst10a_t10
Current Log Page:54494,1135
Index page logging status: Enabled
Index page logging was enabled at: 2018/06/06 09:47:35


Server        ACKed Log    Applied Log  Supports     Status
              (log, page)  (log, page)  Updates 
awwdst10b_t10 54494,1135   54494,1135   No      ASYNC(HDR),Connected,On

So replication is working and the servers are synchronized.

On a third machine awguht10 I setup Connection Manager/ICM with the following parameters:

NAME whics-t10
DEBUG 1
LOG 1
LOGFILE /logs/informix/cmlog
CM_TIMEOUT 60
EVENT_TIMEOUT 60
SECONDARY_EVENT_TIMEOUT 60
SQLHOSTS local

CLUSTER cluster_1
{
   INFORMIXSERVER cluster_1
   SLA sla_cluster_1     DBSERVERS=primary
   FOC ORDER=ENABLED \
       PRIORITY=1
   #CMALARMPROGRAM $INFORMIXDIR/etc/cmalarmprogram.sh
}

And the sqlhosts file for the connection manager:

cluster_1 group - - i=10
awwdst40a_t40 onsoctcp awwdst40a informix_t10 g=cluster_1
awwdst40b_t40 onsoctcp awwdst40b informix_t10 g=cluster_1
sla_cluster_whics onsoctcp localhost informix_t10

Now I forcefully bring down awwdst10a an I get the following on the online log of Informix server awwdst10b:

Mon Oct 22 22:46:30 2018

22:46:30  Insufficient free huge pages in /proc/meminfo for shared memory segment.
          Requested: 56774656 bytes.  Available: 0 bytes.
          The default memory page size will be used.
22:46:30  Requested shared memory segment size rounded from 55443KB to 55444KB
22:46:30  Segment locked: addr=0x73877000, size=56774656
22:46:30  Successfully added a bufferpool of page size 2K.

22:46:30  Insufficient free huge pages in /proc/meminfo for shared memory segment.
          Requested: 107974656 bytes.  Available: 0 bytes.
          The default memory page size will be used.
22:46:30  Requested shared memory segment size rounded from 105443KB to 105444KB
22:46:30  Segment locked: addr=0x76e9c000, size=107974656
22:46:30  Successfully added a bufferpool of page size 4K.

22:46:30  Insufficient free huge pages in /proc/meminfo for shared memory segment.
          Requested: 210374656 bytes.  Available: 0 bytes.
          The default memory page size will be used.
22:46:30  Requested shared memory segment size rounded from 205443KB to 205444KB
22:46:30  Segment locked: addr=0x7d595000, size=210374656
22:46:30  Successfully added a bufferpool of page size 8K.

22:46:30  Insufficient free huge pages in /proc/meminfo for shared memory segment.
          Requested: 415174656 bytes.  Available: 0 bytes.
          The default memory page size will be used.
22:46:30  Requested shared memory segment size rounded from 405443KB to 405444KB
22:46:30  Segment locked: addr=0x89e36000, size=415174656
22:46:30  Successfully added a bufferpool of page size 16K.
22:46:30  Warning: ONCONFIG dump directory (DUMPDIR) '/dump/informix/t10' has insecure permissions
22:46:30  Event alarms enabled.  ALARMPROG = '/usr/informix/etc/no_log.sh'
22:46:30  Booting Language <c> from module <>
22:46:30  Loading Module <CNULL>
22:46:30  Booting Language <builtin> from module <>
22:46:30  Loading Module <BUILTINNULL>
22:46:36  Affinitied VP 1 to phys proc 1
22:46:36  DR: DRAUTO is 3 (CMSM)
22:46:36  DR: ENCRYPT_HDR is 0 (HDR encryption Disabled)
22:46:36  Event notification facility epoll enabled.
22:46:36  Trusted host cache successfully built:/etc/hosts.equiv.
22:46:36  CCFLAGS2 value set to 0x200
22:46:36  SQL_FEAT_CTRL value set to 0x8008
22:46:36  SQL_DEF_CTRL value set to 0x4b0
22:46:36  IBM Informix Dynamic Server Version 12.10.FC10 Software Serial Number AAA#B000000
22:46:37  Performance Advisory: The current size of the physical log buffer is smaller than recommended.
22:46:37   Results: Transaction performance might not be optimal.
22:46:37   Action: For better performance, increase the physical log buffer size to 128.
22:46:38  IBM Informix Dynamic Server Initialized -- Shared Memory Initialized.

22:46:38  Started 1 B-tree scanners.
22:46:38  B-tree scanner threshold set at 5000.
22:46:38  B-tree scanner range scan size set to -1.
22:46:38  B-tree scanner ALICE mode set to 6.
22:46:38  B-tree scanner index compression level set to med.
22:46:38  Physical Recovery Started at Page (3:164234).
22:46:38  Physical Recovery Complete: 0 Pages Examined, 0 Pages Restored.
22:46:38  DR: Trying to connect to primary server = awwdst10a_t10
22:48:39  DR: Cannot connect to primary server
22:48:39  DR: Turned off on secondary server
22:48:39  Dataskip is now OFF for all dbspaces
22:48:39  Restartable Restore has been ENABLED
22:48:39  Recovery Mode
23:23:50  Booting Language <spl> from module <>
23:23:50  Loading Module <SPLNULL>
23:24:12  CM:Connection manager whics-t10 registered with the server

Connection Manager - post failure

Mon Oct 22 23:23:44 2018
23:23:44 IBM Informix  Connection Manager
23:23:44 IBM Informix CSDK Version 4.10, IBM Informix-ESQL Version 4.10.FC10 
23:23:44 Build Number:  N046
23:23:44 Build Host:    hans
23:23:44 Build OS:      Linux 2.6.18-128.el5
23:23:44 Build Date:    Mon Oct 23 07:06:25 CDT 2017
23:23:44 GLS Version:   glslib-6.00.FC11
23:23:44 
23:23:44 set CM_TIMEOUT 60
23:23:44 set global EVENT_TIMEOUT 60
23:23:44 set global SECONDARY_EVENT_TIMEOUT 60
23:23:44 DEBUG[TID139636001822496]:add type primary to SLA sla_cluster_1 [cmsm_sla.c:parse_sla:2412]
23:23:44 SLA sla_cluster_1 listener mode REDIRECT
23:23:44 Connection Manager name is whics-t10
23:23:44 Starting Connection Manager...
23:23:44 DEBUG[TID139636001822496]:Password File /opt/informix-12.10.fc10/etc/passwd_file failed error:No such file or directory[2] [cmsm_main.c:cmsm_pw_init:1782]
23:23:44 Warning: Password Manager failed; working in trusted node mode
23:23:44 the current maximum number of file descriptors is 32767
23:23:44 DEBUG[TID139636001822496]:new daemon pid is 3804 [cmsm_main.c:cmsm_daemonize:4653]
23:23:44 DEBUG[TID139636001822496]:dbservername = cluster_1 [cmsm_server.ec:cmsm_primary:2819]
23:23:44 DEBUG[TID139636001822496]:nettype =  [cmsm_server.ec:cmsm_primary:2820]
23:23:44 DEBUG[TID139636001822496]:hostname = - [cmsm_server.ec:cmsm_primary:2821]
23:23:44 DEBUG[TID139636001822496]:servicename = - [cmsm_server.ec:cmsm_primary:2822]
23:23:44 DEBUG[TID139636001822496]:options = i=10 [cmsm_server.ec:cmsm_primary:2823]
23:23:44 DEBUG[TID139636001822496]:connect to cluster_1 for CLUSTER cluster_1 [cmsm_server.ec:cmsm_primary:2839]
23:23:44 DEBUG[TID139636001822496]:connect to group cluster_1 server awwdst10a_t10 for CLUSTER cluster_1 [cmsm_server.ec:cmsm_primary:2860]
23:23:44 DEBUG[TID139636001822496]:create new thread -1995749632 for awwdst10a_t10 [cmsm_sla.c:cmsm_update_server_ex:996]
23:23:44 DEBUG[TID139636001822496]:connect to group cluster_1 server awwdst10b_t10 for CLUSTER cluster_1 [cmsm_server.ec:cmsm_primary:2860]
23:23:44 DEBUG[TID139636001822496]:create new thread -1997850880 for awwdst10b_t10 [cmsm_sla.c:cmsm_update_server_ex:996]
23:23:44 listener sla_cluster_1 initializing
23:23:44 DEBUG[TID139635976775424]:listenter sla_cluster_1 host=localhost service=informix_t10 nettype=soctcp engtypeon [cmsm_main.c:cmsm_listener_thread:1166]
23:23:49 DEBUG[TID139635980977920]:CONNECT to @awwdst10a_t10|onsoctcp|awwdst10a|informix_t10 AS awwdst10a_t101   SQLCODE = (-908,107,awwdst10a_t10) [cmsm_server.ec:cmsm_connect_byurl_opt:136]
23:23:49 ALARM 4001 unable to connect to Informix server awwdst10a_t10 from awguht10 error -908
23:23:49 DEBUG[TID139635980977920]:Arbitrator return, server awwdst10a_t10 is not primary [cmsm_arb.c:arb_server_down:1399]
23:23:50 DEBUG[TID139635978876672]:CONNECT to @awwdst10b_t10|onsoctcp|awwdst10b|informix_t10 AS awwdst10b_t101   SQLCODE = (0,0,) [cmsm_server.ec:cmsm_connect_byurl_opt:136]
23:23:50 DEBUG[TID139635978876672]:database sysmaster @awwdst10b_t10|onsoctcp|awwdst10b|informix_t10 AS awwdst10b_t101   SQLCODE = (0,0,) [cmsm_server.ec:cmsm_connect_byurl_opt:145]
23:23:50 Listener sla_cluster_1 DBSERVERS=primary is active with 4 worker threads
23:23:50 DEBUG[TID139635978876672]:awwdst10b_t10 protocols = 1 [cmsm_server.ec:cmsm_add_dbaliases:2427]
23:23:50 DEBUG[TID139635978876672]:skip awwdst10b_t10 alias awwdst10b_t10_drda with protocol  1 [cmsm_server.ec:cmsm_add_dbaliases:2694]
23:23:50 DEBUG[TID139635978876672]:SQL fetch sysconfig_cursor sqlcode = (100,0,) [cmsm_server.ec:cmsm_add_dbaliases:2540]
23:23:52 DEBUG[TID139635980977920]:CONNECT to @awwdst10a_t10|onsoctcp|awwdst10a|informix_t10 AS awwdst10a_t102   SQLCODE = (-908,107,awwdst10a_t10) [cmsm_server.ec:cmsm_connect_byurl_opt:136]
23:23:52 DEBUG[TID139635980977920]:Arbitrator return, server awwdst10a_t10 is not primary [cmsm_arb.c:arb_server_down:1399]
23:24:12 DEBUG[TID139635980977920]:Monitor awwdst10a_t10 exit by arbitrator  [cmsm_er.ec:cmsm_er_monitor:2766]
23:24:12 DEBUG[TID139636001822496]:fcntl(/opt/informix-12.10.fc10/tmp/cmsm.pid.whics-t10) success error:Success[0] [cmsm_main.c:cmsm_pidfile:2193]
23:24:12 DEBUG[TID139635968210688]:Warning: no server returned for SLA sla_cluster_1 [cmsm_sla.c:select_server_from_sla:3037]
**23:24:12 Connection Manager successfully connected to awwdst10b_t10**
**23:24:12 DEBUG[TID139635966109440]:Warning: no server returned for SLA sla_cluster_1 [cmsm_sla.c:select_server_from_sla:3037]**
**23:24:12 DEBUG[TID139635966109440]:SLA sla_cluster_1 cannot find database server for SQLI client  [cmsm_main.c:cmsm_process_sqli:1522]**
23:24:12 Connection Manager started successfully
23:24:12 DEBUG[TID139635980977920]:all monitors exited, server awwdst10a_t10 register count = 0 [cmsm_er.ec:cmsm_er_monitor:3133]
23:24:12 DEBUG[TID139635968210688]:SLA sla_cluster_1 cannot find database server for SQLI client  [cmsm_main.c:cmsm_process_sqli:1522]
23:24:12 The server type of cluster cluster_1 server awwdst10a_t10 is Primary.
23:24:12 DEBUG[TID139635978876672]:create new thread -1995749632 for awwdst10a_t10 [cmsm_sla.c:cmsm_update_server_ex:996]
23:24:12 DEBUG[TID139635978876672]:cluster cluster_1 primary is not active [cmsm_arb.c:arb_set_primary:774]
23:24:12 DEBUG[TID139635978876672]:server awwdst10b_t10 version is 12.10 [cmsm_er.ec:cmsm_er_monitor:2971]
23:24:12 DEBUG[TID139635978876672]:ACTIVE MONITOR awwdst10b_t10 awwdst10b_t10 172.29.26.122 53089 [cmsm_er.ec:cmsm_er_monitor:3013]
23:24:12 DEBUG[TID139635978876672]:register CM successfully whics-t10 with token 1838996547 [cmsm_server.ec:cmsm_cm_register:2330]
23:24:12 DEBUG[TID139635978876672]:Register server awwdst10b_t10 for awwdst10b_t10 count = 1 [cmsm_er.ec:cmsm_er_event_process:1280]
23:24:12 DEBUG[TID139635978876672]:get awwdst10b_t10 CLUST_CHG event 1:4[DISCONNECT] awwdst10a_t10|Primary| [cmsm_er.ec:cmsm_er_event_process:1733]
23:24:17 DEBUG[TID139635980977920]:CONNECT to @awwdst10a_t10|onsoctcp|awwdst10a|informix_t10 AS awwdst10a_t103   SQLCODE = (-908,107,awwdst10a_t10) [cmsm_server.ec:cmsm_connect_byurl_opt:136]
23:24:17 DEBUG[TID139635980977920]:Arbitrator return, server awwdst10a_t10 is not a DBSERVERALIAS of primary [cmsm_arb.c:arb_server_down:1417]
23:24:17 DEBUG[TID139635980977920]:Monitor awwdst10a_t10 exit by arbitrator  [cmsm_er.ec:cmsm_er_monitor:2766]
23:24:17 DEBUG[TID139635980977920]:all monitors exited, server awwdst10a_t10 register count = 0 [cmsm_er.ec:cmsm_er_monitor:3133]
23:24:46 The server type of cluster cluster_1 server awwdst10a_t10 is Primary.
23:24:46 DEBUG[TID139635978876672]:create new thread -1995749632 for awwdst10a_t10 [cmsm_sla.c:cmsm_update_server_ex:996]
23:24:52 DEBUG[TID139635980977920]:CONNECT to @awwdst10a_t10|onsoctcp|awwdst10a|informix_t10 AS awwdst10a_t104   SQLCODE = (-908,107,awwdst10a_t10) [cmsm_server.ec:cmsm_connect_byurl_opt:136]
23:24:52 DEBUG[TID139635980977920]:Arbitrator return, server awwdst10a_t10 is not a DBSERVERALIAS of primary [cmsm_arb.c:arb_server_down:1417]
23:24:52 DEBUG[TID139635980977920]:Monitor awwdst10a_t10 exit by arbitrator  [cmsm_er.ec:cmsm_er_monitor:2766]
23:24:52 DEBUG[TID139635980977920]:all monitors exited, server awwdst10a_t10 register count = 0 [cmsm_er.ec:cmsm_er_monitor:3133]
23:25:02 DEBUG[TID139635896350464]:Warning: no server returned for SLA sla_cluster_1 [cmsm_sla.c:select_server_from_sla:3037]
23:25:02 DEBUG[TID139635896350464]:SLA sla_cluster_1 cannot find database server for SQLI client  [cmsm_main.c:cmsm_process_sqli:1522]
23:25:20 The server type of cluster cluster_1 server awwdst10a_t10 is Primary.
23:25:23 DEBUG[TID139635980977920]:Arbitrator return, server awwdst10a_t10 is not a DBSERVERALIAS of primary [cmsm_arb.c:arb_server_down:1417]
23:25:23 DEBUG[TID139635980977920]:Monitor awwdst10a_t10 exit by arbitrator  [cmsm_er.ec:cmsm_er_monitor:2766]
23:25:23 DEBUG[TID139635980977920]:all monitors exited, server awwdst10a_t10 register count = 0 [cmsm_er.ec:cmsm_er_monitor:3133]

Connection Manager - pre failure

    17:09:02 IBM Informix  Connection Manager
    17:09:02 IBM Informix CSDK Version 4.10, IBM Informix-ESQL Version 4.10.FC10 
   .....
    17:09:02 set CM_TIMEOUT 60
    17:09:02 set global EVENT_TIMEOUT 60
    17:09:02 set global SECONDARY_EVENT_TIMEOUT 60
    17:09:02 DEBUG[TID139992173627168]:add type primary to SLA sla_cluster_1 [cmsm_sla.c:parse_sla:2412]
    17:09:02 SLA sla_cluster_1 listener mode REDIRECT
    17:09:02 Connection Manager name is whics-t10
    17:09:02 Starting Connection Manager...
    17:09:02 DEBUG[TID139992173627168]:dbservername = cluster_1 [cmsm_server.ec:cmsm_primary:2819]
    17:09:02 DEBUG[TID139992173627168]:nettype =  [cmsm_server.ec:cmsm_primary:2820]
    17:09:02 DEBUG[TID139992173627168]:hostname = - [cmsm_server.ec:cmsm_primary:2821]
    17:09:02 DEBUG[TID139992173627168]:servicename = - [cmsm_server.ec:cmsm_primary:2822]
    17:09:02 DEBUG[TID139992173627168]:options = i=10 [cmsm_server.ec:cmsm_primary:2823]
    17:09:02 DEBUG[TID139992173627168]:connect to cluster_1 for CLUSTER cluster_1 [cmsm_server.ec:cmsm_primary:2839]
    17:09:02 DEBUG[TID139992173627168]:connect to group cluster_1 server awwdst10b_t10 for CLUSTER cluster_1 [cmsm_server.ec:cmsm_primary:2860]
    17:09:02 DEBUG[TID139992173627168]:create new thread 1986635520 for awwdst10b_t10 [cmsm_sla.c:cmsm_update_server_ex:996]
    17:09:02 listener sla_cluster_1 initializing
    17:09:02 DEBUG[TID139992152782592]:CONNECT to @awwdst10a_t10|onsoctcp|awwdst10a|informix_t10 AS awwdst10a_t101   SQLCODE = (0,0,) [cmsm_server.ec:cmsm_connect_byurl_opt:136]
    17:09:02 DEBUG[TID139992152782592]:database sysmaster @awwdst10a_t10|onsoctcp|awwdst10a|informix_t10 AS awwdst10a_t101   SQLCODE = (0,0,) [cmsm_server.ec:cmsm_connect_byurl_opt:145]
    17:09:02 DEBUG[TID139992152782592]:awwdst10a_t10 protocols = 1 [cmsm_server.ec:cmsm_add_dbaliases:2427]
    17:09:02 DEBUG[TID139992152782592]:skip awwdst10a_t10 alias awwdst10a_t10_drda with protocol  1 [cmsm_server.ec:cmsm_add_dbaliases:2694]
  .......
    setting primary name = awwdst10a_t10 [cmsm_arb.c:arb_set_primary:795]
    17:09:02 Cluster cluster_1 Arbitrator FOC ORDER=ENABLED        PRIORITY=1
    17:09:02 Connection Manager successfully connected to awwdst10a_t10
    17:09:02 The server type of cluster cluster_1 server awwdst10b_t10 is HDR.
    17:09:02 DEBUG[TID139992150681344]:CONNECT to @awwdst10b_t10|onsoctcp|awwdst10b|informix_t10 AS awwdst10b_t101   SQLCODE = (0,0,) [cmsm_server.ec:cmsm_connect_byurl_opt:136]
    17:09:02 DEBUG[TID139992152782592]:server awwdst10a_t10 version is 12.10 [cmsm_er.ec:cmsm_er_monitor:2971]
    17:09:02 DEBUG[TID139992152782592]:register CM successfully whics-t10 with token 216854841 [cmsm_server.ec:cmsm_cm_register:2330]
    17:09:02 DEBUG[TID139992152782592]:Register server awwdst10a_t10 for awwdst10a_t10 count = 1 [cmsm_er.ec:cmsm_er_event_process:1280]
    17:09:03 DEBUG[TID139992150681344]:database sysmaster @awwdst10b_t10|onsoctcp|awwdst10b|informix_t10 AS awwdst10b_t101   SQLCODE = (0,0,) [cmsm_server.ec:cmsm_connect_byurl_opt:145]
    17:09:03 Listener sla_cluster_1 DBSERVERS=primary is active with 4 worker threads
    17:09:03 DEBUG[TID139992150681344]:awwdst10b_t10 protocols = 1 [cmsm_server.ec:cmsm_add_dbaliases:2427]
    17:09:03 DEBUG[TID139992150681344]:skip awwdst10b_t10 alias awwdst10b_t10_drda with protocol  1 [cmsm_server.ec:cmsm_add_dbaliases:2694]
    17:09:14 Connection Manager successfully connected to awwdst10b_t10
    17:09:14 DEBUG[TID139992173627168]:fcntl(/opt/informix-12.10.fc10/tmp/cmsm.pid.whics-t10) success error:Success[0] [cmsm_main.c:cmsm_pidfile:2193]
    17:09:14 Connection Manager started successfully
    17:09:14 SLA sla_cluster_1 redirect SQLI client from ::1 to awwdst10a_t10 172.29.26.129.51001
    17:09:14 DEBUG[TID139992150681344]:server awwdst10b_t10 version is 12.10 [cmsm_er.ec:cmsm_er_monitor:2971]
  ..........
    17:11:01 SLA sla_cluster_1 redirect SQLI client from ::1 to awwdst10a_t10 172.29.26.129.51001

awwdst10a - pre failure

16:47:21  IBM Informix Dynamic Server Started.
16:47:21  Requested shared memory segment size rounded from 275020KB to 280576K

Mon Oct 22 16:47:23 2018


16:47:23  Requested shared memory segment size rounded from 405443KB to 405504KB
16:47:23  Shared memory segment will use huge pages.
16:47:23  Segment locked: addr=0x8a600000, size=415236096
16:47:23  Successfully added a bufferpool of page size 16K.
......
16:47:28  Affinitied VP 1 to phys proc 1
16:47:28  DR: DRAUTO is 3 (CMSM)
16:47:28  DR: ENCRYPT_HDR is 0 (HDR encryption Disabled)
16:47:28  Event notification facility epoll enabled.
16:47:28  Trusted host cache successfully built:/etc/hosts.equiv.
16:47:28  IBM Informix Dynamic Server Version 12.10.FC10 Software 
16:47:29  Performance Advisory: The current size of the physical log buffer is smaller than recommended.
16:47:29   Results: Transaction performance might not be optimal.
16:47:29   Action: For better performance, increase the physical log buffer size to 128.
16:47:30  IBM Informix Dynamic Server Initialized -- Shared Memory Initialized.

16:47:30  Started 1 B-tree scanners.
16:47:30  B-tree scanner threshold set at 5000.
16:47:30  B-tree scanner range scan size set to -1.
16:47:30  B-tree scanner ALICE mode set to 6.
16:47:30  B-tree scanner index compression level set to med.
16:47:30  DR: Reservation of the last logical log for log backup turned on
16:47:30  DR: Trying to connect to secondary server = awwdst10b_t10
16:47:30  DR: Cannot connect to secondary server
16:47:30  DR: Turned off on primary server
16:47:30  Physical Recovery Started at Page (3:152094).
16:47:30  Physical Recovery Complete: 0 Pages Examined, 0 Pages Restored.
16:47:30  Logical Recovery Started.
16:47:30  10 recovery worker threads will be started.
16:47:30  Logical Recovery has reached the transaction cleanup phase.
16:47:30  Logical Recovery Complete.
          0 Committed, 0 Rolled Back, 0 Open, 0 Bad Locks

16:47:31  Dataskip is now OFF for all dbspaces
16:47:32  Checkpoint Completed:  duration was 0 seconds.
16:47:32  Mon Oct 22 - loguniq 54478, logpos 0xe30c0, timestamp: 0xd41004bc Interval: 399667

16:47:39  Logical Log 54478 Complete, timestamp: 0xd41102e7.
16:47:41   Action: Increase the size of the individual logical log files so
          that it takes at least 30 seconds to fill each one. Look at the
          online log to determine how quickly the log files are filling, and
          then increase the size of the files proportionately.
16:47:41  Logical Log 54479 Complete, timestamp: 0xd411f36b.
16:47:44  DR: Primary server connected
16:47:44  DR: Secondary server needs failure recovery

16:47:46  DR_ERR set to -1
16:47:46  DR: Failure recovery error (14)
16:47:46  Logical Log 54480 Complete, timestamp: 0xd4131922.
16:47:47  DR: Turned off on primary server
16:47:47  DR: Cannot connect to secondary server
16:47:54  Logical Log 54483 Complete, timestamp: 0xd415bc16.
16:47:58  DR: Primary server connected
16:47:58  DR: Secondary server needs failure recovery

16:48:00  DR: Sending log 54478, size 2500 pages, 100.00 percent used
16:48:00  Logical Log 54484 Complete, timestamp: 0xd4172711.
.
16:48:05  Logical Log 54485 Complete, timestamp: 0xd4180cc0.
16:48:07  DR: Sending log 54482, size 2500 pages, 100.00 percent used
16:48:20  Checkpoint Completed:  duration was 1 seconds.
17:09:02  CM:Connection manager whics-t10 registered with the server
17:09:18  Logical Log 54478 - Backup Started
17:09:22  Logical Log 54488 - Backup Completed
17:09:22  Logical Log 54489 - Backup Started
17:33:32  Checkpoint Completed:  duration was 0 seconds.
17:33:32  Mon Oct 22 - loguniq 54494, logpos 0x3aa018, timestamp: 0xd41e0c9d Interval: 399671

17:33:32  Maximum server connections 2 
17:33:32  Checkpoint Statistics - Avg. Txn Block Time 0.000, # Txns blocked 0, Plog used 24, Llog used 9

awwdst10b - pre failure

16:47:46  IBM Informix Dynamic Server Started.

Mon Oct 22 16:47:47 2018


16:47:48  Requested shared memory segment size rounded from 405443KB to 405504KB
16:47:48  Shared memory segment will use huge pages.
16:47:48  Segment locked: addr=0x8a600000, size=415236096
16:47:48  Successfully added a bufferpool of page size 16K.

16:47:48  Warning: ONCONFIG dump directory (DUMPDIR) '/dump/informix/t10' has insecure permissions
16:47:48  Event alarms enabled.  ALARMPROG = '/usr/informix/etc/no_log.sh'
16:47:48  Booting Language <c> from module <>
16:47:48  Loading Module <CNULL>
16:47:48  Booting Language <builtin> from module <>
16:47:48  Loading Module <BUILTINNULL>
16:47:53  Affinitied VP 1 to phys proc 1
16:47:53  DR: DRAUTO is 3 (CMSM)
16:47:53  DR: ENCRYPT_HDR is 0 (HDR encryption Disabled)
16:47:53  Event notification facility epoll enabled.
16:47:53  Trusted host cache successfully built:/etc/hosts.equiv.
16:47:53  CCFLAGS2 value set to 0x200
16:47:53  SQL_FEAT_CTRL value set to 0x8008
16:47:53  SQL_DEF_CTRL value set to 0x4b0
16:47:53  IBM Informix Dynamic Server Version 12.10.FC10 Software Serial Number AAA#B000000
16:47:54  Performance Advisory: The current size of the physical log buffer is smaller than recommended.
16:47:54   Results: Transaction performance might not be optimal.
16:47:54   Action: For better performance, increase the physical log buffer size to 128.
16:47:54  IBM Informix Dynamic Server Initialized -- Shared Memory Initialized.

16:47:54  Started 1 B-tree scanners.
16:47:54  B-tree scanner threshold set at 5000.
16:47:54  B-tree scanner range scan size set to -1.
16:47:54  B-tree scanner ALICE mode set to 6.
16:47:54  B-tree scanner index compression level set to med.
16:47:54  Physical Recovery Started at Page (3:151957).
16:47:54  Physical Recovery Complete: 0 Pages Examined, 0 Pages Restored.
16:47:54  DR: Trying to connect to primary server = awwdst10a_t10
16:47:54  smx creates 1 transports to server awwdst10a_t10
16:47:55  Dataskip is now OFF for all dbspaces
16:47:55  Restartable Restore has been ENABLED
16:47:55  Recovery Mode
16:47:57  DR: Secondary server connected
16:47:58  DR: Secondary server needs failure recovery

16:47:59  DR: Failure recovery from disk in progress ... 
16:47:59  Logical Recovery Started.
16:47:59  10 recovery worker threads will be started.
16:47:59  Start Logical Recovery - Start Log 54478, End Log ?
16:47:59  Starting Log Position - 54478 0xdd018
16:48:00  Started processing open transactions on secondary during startup
          Server in fast recovery until these open transactions finish
16:48:00  Finished processing open transactions on secondary during startup.
16:48:00  Checkpoint Completed:  duration was 0 seconds.
16:48:00  Mon Oct 22 - loguniq 54478, logpos 0xdf0c0, timestamp: 0xd410117e Interval: 399666

16:48:00  Checkpoint Completed:  duration was 0 seconds.
16:48:00  Mon Oct 22 - loguniq 54478, logpos 0xe30c0, timestamp: 0xd41011cf Interval: 399668

16:48:00  Maximum server connections 0 
16:48:00  Checkpoint Statistics - Avg. Txn Block Time 0.000, # Txns blocked 0, Plog used 2, Llog used 0
16:48:03  Logical Log 54478 Complete, timestamp: 0xd410f549.
16:48:17  Logical Log 54486 Complete, timestamp: 0xd418e8e4.
16:48:19  B-tree scanners disabled.
16:48:20  DR: HDR secondary server operational
16:48:20  Checkpoint Completed:  duration was 0 seconds.
16:48:20  Mon Oct 22 - loguniq 54487, logpos 0xd018, timestamp: 0xd418e973 Interval: 399669
16:48:20  Maximum server connections 0 
16:48:20  Checkpoint Statistics - Avg. Txn Block Time 0.000, # Txns blocked 0, Plog used 7761, Llog used 0
16:48:51  Logical Log 54492 Complete, timestamp: 0xd41d1f30.
16:56:04  Logical Log 54493 Complete, timestamp: 0xd41dafde.
17:03:31  Checkpoint Completed:  duration was 0 seconds.
17:03:31  Mon Oct 22 - loguniq 54494, logpos 0x35f598, timestamp: 0xd41dff42 Interval: 399670


17:09:03  Booting Language <spl> from module <>
17:09:03  Loading Module <SPLNULL>
17:09:14  CM:Connection manager whics-t10 registered with the server
17:18:32  Checkpoint Completed:  duration was 0 seconds.
17:18:32  Mon Oct 22 - loguniq 54494, logpos 0x3a1018, timestamp: 
17:18:32  Maximum server connections 1 
17:18:32  Checkpoint Statistics - Avg. Txn Block Time 0.000, # Txns blocked 0, Plog used 27, Llog used 0

17:33:32  Checkpoint Completed:  duration was 0 seconds.
17:33:32  Mon Oct 22 - loguniq 54494, logpos 0x3aa018, timestamp: 0xd41e0c7e Interval: 399672

The behaviour described is not the expected behaviour (not the behaviour I'd expect). I assume that there's no question of simple network timeouts being too long? I wonder if the network address that was used is still faintly responsive because although the VM has crashed (been powered off), but the host OS has not died? — Jonathan Leffler, Jul 23 '18 at 03:55
`DRAUTO 3` means that there should be a connection manager somewhere controlling the failover. Do you have a connection manager configured and it's logs ? — Luís Marques, Jul 23 '18 at 07:15
Did a connection manager setup with similar settings to what you wrote and it did failover to the secondary. However, it took over 60s from powering off the primary until the promotion of the secondary. — Luís Marques, Aug 02 '18 at 20:16

Luís Marques · Answer 1 · 2018-08-04T00:28:59.447

Did a lab setup and I will post some of the settings and logs so you can compare.

I have Informix server ifx_a as the primary on machine1 and Informix server ifx_b as the HDR secondary on machine2.

On the Informix servers ifx_a I have the following replication parameters. HA_ALIAS is equal to the DBSERVERNAME, to keep it simple). Informix server ifx_b parameters are the same, except HA_ALIAS and DBSERVERNAME.

DRAUTO                  3
DRINTERVAL              0
HDR_TXN_SCOPE           ASYNC
DRTIMEOUT               30
HA_ALIAS                ifx_a
HA_FOC_ORDER            SDS,HDR,RSS
DRLOSTFOUND             $INFORMIXDIR/etc/dr.lostfound
DRIDXAUTO               0
LOG_INDEX_BUILDS        1
SDS_ENABLE
SDS_TIMEOUT             20
SDS_TEMPDBS
SDS_PAGING
SDS_LOGCHECK            10
SDS_ALTERNATE           NONE
SDS_FLOW_CONTROL        0
UPDATABLE_SECONDARY     0
FAILOVER_CALLBACK
FAILOVER_TX_TIMEOUT     0
TEMPTAB_NOLOG           1
DELAY_APPLY             0
STOP_APPLY              0
LOG_STAGING_DIR         $INFORMIXDIR/tmp
RSS_FLOW_CONTROL        0
SMX_NUMPIPES            1
ENABLE_SNAPSHOT_COPY    0
SMX_COMPRESS            0
SMX_PING_INTERVAL       10
SMX_PING_RETRY          6
CLUSTER_TXN_SCOPE       SERVER

The sqlhosts file for both Informix servers is:

ifx_a    onsoctcp    machine1.local    15010    k=1
ifx_b    onsoctcp    machine2.local    15020    k=1

Replication is setup and working, on the secondary ifx_b I run:

$ onstat -g cluster

IBM Informix Dynamic Server Version 12.10.FC12 -- Read-Only (Sec) -- Up 00:05:08 -- 156276 Kbytes

Primary Server:ifx_a
Index page logging status: Enabled
Index page logging was enabled at: 2018/07/31 23:44:14


Server ACKed Log    Supports     Status
       (log, page)  Updates
ifx_b  16,6         No           ASYNC(HDR),Connected,On


$ onstat -g dri

IBM Informix Dynamic Server Version 12.10.FC12 -- Read-Only (Sec) -- Up 00:12:23 -- 156276 Kbytes

Data Replication at 0x45ac6028:
  Type           State        Paired server        Last DR CKPT (id/pg)    Supports Proxy Writes
  HDR Secondary  on           ifx_a                        16 / 5          N

  DRINTERVAL   0
  DRTIMEOUT    30
  DRAUTO       3
  DRLOSTFOUND  /opt/IBM/informix/12.10/etc/dr.lostfound
  DRIDXAUTO    0
  ENCRYPT_HDR  0
  Backlog      0
  Last Send    2018/08/01 00:12:12
  Last Receive 2018/08/01 00:12:12
  Last Ping    2018/08/01 00:12:03
  Last log page applied(log id,page): 0,0

On the primary ifx_a I run:

$ onstat -g cluster

IBM Informix Dynamic Server Version 12.10.FC12 -- On-Line (Prim) -- Up 00:21:30 -- 156276 Kbytes

Primary Server:ifx_a
Current Log Page:16,6
Index page logging status: Enabled
Index page logging was enabled at: 2018/07/31 23:44:14


Server ACKed Log    Applied Log  Supports     Status
       (log, page)  (log, page)  Updates
ifx_b  16,6         16,6         No           ASYNC(HDR),Connected,On


$ onstat -g dri

IBM Informix Dynamic Server Version 12.10.FC12 -- On-Line (Prim) -- Up 00:27:49 -- 156276 Kbytes

Data Replication at 0x45ac6028:
  Type           State        Paired server        Last DR CKPT (id/pg)    Supports Proxy Writes
  primary        on           ifx_b                        16 / 5          NA

  DRINTERVAL   0
  DRTIMEOUT    30
  DRAUTO       3
  DRLOSTFOUND  /opt/IBM/informix/12.10/etc/dr.lostfound
  DRIDXAUTO    0
  ENCRYPT_HDR  0
  Backlog      0
  Last Send    2018/08/01 00:11:53
  Last Receive 2018/08/01 00:11:53
  Last Ping    2018/08/01 00:11:32
  Last log page applied(log id,page): 16,6

So replication is working and the servers are synchronized.

On a third machine machine3 I setup the connection manager with the following parameters:

NAME            icm_1
LOG             1
LOGFILE         ${INFORMIXDIR}/tmp/icm_1.log
DEBUG           1

CM_TIMEOUT 60
EVENT_TIMEOUT 60
SECONDARY_EVENT_TIMEOUT 60
SQLHOSTS LOCAL

CLUSTER ifx_abc
{
    INFORMIXSERVER servers_abc

    SLA sla_ifx_abc    DBSERVERS=PRI
    FOC ORDER=ENABLED \
       PRIORITY=1
}

And the sqlhosts file for the connection manager:

servers_abc  group       -                 -        i=10,c=1,e=ifx_b
ifx_a        onsoctcp    machine1.local    15010    k=1,g=servers_abc
ifx_b        onsoctcp    machine2.local    15020    k=1,g=servers_abc

sla_ifx_abc  onsoctcp    machine3.local    15030    k=1

Now I forcefully bring down machine1 an I get the following on the online log of Informix server ifx_b:

08/02/18 01:00:26  Maximum server connections 1
08/02/18 01:00:26  Checkpoint Statistics - Avg. Txn Block Time 0.000, # Txns blocked 0, Plog used 38, Llog used 0

08/02/18 01:06:39  The SMX connection between high availability servers was closed because the
 peer server was unresponsive for the timeout period (60 seconds times the
 number of retries).
08/02/18 01:06:39  The SMX connection between high availability servers was closed because the
 peer server was unresponsive for the timeout period (60 seconds times the
 number of retries).
08/02/18 01:07:28  DR: ping timeout
08/02/18 01:07:28  DR: Receive error
08/02/18 01:07:28  dr_secrcv thread : asfcode = -25582: oserr = 4: errstr = : Network connection is broken.
 System error = 4.
08/02/18 01:07:28  DR_ERR set to -1
08/02/18 01:07:28  DR: Receive Btree error
08/02/18 01:07:29  DR: Turned off on secondary server
08/02/18 01:07:36  SCHAPI: Issued Task() or Admin() command "task( 'ha make primary force', 'ifx_b' )".
08/02/18 01:07:37  Skipping failover callback.
08/02/18 01:07:48  Logical Recovery has reached the transaction cleanup phase.
08/02/18 01:07:48  Checkpoint Completed:  duration was 0 seconds.
08/02/18 01:07:48  Thu Aug  2 - loguniq 19, logpos 0xcb4018, timestamp: 0xf287b Interval: 148

08/02/18 01:07:48  Maximum server connections 2
08/02/18 01:07:48  Checkpoint Statistics - Avg. Txn Block Time 0.000, # Txns blocked 0, Plog used 72, Llog used 1

08/02/18 01:07:48  Logical Recovery Complete.
          9001 Committed, 0 Rolled Back, 0 Open, 0 Bad Locks

08/02/18 01:07:48  Logical Recovery Complete.
08/02/18 01:07:48  Performance Advisory: Based on the current workload, the physical log might be too small to
accommodate the time it takes to flush the buffer pool.
08/02/18 01:07:48   Results: The server might block transactions during checkpoints.
08/02/18 01:07:48   Action: If transactions are blocked during the checkpoint, increase the size of the
 physical log to at least 716800 KB.
08/02/18 01:07:48  Performance Advisory: Based on the current workload, the logical log space might be too small to
accommodate the time it takes to flush the buffer pool.
08/02/18 01:07:48   Results: The server might block transactions during checkpoints.
08/02/18 01:07:48   Action: If transactions are blocked during the checkpoint, increase the size of the
 logical log space to at least 89600 KB.
08/02/18 01:07:48  Performance Advisory: The physical log is too small for automatic checkpoints.
08/02/18 01:07:48   Results: Automatic checkpoints are disabled.
08/02/18 01:07:48   Action: To enable automatic checkpoints, increase the physical log to at least 716800 KB.
08/02/18 01:07:48  Quiescent Mode
08/02/18 01:07:48  Checkpoint Completed:  duration was 0 seconds.
08/02/18 01:07:48  Thu Aug  2 - loguniq 19, logpos 0xcb6018, timestamp: 0xf288d Interval: 149

08/02/18 01:07:48  Maximum server connections 2
08/02/18 01:07:48  Checkpoint Statistics - Avg. Txn Block Time 0.000, # Txns blocked 0, Plog used 4, Llog used 2

08/02/18 01:07:48  B-tree scanners enabled.
08/02/18 01:07:48  DR: Reservation of the last logical log for log backup turned on
08/02/18 01:07:48  DR: new type = primary, secondary server name = ifx_a
08/02/18 01:07:48  DR: Trying to connect to secondary server = ifx_a
08/02/18 01:07:48  Starting BldNotification
08/02/18 01:07:49  On-Line Mode
08/02/18 01:07:50  SCHAPI: Started dbScheduler thread.
08/02/18 01:07:50  Auto Registration is synced
08/02/18 01:07:50  SCHAPI: Started 2 dbWorker threads.
08/02/18 01:07:51  DR: Cannot connect to secondary server
08/02/18 01:07:51  DR: Turned off on primary server
08/02/18 01:07:51  DDR Log staging: Using the directory /opt/IBM/informix/12.10/tmp/ifmxddrlog_2.
08/02/18 01:07:51  SCHAPI: dbutil threads is already running.
08/02/18 01:07:51  SCHAPI: dbScheduler threads is already running.
08/02/18 01:07:52  Defragmenter cleaner thread now running
08/02/18 01:07:52  Defragmenter cleaner thread cleaned:0 partitions
08/02/18 01:08:12  DR: Cannot connect to secondary server
08/02/18 01:08:12  DR: Turned off on primary server
08/02/18 01:09:14  DR: Cannot connect to secondary server
08/02/18 01:09:14  DR: Turned off on primary server
08/02/18 01:10:14  DR: Cannot connect to secondary server
08/02/18 01:10:14  DR: Turned off on primary server
08/02/18 01:11:14  DR: Cannot connect to secondary server
08/02/18 01:11:14  DR: Turned off on primary server
08/02/18 01:12:14  DR: Cannot connect to secondary server
08/02/18 01:12:14  DR: Turned off on primary server
08/02/18 01:12:53  Checkpoint Completed:  duration was 0 seconds.
08/02/18 01:12:53  Thu Aug  2 - loguniq 19, logpos 0xcf0704, timestamp: 0xf3ab0 Interval: 150

08/02/18 01:12:53  Maximum server connections 2
08/02/18 01:12:53  Checkpoint Statistics - Avg. Txn Block Time 0.000, # Txns blocked 0, Plog used 39, Llog used 58

08/02/18 01:13:14  DR: Cannot connect to secondary server
08/02/18 01:13:14  DR: Turned off on primary server

And on the connection manager log I get:

Thu Aug  2 00:57:42 2018
00:57:42 IBM Informix Connection Manager
00:57:42 IBM Informix CSDK Version 4.10, IBM Informix-ESQL Version 4.10.FC12
00:57:42 Build Number:  N037
00:57:42 Build Host:    hans
00:57:42 Build OS:      Linux 2.6.18-128.el5
00:57:42 Build Date:    Tue Jun 26 08:55:45 CDT 2018
00:57:42 GLS Version:   glslib-6.00.FC13
00:57:42
00:57:42 set CM_TIMEOUT 60
00:57:42 set global EVENT_TIMEOUT 60
00:57:42 set global SECONDARY_EVENT_TIMEOUT 60
00:57:42 DEBUG[TID139959615362880]:add type PRI to SLA sla_ifx_abc [cmsm_sla.c:parse_sla:2412]
00:57:42 SLA sla_ifx_abc listener mode REDIRECT
00:57:42 Connection Manager name is icm_1
00:57:42 Starting Connection Manager...
00:57:42 DEBUG[TID139959615362880]:Password File /opt/IBM/informix/4.10/etc/passwd_file failed error:No such file or directory[2] [cmsm_main.c:cmsm_pw_init:1782]
00:57:42 Warning: Password Manager failed; working in trusted node mode
00:57:42 set the maximum number of file descriptors 32767 failed
00:57:42 DEBUG[TID139959615362880]:setrlimit(RLIMIT_NOFILE, 32767) = -1, error:Operation not permitted[1] [cmsm_main.c:cmsm_run:4901]
00:57:42 the current maximum number of file descriptors is 1024
00:57:42 DEBUG[TID139959615362880]:new daemon pid is 11761 [cmsm_main.c:cmsm_daemonize:4653]
00:57:42 DEBUG[TID139959615362880]:dbservername = servers_abc [cmsm_server.ec:cmsm_primary:2825]
00:57:42 DEBUG[TID139959615362880]:nettype =  [cmsm_server.ec:cmsm_primary:2826]
00:57:42 DEBUG[TID139959615362880]:hostname = - [cmsm_server.ec:cmsm_primary:2827]
00:57:42 DEBUG[TID139959615362880]:servicename = - [cmsm_server.ec:cmsm_primary:2828]
00:57:42 DEBUG[TID139959615362880]:options = i=10,c=1,e=ifx_b [cmsm_server.ec:cmsm_primary:2829]
00:57:42 DEBUG[TID139959615362880]:connect to servers_abc for CLUSTER ifx_abc [cmsm_server.ec:cmsm_primary:2845]
00:57:42 DEBUG[TID139959615362880]:connect to group servers_abc server ifx_a for CLUSTER ifx_abc [cmsm_server.ec:cmsm_primary:2866]
00:57:42 DEBUG[TID139959615362880]:create new thread -509528320 for ifx_a [cmsm_sla.c:cmsm_update_server_ex:996]
00:57:42 DEBUG[TID139959615362880]:connect to group servers_abc server ifx_b for CLUSTER ifx_abc [cmsm_server.ec:cmsm_primary:2866]
00:57:42 DEBUG[TID139959615362880]:create new thread -511629568 for ifx_b [cmsm_sla.c:cmsm_update_server_ex:996]
00:57:42 listener sla_ifx_abc initializing
00:57:42 DEBUG[TID139959585543936]:listenter sla_ifx_abc host=machine3.local service=15030 nettype=soctcp engtypeon [cmsm_main.c:cmsm_listener_thread:1166]
00:57:42 DEBUG[TID139959589746432]:CONNECT to @ifx_a|onsoctcp|machine1.local|15010 AS ifx_a1   SQLCODE = (0,0,) [cmsm_server.ec:cmsm_connect_byurl_opt:136]
00:57:42 DEBUG[TID139959589746432]:database sysmaster @ifx_a|onsoctcp|machine1.local|15010 AS ifx_a1   SQLCODE = (0,0,) [cmsm_server.ec:cmsm_connect_byurl_opt:145]
00:57:42 DEBUG[TID139959589746432]:ifx_a protocols = 1 [cmsm_server.ec:cmsm_add_dbaliases:2433]
00:57:42 DEBUG[TID139959589746432]:Cluster ifx_abc Arbitrator setting primary name = ifx_a [cmsm_arb.c:arb_set_primary:793]
00:57:42 Cluster ifx_abc Arbitrator FOC ORDER=ENABLED        PRIORITY=1
00:57:42 Connection Manager successfully connected to ifx_a
00:57:42 DEBUG[TID139959587645184]:CONNECT to @ifx_b|onsoctcp|machine2.local|15020 AS ifx_b1   SQLCODE = (0,0,) [cmsm_server.ec:cmsm_connect_byurl_opt:136]
00:57:42 DEBUG[TID139959587645184]:database sysmaster @ifx_b|onsoctcp|machine2.local|15020 AS ifx_b1   SQLCODE = (0,0,) [cmsm_server.ec:cmsm_connect_byurl_opt:145]
00:57:42 Listener sla_ifx_abc DBSERVERS=PRI is active with 4 worker threads
00:57:42 DEBUG[TID139959587645184]:ifx_b protocols = 1 [cmsm_server.ec:cmsm_add_dbaliases:2433]
00:57:42 Connection Manager successfully connected to ifx_b
00:57:42 DEBUG[TID139959589746432]:server ifx_a version is 12.10 [cmsm_er.ec:cmsm_er_monitor:2994]
00:57:42 DEBUG[TID139959589746432]:ACTIVE MONITOR ifx_a ifx_a 192.168.56.101 49464 [cmsm_er.ec:cmsm_er_monitor:3036]
00:57:42 DEBUG[TID139959589746432]:register CM successfully icm_1 with token 1782701540 [cmsm_server.ec:cmsm_cm_register:2336]
00:57:42 DEBUG[TID139959587645184]:server ifx_b version is 12.10 [cmsm_er.ec:cmsm_er_monitor:2994]
00:57:42 DEBUG[TID139959587645184]:ACTIVE MONITOR ifx_b ifx_b 192.168.56.101 44376 [cmsm_er.ec:cmsm_er_monitor:3036]
00:57:42 DEBUG[TID139959589746432]:Register server ifx_a for ifx_a count = 1 [cmsm_er.ec:cmsm_er_event_process:1281]
00:57:42 DEBUG[TID139959587645184]:register CM successfully icm_1 with token 1120237392 [cmsm_server.ec:cmsm_cm_register:2336]
00:57:42 DEBUG[TID139959587645184]:Register server ifx_b for ifx_b count = 1 [cmsm_er.ec:cmsm_er_event_process:1281]
00:57:43 DEBUG[TID139959615362880]:fcntl(/opt/IBM/informix/4.10/tmp/cmsm.pid.icm_1) success error:Success[0] [cmsm_main.c:cmsm_pidfile:2193]
00:57:43 Connection Manager started successfully
00:57:43 DEBUG[TID139959589746432]:SQL get ifx_a event SRV_ADM 16:5 SDS,HDR,RSS [cmsm_er.ec:cmsm_er_event_process:1883]
00:57:43 DEBUG[TID139959589746432]:Cluster ifx_abc Arbitrator reinitialized CM names [cmsm_arb.c:arb_clear_cms_list:655]
00:57:43 DEBUG[TID139959589746432]:Cluster ifx_abc Arbitrator added CM name = icm_1 [cmsm_arb.c:arb_add_cm:584]
00:57:43 CM icm_1 arbitrator for ifx_abc is active
00:57:43 Cluster ifx_abc Arbitrator FOC ORDER=SDS,HDR,RSS PRIORITY=1 TIMEOUT=1
00:57:43 DEBUG[TID139959589746432]:Cluster ifx_abc Arbitrator reinitialized CM names [cmsm_arb.c:arb_clear_cms_list:655]
00:57:43 DEBUG[TID139959589746432]:Cluster ifx_abc Arbitrator added CM name = icm_1 [cmsm_arb.c:arb_add_cm:584]
00:57:43 DEBUG[TID139959589746432]:Cluster ifx_abc Arbitrator reinitialized CM names [cmsm_arb.c:arb_clear_cms_list:655]
00:57:43 DEBUG[TID139959589746432]:Cluster ifx_abc Arbitrator added CM name = icm_1 [cmsm_arb.c:arb_add_cm:584]
00:57:43 DEBUG[TID139959589746432]:Cluster ifx_abc Arbitrator reinitialized CM names [cmsm_arb.c:arb_clear_cms_list:655]
00:57:43 DEBUG[TID139959589746432]:Cluster ifx_abc Arbitrator added CM name = icm_1 [cmsm_arb.c:arb_add_cm:584]
01:01:25 DEBUG[TID139959583442688]:SLA sla_ifx_abc 3 ifx_a time=1 latency= 0.00 readyQ=19 session=0 adjustSession=0 [cmsm_sla.c:select_server_from_sla:2947]
01:01:25 SLA sla_ifx_abc redirect SQLI client from 192.168.56.101 to ifx_a 192.168.56.102.15010
01:01:39 DEBUG[TID139959581341440]:SLA sla_ifx_abc 3 ifx_a time=0 latency= 0.00 readyQ=18 session=0 adjustSession=0 [cmsm_sla.c:select_server_from_sla:2947]
01:01:39 SLA sla_ifx_abc redirect SQLI client from 192.168.56.101 to ifx_a 192.168.56.102.15010
01:01:39 DEBUG[TID139959579240192]:SLA sla_ifx_abc 3 ifx_a time=0 latency= 0.00 readyQ=18 session=0 adjustSession=1 [cmsm_sla.c:select_server_from_sla:2947]
01:01:39 SLA sla_ifx_abc redirect SQLI client from 192.168.56.101 to ifx_a 192.168.56.102.15010
01:02:09 DEBUG[TID139959577138944]:SLA sla_ifx_abc 3 ifx_a time=6 latency= 0.00 readyQ=16 session=0 adjustSession=0 [cmsm_sla.c:select_server_from_sla:2947]
01:02:09 SLA sla_ifx_abc redirect SQLI client from 192.168.56.101 to ifx_a 192.168.56.102.15010
01:06:16 Server ifx_a no response over EVENT_TIMEOUT 60 seconds
01:06:16 Connection Manager disconnected from ifx_a
01:06:16 DEBUG[TID139959589746432]:start failover in sqlbreakcallback for ifx_a [cmsm_server.ec:cmsm_event_callback_common:1274]
01:06:16 DEBUG[TID139959575037696]:starting the failover process [cmsm_server.ec:cmsm_failover_incallback:1184]
01:06:17 Cluster ifx_abc Arbitrator detected primary server is down.
01:06:17 Cluster ifx_abc Arbitrator detected majority of secondarys are still connected.
01:06:17 Cluster ifx_abc Arbitrator will not perform failover.
01:06:17 DEBUG[TID139959575037696]:failover process returns -1 [cmsm_server.ec:cmsm_failover_incallback:1207]
01:06:39 DEBUG[TID139959587645184]:create new thread -524237056 for ifx_a [cmsm_sla.c:cmsm_update_server_ex:996]
01:06:39 DEBUG[TID139959587645184]:create new thread -526338304 for ifx_a [cmsm_sla.c:cmsm_update_server_ex:996]
01:06:39 DEBUG[TID139959575037696]:Exit server ifx_a monitor machine3.local thread status=8 [cmsm_er.ec:cmsm_er_monitor:2608]
01:07:19 DEBUG[TID139959589746432]:fetch sysrepstats_cursor SQLCODE = (-213,0,) [cmsm_er.ec:cmsm_er_event_process:1711]
01:07:19 DEBUG[TID139959589746432]:unregister event failed, sqlcode = (-1811,0,) [cmsm_server.ec:cmsm_cm_unregister:2383]
01:07:19 DEBUG[TID139959589746432]:Unregister server ifx_a for ifx_a count = 0 [cmsm_er.ec:cmsm_er_event_process:2092]
01:07:19 Connection Manager disconnected from ifx_a
01:07:19 ALARM 3002 detected lost connection to Informix server ifx_a from machine3.local
01:07:19 DEBUG[TID139959589746432]:Connection Manager disconnected from ifx_a sqlcode=-1803 [cmsm_er.ec:cmsm_er_monitor:3090]
01:07:19 Cluster ifx_abc Arbitrator detected primary server is down.
01:07:19 Cluster ifx_abc Arbitrator detected majority of secondarys are still connected.
01:07:19 Cluster ifx_abc Arbitrator will not perform failover.
01:07:19 Cluster ifx_abc Arbitrator detected primary server is down.
01:07:19 Cluster ifx_abc Arbitrator detected majority of secondarys are still connected.
01:07:19 Cluster ifx_abc Arbitrator will not perform failover.
01:07:20 Cluster ifx_abc Arbitrator detected primary server is down.
01:07:20 Cluster ifx_abc Arbitrator detected majority of secondarys are still connected.
01:07:20 Cluster ifx_abc Arbitrator will not perform failover.
01:07:21 DEBUG[TID139959572936448]:CONNECT to @ifx_a|onsoctcp|machine1.local|15010 AS ifx_a2   SQLCODE = (-908,107,ifx_a) [cmsm_server.ec:cmsm_connect_byurl_opt:136]
01:07:21 ALARM 3001 unable to connect to Informix server ifx_a from machine3.local error -908
01:07:21 Cluster ifx_abc Arbitrator detected primary server is down.
01:07:21 Cluster ifx_abc Arbitrator detected majority of secondarys are still connected.
01:07:21 Cluster ifx_abc Arbitrator will not perform failover.
01:07:22 DEBUG[TID139959572936448]:Exit server ifx_a monitor machine3.local thread status=8 [cmsm_er.ec:cmsm_er_monitor:2608]
01:07:24 DEBUG[TID139959589746432]:CONNECT to @ifx_a|onsoctcp|machine1.local|15010 AS ifx_a3   SQLCODE = (-908,107,ifx_a) [cmsm_server.ec:cmsm_connect_byurl_opt:136]
01:07:24 Cluster ifx_abc Arbitrator detected primary server is down.
01:07:24 Cluster ifx_abc Arbitrator detected majority of secondarys are still connected.
01:07:24 Cluster ifx_abc Arbitrator will not perform failover.
01:07:27 DEBUG[TID139959589746432]:CONNECT to @ifx_a|onsoctcp|machine1.local|15010 AS ifx_a4   SQLCODE = (-908,107,ifx_a) [cmsm_server.ec:cmsm_connect_byurl_opt:136]
01:07:27 Cluster ifx_abc Arbitrator detected primary server is down.
01:07:27 Cluster ifx_abc Arbitrator detected majority of secondarys are still connected.
01:07:27 Cluster ifx_abc Arbitrator will not perform failover.
01:07:30 DEBUG[TID139959587645184]:get ifx_b CLUST_CHG event 1:6[DROFF] ifx_a [cmsm_er.ec:cmsm_er_event_process:1734]
01:07:30 DEBUG[TID139959589746432]:CONNECT to @ifx_a|onsoctcp|machine1.local|15010 AS ifx_a5   SQLCODE = (-908,107,ifx_a) [cmsm_server.ec:cmsm_connect_byurl_opt:136]
01:07:30 DEBUG[TID139959589746432]:Cluster ifx_abc Arbitrator start failover logic now... [cmsm_arb.c:arb_server_down:1564]
01:07:30 ALARM 2001 failover arbitrator automated failover in progress.
01:07:30 DEBUG[TID139959589746432]:Cluster ifx_abc Arbitrator try failover on generic SDS [cmsm_arb.c:arb_failover_by_foc:1174]
01:07:30 DEBUG[TID139959589746432]:Cluster ifx_abc Arbitrator try failover on generic HDR [cmsm_arb.c:arb_failover_by_foc:1200]
01:07:30 DEBUG[TID139959589746432]:CLuster ifx_abc Arbitrator failover to ifx_b, waiting 150 seconds for confirmation [cmsm_arb.c:arb_post_wait:1764]
01:07:30 DEBUG[TID139959589746432]:Arbitrator found failover node = ifx_b [cmsm_server_arb.ec:arb_failover_to:303]
01:07:37 DEBUG[TID139959589746432]:CONNECT to @ifx_b|onsoctcp|machine2.local|15020 AS ifx_b2   SQLCODE = (0,0,) [cmsm_server.ec:cmsm_connect_byurl_opt:136]
01:07:37 DEBUG[TID139959589746432]:database sysmaster @ifx_b|onsoctcp|machine2.local|15020 AS ifx_b2   SQLCODE = (0,0,) [cmsm_server.ec:cmsm_connect_byurl_opt:145]
01:07:37 DEBUG[TID139959589746432]:Arbitrator connected to failover node = ifx_b [cmsm_server_arb.ec:arb_failover_to:316]
01:07:37 DEBUG[TID139959589746432]:Arbitrator get HA_ALIAS on ifx_b, HA_ALIAS [ifx_b] [cmsm_server_arb.ec:arb_failover_to:326]
01:07:50 DEBUG[TID139959587645184]:SQL get ifx_b event SRV_ADM 16:3 1 [cmsm_er.ec:cmsm_er_event_process:1883]
01:07:50 Server ifx_b is in quiescent mode.
01:07:50 DEBUG[TID139959587645184]:get ifx_b CLUST_CHG event 1:6[DROFF] ifx_a [cmsm_er.ec:cmsm_er_event_process:1734]
01:07:50 DEBUG[TID139959587645184]:SQL get ifx_b event SRV_ADM 16:3 5 [cmsm_er.ec:cmsm_er_event_process:1883]
01:07:50 The server type of cluster ifx_abc server ifx_b is Primary.
01:07:50 DEBUG[TID139959587645184]:Cluster ifx_abc Arbitrator setting primary name = ifx_b [cmsm_arb.c:arb_set_primary:793]
01:07:50 Cluster ifx_abc Arbitrator FOC ORDER=SDS,HDR,RSS PRIORITY=1 TIMEOUT=1
01:07:50 Server ifx_b is in on-line mode.
01:07:50 DEBUG[TID139959587645184]:get ifx_b CLUST_CHG event 1:5[NEWPRI] ifx_b [cmsm_er.ec:cmsm_er_event_process:1734]
01:07:50 Arbitrator make primary on node = ifx_b successful
01:07:50 ALARM 2002 failover arbitrator automated failover completed
01:07:50 DEBUG[TID139959589746432]:Monitor ifx_a exit by arbitrator  [cmsm_er.ec:cmsm_er_monitor:2789]
01:07:50 DEBUG[TID139959589746432]:all monitors exited, server ifx_a register count = 0 [cmsm_er.ec:cmsm_er_monitor:3156]
01:07:50 DEBUG[TID139959589746432]:enqueue a cleanup task for ifx_a [cmsm_er.ec:cmsm_enqueu_cleanup:2378]
01:07:50 DEBUG[TID139959587645184]:server monitor ifx_b dequeue a cleanup task  [cmsm_er.ec:cmsm_er_event_process:2060]
01:07:50 DEBUG[TID139959587645184]:SQL get ifx_b event SRV_ADM 16:5 SDS,HDR,RSS [cmsm_er.ec:cmsm_er_event_process:1883]
01:07:50 DEBUG[TID139959587645184]:Cluster ifx_abc Arbitrator reinitialized CM names [cmsm_arb.c:arb_clear_cms_list:655]
01:07:50 DEBUG[TID139959587645184]:Cluster ifx_abc Arbitrator added CM name = icm_1 [cmsm_arb.c:arb_add_cm:584]
01:07:57 DEBUG[TID139959587645184]:get ifx_b CLUST_CHG event 1:6[DROFF] ifx_a [cmsm_er.ec:cmsm_er_event_process:1734]
01:08:13 DEBUG[TID139959587645184]:get ifx_b CLUST_CHG event 1:6[DROFF] ifx_a [cmsm_er.ec:cmsm_er_event_process:1734]
01:09:19 DEBUG[TID139959587645184]:get ifx_b CLUST_CHG event 1:6[DROFF] ifx_a [cmsm_er.ec:cmsm_er_event_process:1734]
01:10:17 DEBUG[TID139959587645184]:get ifx_b CLUST_CHG event 1:6[DROFF] ifx_a [cmsm_er.ec:cmsm_er_event_process:1734]
01:11:22 DEBUG[TID139959587645184]:get ifx_b CLUST_CHG event 1:6[DROFF] ifx_a [cmsm_er.ec:cmsm_er_event_process:1734]
01:12:20 DEBUG[TID139959587645184]:get ifx_b CLUST_CHG event 1:6[DROFF] ifx_a [cmsm_er.ec:cmsm_er_event_process:1734]
01:13:17 DEBUG[TID139959587645184]:get ifx_b CLUST_CHG event 1:6[DROFF] ifx_a [cmsm_er.ec:cmsm_er_event_process:1734]

So, there are differences, since it seems you had an AF (Assertion Failure) during the promotion of the secondary.

Could you post more of your logs? From right before pulling down the primary until the secondary gets stuck in recovery.

score 1 · Answer 2 · answered Nov 20 '18 at 23:34

Primary server assigns "active arbitrator" role to connection manager(CM) as there can be multiple connection managers. If primary server and "active arbitrator" CM gets killed at the same time then after restart, CM cannot promote HDR to primary since as CM isn't assigned the active arbitrator role. Informix do not support auto recovery from multi-point failure. CM needs to be run on a hardware different from primary server hardware.

score 1 · Answer 3 · answered Jan 31 '19 at 10:48

You need at least one Connection Manager up and running for a failover to happen since the role of the Connection Manager (beside others) is to detect a node failure and do a failover automatically in order to ensure the database high-availibility. If you have only one Connection Manager in front of the database cluster, after shutting it down, automatic failover won't happen so you will have to do

onmode -d make primary force

on secondary server in order to promote it to primary. If you have multiple properly configured connection managers, the failover is assured.

Informix HDR failover doesn't work when primary DB server and Connection Manager/ICM are terminated simultaneously then ICM is brought back online

3 Answers3