0

I have 2 server Asterisk on centOS in HA group with peacemaker.The cluster is in active passive mode with configured resource groups.The problem is from some time the server switch from the primary to secondary with no obvious reason for me.The log begins with:

PBX1 attrd[2503]:   notice: attrd_perform_update: Sent update 327: pingd=0

followed with stopping of the resource groups and

Oct  3 19:30:39 PBX1 Filesystem(DrbdFS)[14168]: INFO: Trying to unmount /drbd
Oct  3 19:30:39 PBX1 Filesystem(DrbdFS)[14168]: INFO: unmounted /drbd successfully
Oct  3 19:30:39 PBX1 crmd[2505]:   notice: process_lrm_event: LRM operation DrbdFS_stop_0 (call=2046, rc=0, cib-update=2735, confirmed=true) ok
Oct  3 19:30:39 PBX1 crmd[2505]:   notice: te_rsc_command: Initiating action 32: demote drbd_drbd0_demote_0 on pbx1(local)
Oct  3 19:30:39 PBX1 kernel: block drbd0: role( Primary -> Secondary ) 
Oct  3 19:30:39 PBX1 kernel: block drbd0: bitmap WRITE of 0 pages took 0 jiffies
Oct  3 19:30:39 PBX1 kernel: block drbd0: 0 KB (0 bits) marked out-of-sync by on disk bit-map.
Oct  3 19:30:39 PBX1 crmd[2505]:   notice: process_lrm_event: LRM operation drbd_drbd0_demote_0 (call=2050, rc=0, cib-update=2736, confirmed=true) ok
Oct  3 19:30:39 PBX1 crmd[2505]:   notice: te_rsc_command: Initiating action 159: notify drbd_drbd0_post_notify_demote_0 on pbx2
Oct  3 19:30:39 PBX1 crmd[2505]:   notice: te_rsc_command: Initiating action 161: notify drbd_drbd0_post_notify_demote_0 on pbx1 (local)
Oct  3 19:30:39 PBX1 crmd[2505]:   notice: process_lrm_event: LRM operation drbd_drbd0_notify_0 (call=2053, rc=0, cib-update=0, confirmed=true) ok
Oct  3 19:30:39 PBX1 crmd[2505]:   notice: te_rsc_command: Initiating action 156: notify drbd_drbd0_pre_notify_stop_0 on pbx2
Oct  3 19:30:39 PBX1 crmd[2505]:   notice: te_rsc_command: Initiating action 157: notify drbd_drbd0_pre_notify_stop_0 on pbx1 (local)
Oct  3 19:30:39 PBX1 crmd[2505]:   notice: process_lrm_event: LRM operation drbd_drbd0_notify_0 (call=2056, rc=0, cib-update=0, confirmed=true) ok
Oct  3 19:30:39 PBX1 crmd[2505]:   notice: te_rsc_command: Initiating action 31: stop drbd_drbd0_stop_0 on pbx2
Oct  3 19:30:39 PBX1 crmd[2505]:   notice: te_rsc_command: Initiating action 33: stop drbd_drbd0_stop_0 on pbx1 (local)
Oct  3 19:30:39 PBX1 kernel: block drbd0: peer( Secondary -> Unknown ) conn( Connected -> Disconnecting ) pdsk( UpToDate -> DUnknown ) 
Oct  3 19:30:39 PBX1 kernel: block drbd0: asender terminated
Oct  3 19:30:39 PBX1 kernel: block drbd0: Terminating drbd0_asender
Oct  3 19:30:39 PBX1 kernel: block drbd0: Connection closed
Oct  3 19:30:39 PBX1 kernel: block drbd0: conn( Disconnecting -> StandAlone ) 
Oct  3 19:30:39 PBX1 kernel: block drbd0: receiver terminated
Oct  3 19:30:39 PBX1 kernel: block drbd0: Terminating drbd0_receiver
Oct  3 19:30:39 PBX1 kernel: block drbd0: disk( UpToDate -> Failed ) 
Oct  3 19:30:39 PBX1 kernel: block drbd0: bitmap WRITE of 0 pages took 0 jiffies
Oct  3 19:30:39 PBX1 kernel: block drbd0: 0 KB (0 bits) marked out-of-sync by on disk bit-map.
Oct  3 19:30:39 PBX1 kernel: block drbd0: disk( Failed -> Diskless ) 
Oct  3 19:30:39 PBX1 kernel: block drbd0: drbd_bm_resize called with capacity == 0
Oct  3 19:30:39 PBX1 kernel: block drbd0: worker terminated
Oct  3 19:30:39 PBX1 kernel: block drbd0: Terminating drbd0_worker
Oct  3 19:30:39 PBX1 attrd[2503]:   notice: attrd_trigger_update: Sending flush op to all hosts for: master-drbd_drbd0 (<null>)
Oct  3 19:30:39 PBX1 attrd[2503]:   notice: attrd_perform_update: Sent delete 329: node=pbx1, attr=master-drbd_drbd0, id=<n/a>, set=(null), section=status
Oct  3 19:30:39 PBX1 crmd[2505]:   notice: process_lrm_event: LRM operation drbd_drbd0_stop_0 (call=2059, rc=0, cib-update=2737, confirmed=true) ok
Oct  3 19:30:39 PBX1 crmd[2505]:   notice: run_graph: Transition 2112 (Complete=63, Pending=0, Fired=0, Skipped=1, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-3294.bz2): Stopped
Oct  3 19:30:39 PBX1 pengine[2504]:   notice: unpack_config: On loss of CCM Quorum: Ignore
Oct  3 19:30:39 PBX1 pengine[2504]:   notice: process_pe_message: Calculated Transition 2113: /var/lib/pacemaker/pengine/pe-input-3295.bz2
Oct  3 19:30:39 PBX1 crmd[2505]:   notice: run_graph: Transition 2113 (Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-3295.bz2): Complete
Oct  3 19:30:39 PBX1 crmd[2505]:   notice: do_state_transition: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]
Oct  3 19:30:41 PBX1 attrd[2503]:   notice: attrd_perform_update: Sent update 333: pingd=1000
Oct  3 19:30:41 PBX1 crmd[2505]:   notice: do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph ]
Oct  3 19:30:41 PBX1 pengine[2504]:   notice: unpack_config: On loss of CCM Quorum: Ignore
Oct  3 19:30:41 PBX1 pengine[2504]:   notice: LogActions: Start   drbd_drbd0:0#011(pbx2)
Oct  3 19:30:41 PBX1 pengine[2504]:   notice: process_pe_message: Calculated Transition 2114: /var/lib/pacemaker/pengine/pe-input-3296.bz2
Oct  3 19:30:41 PBX1 pengine[2504]:   notice: unpack_config: On loss of CCM Quorum: Ignore
Oct  3 19:30:41 PBX1 pengine[2504]:   notice: LogActions: Start   drbd_drbd0:0#011(pbx1)
Oct  3 19:30:41 PBX1 pengine[2504]:   notice: LogActions: Start   drbd_drbd0:1#011(pbx2)
Oct  3 19:30:41 PBX1 pengine[2504]:   notice: process_pe_message: Calculated Transition 2115: /var/lib/pacemaker/pengine/pe-input-3297.bz2
Oct  3 19:30:41 PBX1 crmd[2505]:   notice: te_rsc_command: Initiating action 7: start drbd_drbd0_start_0 on pbx1 (local)
Oct  3 19:30:41 PBX1 crmd[2505]:   notice: te_rsc_command: Initiating action 9: start drbd_drbd0:1_start_0 on pbx2
Oct  3 19:30:41 PBX1 kernel: block drbd0: Starting worker thread (from cqueue [2046])
Oct  3 19:30:41 PBX1 kernel: block drbd0: disk( Diskless -> Attaching ) 
Oct  3 19:30:41 PBX1 kernel: block drbd0: Found 4 transactions (192 active extents) in activity log.
Oct  3 19:30:41 PBX1 kernel: block drbd0: Method to ensure write ordering: flush
Oct  3 19:30:41 PBX1 kernel: block drbd0: drbd_bm_resize called with capacity == 28411864
Oct  3 19:30:41 PBX1 kernel: block drbd0: resync bitmap: bits=3551483 words=55492 pages=109
Oct  3 19:30:41 PBX1 kernel: block drbd0: size = 14 GB (14205932 KB)
Oct  3 19:30:41 PBX1 kernel: block drbd0: bitmap READ of 109 pages took 13 jiffies
Oct  3 19:30:41 PBX1 kernel: block drbd0: recounting of set bits took additional 1 jiffies
Oct  3 19:30:41 PBX1 kernel: block drbd0: 0 KB (0 bits) marked out-of-sync by on disk bit-map.
Oct  3 19:30:41 PBX1 kernel: block drbd0: disk( Attaching -> UpToDate ) 
Oct  3 19:30:41 PBX1 kernel: block drbd0: attached to UUIDs 760BE94B0D6CDF8A:0000000000000000:819257FD75678FC9:819157FD75678FC9
Oct  3 19:30:41 PBX1 kernel: block drbd0: conn( StandAlone -> Unconnected ) 
Oct  3 19:30:41 PBX1 kernel: block drbd0: Starting receiver thread (from drbd0_worker [14408])
Oct  3 19:30:41 PBX1 kernel: block drbd0: receiver (re)started
Oct  3 19:30:41 PBX1 kernel: block drbd0: conn( Unconnected -> WFConnection ) 
Oct  3 19:30:41 PBX1 attrd[2503]:   notice: attrd_trigger_update: Sending flush op to all hosts for: master-drbd_drbd0 (1000)
Oct  3 19:30:41 PBX1 attrd[2503]:   notice: attrd_perform_update: Sent update 336: master-drbd_drbd0=1000
Oct  3 19:30:41 PBX1 attrd[2503]:   notice: attrd_perform_update: Sent update 338: master-drbd_drbd0=1000
Oct  3 19:30:41 PBX1 crmd[2505]:   notice: process_lrm_event: LRM operation drbd_drbd0_start_0 (call=2062, rc=0, cib-update=2741, confirmed=true) ok
Oct  3 19:30:41 PBX1 crmd[2505]:   notice: te_rsc_command: Initiating action 106: notify drbd_drbd0_post_notify_start_0 on pbx1 (local)
Oct  3 19:30:41 PBX1 crmd[2505]:   notice: te_rsc_command: Initiating action 107: notify drbd_drbd0:1_post_notify_start_0 on pbx2
Oct  3 19:30:41 PBX1 crmd[2505]:   notice: process_lrm_event: LRM operation drbd_drbd0_notify_0 (call=2065, rc=0, cib-update=0, confirmed=true) ok
Oct  3 19:30:41 PBX1 crmd[2505]:   notice: run_graph: Transition 2115 (Complete=10, Pending=0, Fired=0, Skipped=2, Incompl  

the part for ping from pcs config is

     Attributes: drbd_resource=drbd0
   Operations: monitor interval=10s (drbd_drbd0-monitor-interval-10s)
 Clone: Connectivity
  Resource: p_ping (class=ocf provider=pacemaker type=ping)
   Attributes: host_list="10.66.9.4 10.66.9.5 10.66.9.11 10.66.9.252 10.66.9.253 10.66.9.254" multiplier=1000 dampen=5s
   Operations: monitor interval=1s (p_ping-monitor-interval-1s)

Important info is that only 10.66.9.11 is alive (i'am planning to clear the invalid addr. but this is other story) from those addresses because of the redesign witch took place before an year ago but the problem is seen from month or two max.Could it be from that or please say if you have any other idea. P.S Cables are good and their is no loss of icmp to 10.66.9.11 interface according to cacti witch configured for monitoring

Oct  3 19:18:31 PBX1 asterisk(asterisk)[1338]: INFO: 0 active channels 0 active calls 5999 calls processed
Oct  3 19:19:01 PBX1 asterisk(asterisk)[1960]: INFO: 0 active channels 0 active calls 5999 calls processed
Oct  3 19:19:14 PBX1 Xinetd(Xinetd)[2200]: INFO: checking "disable" in /etc/xinetd.d/tftp
Oct  3 19:19:32 PBX1 asterisk(asterisk)[2347]: INFO: 0 active channels 0 active calls 5999 calls processed
Oct  3 19:20:02 PBX1 asterisk(asterisk)[2994]: INFO: 0 active channels 0 active calls 5999 calls processed
Oct  3 19:20:28 PBX1 dhcpd: isc-dhcpd-4.1.1-P1
Oct  3 19:20:32 PBX1 asterisk(asterisk)[3369]: INFO: 0 active channels 0 active calls 5999 calls processed
Oct  3 19:21:02 PBX1 asterisk(asterisk)[3970]: INFO: 0 active channels 0 active calls 5999 calls processed
Oct  3 19:21:08 PBX1 crmd[2505]:   notice: do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_TIMER_POPPED origin=crm_timer_popped ]
Oct  3 19:21:08 PBX1 pengine[2504]:   notice: unpack_config: On loss of CCM Quorum: Ignore
Oct  3 19:21:08 PBX1 pengine[2504]:   notice: process_pe_message: Calculated Transition 2110: /var/lib/pacemaker/pengine/pe-input-3292.bz2
Oct  3 19:21:08 PBX1 crmd[2505]:   notice: run_graph: Transition 2110 (Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-3292.bz2): Complete
Oct  3 19:21:08 PBX1 crmd[2505]:   notice: do_state_transition: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]
Oct  3 19:21:14 PBX1 Xinetd(Xinetd)[4176]: INFO: checking "disable" in /etc/xinetd.d/tftp
Oct  3 19:21:32 PBX1 asterisk(asterisk)[4318]: INFO: 0 active channels 0 active calls 5999 calls processed
Oct  3 19:22:02 PBX1 asterisk(asterisk)[4989]: INFO: 0 active channels 0 active calls 5999 calls processed
Oct  3 19:22:28 PBX1 dhcpd: isc-dhcpd-4.1.1-P1
Oct  3 19:22:32 PBX1 asterisk(asterisk)[5352]: INFO: 0 active channels 0 active calls 5999 calls processed
Oct  3 19:23:02 PBX1 asterisk(asterisk)[5939]: INFO: 0 active channels 0 active calls 5999 calls processed
Oct  3 19:23:14 PBX1 Xinetd(Xinetd)[6160]: INFO: checking "disable" in /etc/xinetd.d/tftp
Oct  3 19:23:32 PBX1 asterisk(asterisk)[6302]: INFO: 0 active channels 0 active calls 5999 calls processed
Oct  3 19:24:02 PBX1 asterisk(asterisk)[6933]: INFO: 0 active channels 0 active calls 5999 calls processed
Oct  3 19:24:28 PBX1 dhcpd: isc-dhcpd-4.1.1-P1
Oct  3 19:24:32 PBX1 asterisk(asterisk)[7276]: INFO: 0 active channels 0 active calls 5999 calls processed
Oct  3 19:25:02 PBX1 asterisk(asterisk)[7895]: INFO: 0 active channels 0 active calls 5999 calls processed
Oct  3 19:25:14 PBX1 Xinetd(Xinetd)[8118]: INFO: checking "disable" in /etc/xinetd.d/tftp
Oct  3 19:25:32 PBX1 asterisk(asterisk)[8259]: INFO: 0 active channels 0 active calls 5999 calls processed
Oct  3 19:25:37 PBX1 named[27202]: error (network unreachable) resolving 'ns.isc.afilias-nst.info/A/IN': 2a01:8840:6::1#53
Oct  3 19:25:37 PBX1 named[27202]: error (network unreachable) resolving 'ns.isc.afilias-nst.info/AAAA/IN': 2a01:8840:6::1#53
Oct  3 19:25:37 PBX1 named[27202]: error (network unreachable) resolving 'ns1.isc.ultradns.net/A/IN': 2001:dcd:4::2#53
Oct  3 19:25:37 PBX1 named[27202]: error (network unreachable) resolving 'ns2.isc.ultradns.net/A/IN': 2001:dcd:4::2#53
Oct  3 19:25:37 PBX1 named[27202]: error (network unreachable) resolving 'ns1.isc.ultradns.net/AAAA/IN': 2001:dcd:4::2#53
Oct  3 19:25:37 PBX1 named[27202]: error (network unreachable) resolving 'ns2.isc.ultradns.net/AAAA/IN': 2001:dcd:4::2#53
Oct  3 19:26:02 PBX1 asterisk(asterisk)[8873]: INFO: 0 active channels 0 active calls 5999 calls processed
Oct  3 19:26:28 PBX1 dhcpd: isc-dhcpd-4.1.1-P1
Oct  3 19:26:32 PBX1 asterisk(asterisk)[9233]: INFO: 0 active channels 0 active calls 5999 calls processed
Oct  3 19:27:02 PBX1 asterisk(asterisk)[9841]: INFO: 0 active channels 0 active calls 5999 calls processed
Oct  3 19:27:14 PBX1 Xinetd(Xinetd)[10088]: INFO: checking "disable" in /etc/xinetd.d/tftp
Oct  3 19:27:32 PBX1 asterisk(asterisk)[10230]: INFO: 0 active channels 0 active calls 5999 calls processed
Oct  3 19:28:02 PBX1 asterisk(asterisk)[10862]: INFO: 0 active channels 0 active calls 5999 calls processed
Oct  3 19:28:28 PBX1 dhcpd: isc-dhcpd-4.1.1-P1
Oct  3 19:28:32 PBX1 asterisk(asterisk)[11224]: INFO: 0 active channels 0 active calls 5999 calls processed
Oct  3 19:29:03 PBX1 asterisk(asterisk)[11810]: INFO: 0 active channels 0 active calls 5999 calls processed
Oct  3 19:29:14 PBX1 Xinetd(Xinetd)[12031]: INFO: checking "disable" in /etc/xinetd.d/tftp
Oct  3 19:29:33 PBX1 asterisk(asterisk)[12174]: INFO: 0 active channels 0 active calls 5999 calls processed
Oct  3 19:30:03 PBX1 asterisk(asterisk)[12818]: INFO: 0 active channels 0 active calls 5999 calls processed
Oct  3 19:30:23 PBX1 attrd[2503]:   notice: attrd_perform_update: Sent update 327: pingd=0
Oct  3 19:30:23 PBX1 crmd[2505]:   notice: do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph ]
Oct  3 19:30:23 PBX1 pengine[2504]:   notice: unpack_config: On loss of CCM Quorum: Ignore
Oct  3 19:30:23 PBX1 pengine[2504]:   notice: LogActions: Stop    drbd_drbd0:0#011(pbx2..bg)
Oct  3 19:30:23 PBX1 pengine[2504]:   notice: process_pe_message: Calculated Transition 2111: /var/lib/pacemaker/pengine/pe-input-3293.bz2
Oct  3 19:30:23 PBX1 pengine[2504]:   notice: unpack_config: On loss of CCM Quorum: Ignore
Oct  3 19:30:23 PBX1 pengine[2504]:   notice: LogActions: Stop    drbd_drbd0:0#011(pbx2..bg)
Oct  3 19:30:23 PBX1 pengine[2504]:   notice: LogActions: Demote  drbd_drbd0:1#011(Master -> Stopped pbx1)
Oct  3 19:30:23 PBX1 pengine[2504]:   notice: LogActions: Stop    ClusPBXIP#011(pbx1)
Oct  3 19:30:23 PBX1 pengine[2504]:   notice: LogActions: Stop    ClusNetIP#011(pbx1)
Oct  3 19:30:23 PBX1 pengine[2504]:   notice: LogActions: Stop    ClusPBXIP-R#011(pbx1)
Oct  3 19:30:23 PBX1 pengine[2504]:   notice: LogActions: Stop    ClusPBXIP-R2#011(pbx1)
Oct  3 19:30:23 PBX1 pengine[2504]:   notice: LogActions: Stop    DSKPBXIP-R#011(pbx1)
Oct  3 19:30:23 PBX1 pengine[2504]:   notice: LogActions: Stop    Fixr_ClusPBXIP#011(pbx1)
Oct  3 19:30:23 PBX1 pengine[2504]:   notice: LogActions: Stop    DSKPBXIP-R5#011(pbx1)
Oct  3 19:30:23 PBX1 pengine[2504]:   notice: LogActions: Stop    DSKPBXIP-R6#011(pbx1)
Oct  3 19:30:23 PBX1 pengine[2504]:   notice: LogActions: Stop    DSKPBXIP-R7#011(pbx1)
  • 1
    I suspect you'll need logs going back a little bit earlier to really determine what is going on. However, the logs you have are nearly unreadable because of the line wrapping. Could you get logs which aren't wrapped? Additionally, try grepping for 'error' in the logs (preferably ignore case i.e. `grep -i`). If you see no errors you may want to try feeding the pe-input files to crm_simulate to see the scores from the ping resource. For example `crm_simulate -Ss -x /var/lib/pacemaker/pengine/pe-input-3296.bz2`. – Dok Oct 04 '17 at 15:21
  • Hallo. There are no errors or warning (as i followed the basic troubleshooting guide for ha in internet).Here is the messages log sometimes before the event of switch. – aleksei stukov Oct 04 '17 at 17:38
  • Added in the body of the problem.I'll clear the formatting of the previous log. – aleksei stukov Oct 04 '17 at 17:45

1 Answers1

1

The PBX1 attrd[2503]: notice: attrd_perform_update: Sent update 327: pingd=0 kind of gives it away. Here the pingd score was 0 (no pings). and then it shows it stops all the resources (Oct 3 19:30:23).

Later we see the pingd score hit 1000, and services then again restart: PBX1 attrd[2503]: notice: attrd_perform_update: Sent update 333: pingd=1000 (Oct 3 19:30:41).

I don't have the full Pacemaker CIB, but I suspect you have some location/colocation rule saying that the resources cannot run where the pingd score equals 0. Thus when your pingd score hits zero resources stop cluster wide, then restart when the pings again return. While it is possible that the resources might start on the peer node, I don't believe this is an issue of failover, but rather services stopping and restarting due to the pingd resource and rules.

While you do mention that there is no issues with pings in cacti:

P.S Cables are good and their is no loss of icmp to 10.66.9.11 interface according to cacti witch configured for monitoring

How often is cacti monitoring this? Pacemaker will react if there is simply 5 seconds of pings not returning. Additionally, the cacti data is probably only pings from the cacti server to the 10.66.9.11 server. That doesn't necessarily mean there isn't an issue between the PBX cluster and the 10.66.9.11 system.

I would suggest either adding more ping nodes to the host_list, increasing the p_ping's monitor interval and dampen times, or just doing away with the p_ping resource and rules entirely.

Dok
  • 1,158
  • 1
  • 7
  • 13
  • 10x i'll add more nodes to ping and clear the old ones.Any sugestion witch is the best way to edit the xml because i know never to open it directly? i think using this way: `cibadmin --query > tmp.xml vi tmp.xml cibadmin --replace --xml-file tmp.xml` do you think its the best way? – aleksei stukov Oct 05 '17 at 07:27