[ClusterLabs] Virtual ip resource restarted on node with down network device

Mon Sep 19 10:18:52 UTC 2016

OK, after reading the log files again I found:

Sep 19 10:03:45 MDA1PFP-S01 crmd[7797]:  notice: Initiating action 3: stop mda-ip_stop_0 on MDA1PFP-PCS01 (local)
Sep 19 10:03:45 MDA1PFP-S01 crmd[7797]:  notice: MDA1PFP-PCS01-mda-ip_monitor_1000:14 [ ocf-exit-reason:Unknown interface [bond0] No such device.\n ]
Sep 19 10:03:45 MDA1PFP-S01 IPaddr2(mda-ip)[8745]: ERROR: Unknown interface [bond0] No such device.
Sep 19 10:03:45 MDA1PFP-S01 IPaddr2(mda-ip)[8745]: WARNING: [findif] failed
Sep 19 10:03:45 MDA1PFP-S01 lrmd[7794]:  notice: mda-ip_stop_0:8745:stderr [ ocf-exit-reason:Unknown interface [bond0] No such device. ]
Sep 19 10:03:45 MDA1PFP-S01 crmd[7797]:  notice: Operation mda-ip_stop_0: ok (node=MDA1PFP-PCS01, call=16, rc=0, cib-update=49, confirmed=true)
Sep 19 10:03:46 MDA1PFP-S01 crmd[7797]:  notice: Transition 3 (Complete=2, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-501.bz2): Complete
Sep 19 10:03:46 MDA1PFP-S01 crmd[7797]:  notice: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]
Sep 19 10:03:46 MDA1PFP-S01 crmd[7797]:  notice: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph ]
Sep 19 10:03:46 MDA1PFP-S01 pengine[7796]:  notice: On loss of CCM Quorum: Ignore
Sep 19 10:03:46 MDA1PFP-S01 pengine[7796]: warning: Processing failed op monitor for mda-ip on MDA1PFP-PCS01: not configured (6)
Sep 19 10:03:46 MDA1PFP-S01 pengine[7796]:   error: Preventing mda-ip from re-starting anywhere: operation monitor failed 'not configured' (6)

I think that explains why the resource is not started on the other node, but I am not sure this is a good decision. Preventing the resource from starting anywhere seems harsh, especially since the other node would be able to start it.
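
For reference, as far as I understand the Pacemaker documentation, resource agent exit codes are grouped into soft, hard and fatal failures, and rc=6 ('not configured', OCF_ERR_CONFIGURED) falls into the fatal group. That would match the "Preventing mda-ip from re-starting anywhere" message. Below is a minimal sketch of that grouping. The function name ocf_recovery_type is made up for illustration and is not part of Pacemaker or pcs, and the grouping of codes 2 to 5 is my reading of the documentation rather than something visible in these logs.

```shell
#!/bin/sh
# Hypothetical helper (not a Pacemaker tool): map an OCF resource agent exit
# code to the recovery class that the Pacemaker documentation describes.
ocf_recovery_type() {
    case "$1" in
        0)       echo "success" ;;      # OCF_SUCCESS
        1)       echo "soft" ;;         # OCF_ERR_GENERIC: restart, possibly on another node
        2|3|4|5) echo "hard" ;;         # ERR_ARGS .. ERR_INSTALLED: move away from this node only
        6)       echo "fatal" ;;        # OCF_ERR_CONFIGURED: stop, do not start on any node
        7)       echo "not-running" ;;  # OCF_NOT_RUNNING: cleanly stopped, not a failure class
        *)       echo "other" ;;        # codes not covered by this sketch
    esac
}

# The monitor failure in the log above reported rc=6.
ocf_recovery_type 6
```

If this reading is right, an agent returning a "hard" code for a missing interface would let the resource move to the other node, whereas the "fatal" code seen here stops it everywhere.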

Cheers,
  Jens
--
Jens Auer | CGI | Software-Engineer
CGI (Germany) GmbH & Co. KG
Rheinstraße 95 | 64295 Darmstadt | Germany
T: +49 6151 36860 154
jens.auer at cgi.com
Our mandatory disclosures pursuant to § 35a GmbHG / §§ 161, 125a HGB can be found at de.cgi.com/pflichtangaben.

CONFIDENTIALITY NOTICE: Proprietary/Confidential information belonging to CGI Group Inc. and its affiliates may be contained in this message. If you are not a recipient indicated or intended in this message (or responsible for delivery of this message to such person), or you think for any reason that this message may have been addressed to you in error, you may not use or copy or deliver this message to anyone else. In such case, you should destroy this message and are asked to notify the sender by reply e-mail.

________________________________________
From: Auer, Jens
Sent: Monday, 19 September 2016 12:08
To: Cluster Labs - All topics related to open-source clustering welcomed
Subject: RE: [ClusterLabs] Virtual ip resource restarted on node with down network device

Hi,

> Would "rmmod <interface-driver-module>" be a better hammer of choice?

I am just testing what happens in case of hardware/network issues. Any hammer is good enough. Worst case would be that I unplug the machine, maybe with ILO.

I have created a simple test setup: a two-node cluster with a virtual IP and a ping resource. The virtual IP should move to the other node when I unload the drivers on the active node. The configuration is:
MDA1PFP-S02 10:02:53 1203 0 ~ # pcs cluster setup --name MDA1PFP MDA1PFP-PCS01,MDA1PFP-S01 MDA1PFP-PCS02,MDA1PFP-S02
Shutting down pacemaker/corosync services...
Redirecting to /bin/systemctl stop  pacemaker.service
Redirecting to /bin/systemctl stop  corosync.service
Killing any remaining services...
Removing all cluster configuration files...
MDA1PFP-PCS01: Succeeded
MDA1PFP-PCS02: Succeeded
Synchronizing pcsd certificates on nodes MDA1PFP-PCS01, MDA1PFP-PCS02...
MDA1PFP-PCS01: Success
MDA1PFP-PCS02: Success

Restaring pcsd on the nodes in order to reload the certificates...
MDA1PFP-PCS01: Success
MDA1PFP-PCS02: Success
MDA1PFP-S02 10:03:02 1204 0 ~ # pcs cluster start --all
MDA1PFP-PCS01: Starting Cluster...
MDA1PFP-PCS02: Starting Cluster...
MDA1PFP-S02 10:03:03 1205 0 ~ # sleep 5
MDA1PFP-S02 10:03:08 1206 0 ~ # crm_attribute --type nodes --node MDA1PFP-PCS01 --name ServerRole --update PRIME
MDA1PFP-S02 10:03:08 1207 0 ~ # crm_attribute --type nodes --node MDA1PFP-PCS02 --name ServerRole --update BACKUP
MDA1PFP-S02 10:03:08 1208 0 ~ # pcs property set stonith-enabled=false
MDA1PFP-S02 10:03:08 1209 0 ~ # rm -f mda; pcs cluster cib mda
MDA1PFP-S02 10:03:08 1210 0 ~ # pcs -f mda property set no-quorum-policy=ignore
MDA1PFP-S02 10:03:08 1211 0 ~ #
MDA1PFP-S02 10:03:08 1211 0 ~ # pcs -f mda resource create mda-ip ocf:heartbeat:IPaddr2 ip=192.168.120.20 cidr_netmask=32 nic=bond0 op monitor interval=1s
MDA1PFP-S02 10:03:08 1212 0 ~ # pcs -f mda resource create ping ocf:pacemaker:ping dampen=5s multiplier=1000 host_list=pf-pep-dev-1  params timeout=1 attempts=3  op monitor interval=1 --clone
MDA1PFP-S02 10:03:12 1213 0 ~ # pcs -f mda constraint location mda-ip rule score=-INFINITY pingd lt 1 or not_defined pingd
MDA1PFP-S02 10:03:12 1214 0 ~ # pcs cluster cib-push mda
CIB updated
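
To spell out what the location rule above is meant to do: as far as I understand the ping agent, it sets the pingd node attribute to multiplier times the number of reachable hosts in host_list, so with a single host and multiplier=1000 the attribute is either 1000 or 0. The sketch below mimics the rule "pingd lt 1 or not_defined pingd" in plain shell. The function names pingd_value and vip_banned are made up for illustration, and nothing here talks to the cluster.

```shell
#!/bin/sh
# Hypothetical model of the ping attribute and the location rule; it only
# does the arithmetic and does not query Pacemaker.

# pingd_value MULTIPLIER REACHABLE: value assumed to be written by the ping agent.
pingd_value() {
    multiplier=$1
    reachable=$2
    echo $((multiplier * reachable))
}

# vip_banned PINGD: returns 0 when "pingd lt 1 or not_defined pingd" matches,
# i.e. when the node would get score -INFINITY for mda-ip.
# An empty argument stands for "attribute not defined".
vip_banned() {
    pingd=$1
    [ -z "$pingd" ] || [ "$pingd" -lt 1 ]
}

for reachable in 1 0; do
    value=$(pingd_value 1000 "$reachable")
    if vip_banned "$value"; then
        echo "pingd=$value: mda-ip banned from this node"
    else
        echo "pingd=$value: mda-ip allowed on this node"
    fi
done
```

With this model, a node that can still reach pf-pep-dev-1 has pingd=1000 and is not excluded by the rule, so the constraint alone should not keep the VIP off the second node.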

When I now unload the drivers on the active node, the VIP resource is stopped but never started on the other node, although that node can still ping.

MDA1PFP-S01 10:02:49 2162 0 ~ # modprobe -r bonding; modprobe -r ixgbe
MDA1PFP-S01 10:03:45 2163 0 ~ # pcs status
Cluster name: MDA1PFP
Last updated: Mon Sep 19 10:04:38 2016          Last change: Mon Sep 19 10:03:25 2016 by hacluster via crmd on MDA1PFP-PCS01
Stack: corosync
Current DC: MDA1PFP-PCS01 (version 1.1.13-10.el7-44eb2dd) - partition with quorum
2 nodes and 3 resources configured

Online: [ MDA1PFP-PCS01 MDA1PFP-PCS02 ]

Full list of resources:

 mda-ip (ocf::heartbeat:IPaddr2):       Stopped
 Clone Set: ping-clone [ping]
     Started: [ MDA1PFP-PCS01 MDA1PFP-PCS02 ]

Failed Actions:
* mda-ip_monitor_1000 on MDA1PFP-PCS01 'not configured' (6): call=14, status=complete, exitreason='Unknown interface [bond0] No such device.',
    last-rc-change='Mon Sep 19 10:03:45 2016', queued=0ms, exec=0ms

PCSD Status:
  MDA1PFP-PCS01: Online
  MDA1PFP-PCS02: Online

Daemon Status:
  corosync: active/disabled
  pacemaker: active/disabled
  pcsd: active/enabled
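
When repeating this test, the interesting part of the status output is the "Failed Actions" entry with its return code. A small sketch for pulling that out of the text is shown here. The function name failed_actions is hypothetical, and the pattern assumes the line layout shown above, which may differ in other pcs or Pacemaker versions.

```shell
#!/bin/sh
# Hypothetical helper: read "pcs status" text on stdin and print one line per
# failed action as: <operation> <node> rc=<code> (<reason>).
failed_actions() {
    sed -n "s/^\* \([^ ]*\) on \([^ ]*\) '\([^']*\)' (\([0-9]*\)):.*/\1 \2 rc=\4 (\3)/p"
}

# Sample input copied from the status output above; on a live node one would
# pipe the output of "pcs status" into the function instead.
failed_actions <<'EOF'
Failed Actions:
* mda-ip_monitor_1000 on MDA1PFP-PCS01 'not configured' (6): call=14, status=complete, exitreason='Unknown interface [bond0] No such device.',
    last-rc-change='Mon Sep 19 10:03:45 2016', queued=0ms, exec=0ms
EOF
```

For the output above this prints the monitor operation, the node MDA1PFP-PCS01 and rc=6, which is the value the policy engine reacts to in the logs.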

The log from the other node, to which the resource should be migrated, is:
Sep 19 10:03:12 MDA1PFP-S02 pcsd: Starting pcsd:
Sep 19 10:03:12 MDA1PFP-S02 systemd: Starting PCS GUI and remote configuration interface...
Sep 19 10:03:12 MDA1PFP-S02 systemd: Started PCS GUI and remote configuration interface.
Sep 19 10:03:15 MDA1PFP-S02 attrd[12444]:  notice: Updating all attributes after cib_refresh_notify event
Sep 19 10:03:15 MDA1PFP-S02 crmd[12446]:  notice: Notifications disabled
Sep 19 10:03:25 MDA1PFP-S02 crmd[12446]: warning: FSA: Input I_DC_TIMEOUT from crm_timer_popped() received in state S_PENDING
Sep 19 10:03:25 MDA1PFP-S02 crmd[12446]:  notice: State transition S_ELECTION -> S_PENDING [ input=I_PENDING cause=C_FSA_INTERNAL origin=do_election_count_vote ]
Sep 19 10:03:25 MDA1PFP-S02 crmd[12446]:  notice: State transition S_PENDING -> S_NOT_DC [ input=I_NOT_DC cause=C_HA_MESSAGE origin=do_cl_join_finalize_respond ]
Sep 19 10:03:25 MDA1PFP-S02 attrd[12444]:  notice: Processing sync-response from MDA1PFP-PCS01
Sep 19 10:03:26 MDA1PFP-S02 crmd[12446]:  notice: Operation ping_monitor_0: not running (node=MDA1PFP-PCS02, call=10, rc=7, cib-update=13, confirmed=true)
Sep 19 10:03:26 MDA1PFP-S02 crmd[12446]:  notice: Operation mda-ip_monitor_0: not running (node=MDA1PFP-PCS02, call=5, rc=7, cib-update=14, confirmed=true)
Sep 19 10:03:28 MDA1PFP-S02 crmd[12446]:  notice: Operation ping_start_0: ok (node=MDA1PFP-PCS02, call=11, rc=0, cib-update=15, confirmed=true)
Sep 19 10:03:48 MDA1PFP-S02 corosync[12425]: [TOTEM ] Marking ringid 1 interface 192.168.120.11 FAULTY

On the initially active node hosting the VIP, the log is:
Sep 19 10:03:45 MDA1PFP-S01 avahi-daemon[912]: Withdrawing address record for fe80::5eb9:1ff:fe9c:e7fc on bond0.
Sep 19 10:03:45 MDA1PFP-S01 avahi-daemon[912]: Withdrawing address record for 192.168.120.20 on bond0.
Sep 19 10:03:45 MDA1PFP-S01 avahi-daemon[912]: Withdrawing address record for 192.168.120.10 on bond0.
Sep 19 10:03:45 MDA1PFP-S01 avahi-daemon[912]: Withdrawing workstation service for bond0.
Sep 19 10:03:45 MDA1PFP-S01 NetworkManager[881]: <info>  (bond0): bond slave eno49 was released
Sep 19 10:03:45 MDA1PFP-S01 NetworkManager[881]: <info>  (eno49): released from master bond0
Sep 19 10:03:45 MDA1PFP-S01 NetworkManager[881]: <info>  (bond0): bond slave eno50 was released
Sep 19 10:03:45 MDA1PFP-S01 NetworkManager[881]: <info>  (eno50): released from master bond0
Sep 19 10:03:45 MDA1PFP-S01 NetworkManager[881]: <info>  (eno50): link disconnected
Sep 19 10:03:45 MDA1PFP-S01 gnome-session: Gjs-Message: JS LOG: Removing a network device that was not added
Sep 19 10:03:45 MDA1PFP-S01 avahi-daemon[912]: Withdrawing workstation service for eno50.
Sep 19 10:03:45 MDA1PFP-S01 NetworkManager[881]: <warn>  (eno50): failed to disable userspace IPv6LL address handling
Sep 19 10:03:45 MDA1PFP-S01 avahi-daemon[912]: Withdrawing workstation service for eno49.
Sep 19 10:03:45 MDA1PFP-S01 kernel: ixgbe 0000:04:00.1: complete
Sep 19 10:03:45 MDA1PFP-S01 NetworkManager[881]: <info>  (eno49): device state change: disconnected -> unmanaged (reason 'removed') [30 10 36]
Sep 19 10:03:45 MDA1PFP-S01 NetworkManager[881]: <warn>  (eno49): failed to disable userspace IPv6LL address handling
Sep 19 10:03:45 MDA1PFP-S01 kernel: ixgbe 0000:04:00.0: complete
Sep 19 10:03:45 MDA1PFP-S01 IPaddr2(mda-ip)[8714]: ERROR: Unknown interface [bond0] No such device.
Sep 19 10:03:45 MDA1PFP-S01 IPaddr2(mda-ip)[8714]: ERROR: [findif] failed
Sep 19 10:03:45 MDA1PFP-S01 lrmd[7794]:  notice: mda-ip_monitor_1000:8714:stderr [ ocf-exit-reason:Unknown interface [bond0] No such device. ]
Sep 19 10:03:45 MDA1PFP-S01 crmd[7797]:  notice: MDA1PFP-PCS01-mda-ip_monitor_1000:14 [ ocf-exit-reason:Unknown interface [bond0] No such device.\n ]
Sep 19 10:03:45 MDA1PFP-S01 crmd[7797]:  notice: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph ]
Sep 19 10:03:45 MDA1PFP-S01 pengine[7796]:  notice: On loss of CCM Quorum: Ignore
Sep 19 10:03:45 MDA1PFP-S01 pengine[7796]: warning: Processing failed op monitor for mda-ip on MDA1PFP-PCS01: not configured (6)
Sep 19 10:03:45 MDA1PFP-S01 pengine[7796]:   error: Preventing mda-ip from re-starting anywhere: operation monitor failed 'not configured' (6)
Sep 19 10:03:45 MDA1PFP-S01 pengine[7796]:  notice: Stop    mda-ip      (MDA1PFP-PCS01)
Sep 19 10:03:45 MDA1PFP-S01 pengine[7796]:  notice: Calculated Transition 3: /var/lib/pacemaker/pengine/pe-input-501.bz2
Sep 19 10:03:45 MDA1PFP-S01 crmd[7797]:  notice: Initiating action 3: stop mda-ip_stop_0 on MDA1PFP-PCS01 (local)
Sep 19 10:03:45 MDA1PFP-S01 crmd[7797]:  notice: MDA1PFP-PCS01-mda-ip_monitor_1000:14 [ ocf-exit-reason:Unknown interface [bond0] No such device.\n ]
Sep 19 10:03:45 MDA1PFP-S01 IPaddr2(mda-ip)[8745]: ERROR: Unknown interface [bond0] No such device.
Sep 19 10:03:45 MDA1PFP-S01 IPaddr2(mda-ip)[8745]: WARNING: [findif] failed
Sep 19 10:03:45 MDA1PFP-S01 lrmd[7794]:  notice: mda-ip_stop_0:8745:stderr [ ocf-exit-reason:Unknown interface [bond0] No such device. ]
Sep 19 10:03:45 MDA1PFP-S01 crmd[7797]:  notice: Operation mda-ip_stop_0: ok (node=MDA1PFP-PCS01, call=16, rc=0, cib-update=49, confirmed=true)
Sep 19 10:03:46 MDA1PFP-S01 crmd[7797]:  notice: Transition 3 (Complete=2, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-501.bz2): Complete
Sep 19 10:03:46 MDA1PFP-S01 crmd[7797]:  notice: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]
Sep 19 10:03:46 MDA1PFP-S01 crmd[7797]:  notice: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph ]
Sep 19 10:03:46 MDA1PFP-S01 pengine[7796]:  notice: On loss of CCM Quorum: Ignore
Sep 19 10:03:46 MDA1PFP-S01 pengine[7796]: warning: Processing failed op monitor for mda-ip on MDA1PFP-PCS01: not configured (6)
Sep 19 10:03:46 MDA1PFP-S01 pengine[7796]:   error: Preventing mda-ip from re-starting anywhere: operation monitor failed 'not configured' (6)
Sep 19 10:03:46 MDA1PFP-S01 pengine[7796]:  notice: Calculated Transition 4: /var/lib/pacemaker/pengine/pe-input-502.bz2
Sep 19 10:03:46 MDA1PFP-S01 crmd[7797]:  notice: Transition 4 (Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-502.bz2): Complete
Sep 19 10:03:46 MDA1PFP-S01 crmd[7797]:  notice: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]
Sep 19 10:03:46 MDA1PFP-S01 ntpd[24456]: Deleting interface #21 bond0, 192.168.120.20#123, interface stats: received=0, sent=0, dropped=0, active_time=12 secs
Sep 19 10:03:46 MDA1PFP-S01 ntpd[24456]: Deleting interface #19 bond0, fe80::5eb9:1ff:fe9c:e7fc#123, interface stats: received=0, sent=0, dropped=0, active_time=218 secs
Sep 19 10:03:46 MDA1PFP-S01 ntpd[24456]: Deleting interface #18 bond0, 192.168.120.10#123, interface stats: received=0, sent=0, dropped=0, active_time=218 secs
Sep 19 10:03:47 MDA1PFP-S01 corosync[7776]: [TOTEM ] Retransmit List: a0
Sep 19 10:03:47 MDA1PFP-S01 corosync[7776]: [TOTEM ] Retransmit List: a3 a5
Sep 19 10:03:47 MDA1PFP-S01 corosync[7776]: [TOTEM ] Retransmit List: a5 a7
Sep 19 10:03:47 MDA1PFP-S01 corosync[7776]: [TOTEM ] Retransmit List: a5
Sep 19 10:03:48 MDA1PFP-S01 corosync[7776]: [TOTEM ] Marking ringid 1 interface 192.168.120.10 FAULTY
Sep 19 10:03:54 MDA1PFP-S01 crmd[7797]:  notice: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph ]
Sep 19 10:03:54 MDA1PFP-S01 pengine[7796]:  notice: On loss of CCM Quorum: Ignore
Sep 19 10:03:54 MDA1PFP-S01 pengine[7796]: warning: Processing failed op monitor for mda-ip on MDA1PFP-PCS01: not configured (6)
Sep 19 10:03:54 MDA1PFP-S01 pengine[7796]:   error: Preventing mda-ip from re-starting anywhere: operation monitor failed 'not configured' (6)
Sep 19 10:03:54 MDA1PFP-S01 pengine[7796]:  notice: Calculated Transition 5: /var/lib/pacemaker/pengine/pe-input-503.bz2
Sep 19 10:03:54 MDA1PFP-S01 crmd[7797]:  notice: Transition 5 (Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-503.bz2): Complete
Sep 19 10:03:54 MDA1PFP-S01 crmd[7797]:  notice: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]

Best wishes,
  Jens
