[ClusterLabs] Weird Fencing Behavior

Wed Jul 18 01:21:51 UTC 2018

> > Hi,
> >
> > On my two-node active/passive setup, I configured fencing via
> > fence_vmware_soap. I configured pcmk_delay=0 on both nodes, so I
> > expected that both nodes would be stonithed simultaneously.
> >
> > In my test scenario, Node1 has the ClusterIP resource. When I
> > physically disconnected the service/corosync link, Node1 was fenced
> > and Node2 stayed alive, even though pcmk_delay=0 is set on both nodes.
> >
> > Can you explain the behavior above?
> >
>
> #node1 could not connect to ESX because the links were disconnected.
> #That is the most obvious explanation.
>
> #You have logs, you are the only one who can answer this question
> #with some certainty. Others can only guess.
>
>
> Oops, my bad, I forgot to mention: I have two interfaces on each virtual
> machine (node). The second interface is used for the ESX links, so fencing
> can be executed even though the corosync links were disconnected. Looking
> forward to your response. Thanks

#Having no fence delay means a death match (each node killing the other)
#is possible, but it doesn't guarantee that it will happen. Some of the
#time, one node will detect the outage and fence the other one before
#the other one can react.

#It's basically an Old West shoot-out -- they may reach for their guns
#at the same time, but one may be quicker.

#As Andrei suggested, the logs from both nodes could give you a timeline
#of what happened when.
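
To make the shoot-out picture concrete, here is a minimal Python sketch of the race. It is only an illustration, not how Pacemaker decides anything: the function names, the timing ranges, and the `overlap` window (how close the two fence completions must be for both nodes to go down) are all hypothetical.

```python
import random


def survivor(done_1, done_2, overlap=0.0):
    """Name the node left running, given when each node's fence action completes.

    done_1 / done_2: seconds after the link failure at which node1 / node2
    finishes fencing its peer. If both finish within `overlap` seconds of each
    other, both are assumed to go down (the "death match").
    """
    if abs(done_1 - done_2) <= overlap:
        return "neither"
    return "node1" if done_1 < done_2 else "node2"


def simulate(trials, overlap=0.5, seed=0):
    """Count outcomes over many simulated link failures with no fence delay."""
    rng = random.Random(seed)
    counts = {"node1": 0, "node2": 0, "neither": 0}
    for _ in range(trials):
        # Hypothetical timings: 1-3 s to notice the outage, 15-25 s for the
        # fence agent to finish. Both nodes draw from the same distribution.
        done_1 = rng.uniform(1, 3) + rng.uniform(15, 25)
        done_2 = rng.uniform(1, 3) + rng.uniform(15, 25)
        counts[survivor(done_1, done_2, overlap)] += 1
    return counts


if __name__ == "__main__":
    print(simulate(10_000))
```

Under these assumed timings most runs end with exactly one survivor, and only the runs where both fence actions finish almost together end with both nodes down.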

Hi Andrei, kindly see the logs below. Based on the log timestamps, Node1
should have fenced Node2 first, but in the actual test, Node1 was fenced
(shut down) by Node2.

Is it possible to have a 2-node active/passive setup in pacemaker/corosync
in which only the node that gets disconnected (interface down) is
fenced?

Thanks guys

*LOGS from Node2:*

Jul 17 13:33:27 ArcosRhel2 corosync[1048]: [TOTEM ] A processor failed,
forming new configuration.
Jul 17 13:33:28 ArcosRhel2 corosync[1048]: [TOTEM ] A new membership (
172.16.10.242:220) was formed. Members left: 1
Jul 17 13:33:28 ArcosRhel2 corosync[1048]: [TOTEM ] Failed to receive the
leave message. failed: 1
Jul 17 13:33:28 ArcosRhel2 corosync[1048]: [QUORUM] Members[1]: 2
Jul 17 13:33:28 ArcosRhel2 corosync[1048]: [MAIN  ] Completed service
synchronization, ready to provide service.
Jul 17 13:33:28 ArcosRhel2 attrd[1082]:  notice: Node ArcosRhel1 state is
now lost
Jul 17 13:33:28 ArcosRhel2 attrd[1082]:  notice: Removing all ArcosRhel1
attributes for peer loss
Jul 17 13:33:28 ArcosRhel2 attrd[1082]:  notice: Lost attribute writer
ArcosRhel1
Jul 17 13:33:28 ArcosRhel2 attrd[1082]:  notice: Purged 1 peers with id=1
and/or uname=ArcosRhel1 from the membership cache
Jul 17 13:33:28 ArcosRhel2 cib[1079]:  notice: Node ArcosRhel1 state is now
lost
Jul 17 13:33:28 ArcosRhel2 cib[1079]:  notice: Purged 1 peers with id=1
and/or uname=ArcosRhel1 from the membership cache
Jul 17 13:33:28 ArcosRhel2 crmd[1084]:  notice: Node ArcosRhel1 state is
now lost
Jul 17 13:33:28 ArcosRhel2 crmd[1084]: warning: Our DC node (ArcosRhel1)
left the cluster
Jul 17 13:33:28 ArcosRhel2 pacemakerd[1074]:  notice: Node ArcosRhel1 state
is now lost
Jul 17 13:33:28 ArcosRhel2 stonith-ng[1080]:  notice: Node ArcosRhel1 state
is now lost
Jul 17 13:33:28 ArcosRhel2 stonith-ng[1080]:  notice: Purged 1 peers with
id=1 and/or uname=ArcosRhel1 from the membership cache
Jul 17 13:33:28 ArcosRhel2 crmd[1084]:  notice: State transition S_NOT_DC
-> S_ELECTION
Jul 17 13:33:28 ArcosRhel2 crmd[1084]:  notice: State transition S_ELECTION
-> S_INTEGRATION
Jul 17 13:33:28 ArcosRhel2 crmd[1084]: warning: Input I_ELECTION_DC
received in state S_INTEGRATION from do_election_check
Jul 17 13:33:30 ArcosRhel2 pengine[1083]: warning: Node ArcosRhel1 will be
fenced because the node is no longer part of the cluster
Jul 17 13:33:30 ArcosRhel2 pengine[1083]: warning: Node ArcosRhel1 is
unclean
Jul 17 13:33:30 ArcosRhel2 pengine[1083]: warning: Action fence2_stop_0 on
ArcosRhel1 is unrunnable (offline)
Jul 17 13:33:30 ArcosRhel2 pengine[1083]: warning: Action ClusterIP_stop_0
on ArcosRhel1 is unrunnable (offline)
Jul 17 13:33:30 ArcosRhel2 pengine[1083]: warning: Scheduling Node
ArcosRhel1 for STONITH
Jul 17 13:33:30 ArcosRhel2 pengine[1083]:  notice: Move
 fence2#011(Started ArcosRhel1 -> ArcosRhel2)
Jul 17 13:33:30 ArcosRhel2 pengine[1083]:  notice: Move
 ClusterIP#011(Started ArcosRhel1 -> ArcosRhel2)
Jul 17 13:33:30 ArcosRhel2 pengine[1083]: warning: Calculated transition 0
(with warnings), saving inputs in /var/lib/pacemaker/pengine/pe-warn-20.bz2
Jul 17 13:33:30 ArcosRhel2 crmd[1084]:  notice: Requesting fencing (reboot)
of node ArcosRhel1
Jul 17 13:33:30 ArcosRhel2 crmd[1084]:  notice: Initiating start operation
fence2_start_0 locally on ArcosRhel2
Jul 17 13:33:30 ArcosRhel2 stonith-ng[1080]:  notice: Client
crmd.1084.cd70178e wants to fence (reboot) 'ArcosRhel1' with device '(any)'
Jul 17 13:33:30 ArcosRhel2 stonith-ng[1080]:  notice: Requesting peer
fencing (reboot) of ArcosRhel1
Jul 17 13:33:30 ArcosRhel2 stonith-ng[1080]:  notice: Fence1 can fence
(reboot) ArcosRhel1: static-list
Jul 17 13:33:30 ArcosRhel2 stonith-ng[1080]:  notice: fence2 can not fence
(reboot) ArcosRhel1: static-list
Jul 17 13:33:30 ArcosRhel2 stonith-ng[1080]:  notice: Fence1 can fence
(reboot) ArcosRhel1: static-list
Jul 17 13:33:30 ArcosRhel2 stonith-ng[1080]:  notice: fence2 can not fence
(reboot) ArcosRhel1: static-list
Jul 17 13:33:30 ArcosRhel2 stonith-ng[1080]: warning: fence2 has 'action'
parameter, which should never be specified in configuration
Jul 17 13:33:30 ArcosRhel2 stonith-ng[1080]: warning: Mapping action='off'
to pcmk_reboot_action='off'
Jul 17 13:33:49 ArcosRhel2 crmd[1084]:  notice: Result of start operation
for fence2 on ArcosRhel2: 0 (ok)
Jul 17 13:33:49 ArcosRhel2 crmd[1084]:  notice: Initiating monitor
operation fence2_monitor_60000 locally on ArcosRhel2
Jul 17 13:33:50 ArcosRhel2 stonith-ng[1080]:  notice: Operation 'reboot'
[2323] (call 2 from crmd.1084) for host 'ArcosRhel1' with device 'Fence1'
returned: 0 (OK)
Jul 17 13:33:50 ArcosRhel2 stonith-ng[1080]:  notice: Operation reboot of
ArcosRhel1 by ArcosRhel2 for crmd.1084 at ArcosRhel2.0426e6e1: OK
Jul 17 13:33:50 ArcosRhel2 crmd[1084]:  notice: Stonith operation
2/12:0:0:f9418e1f-1f13-4033-9eaa-aec705f807ef: OK (0)
Jul 17 13:33:50 ArcosRhel2 crmd[1084]:  notice: Peer ArcosRhel1 was
terminated (reboot) by ArcosRhel2 for ArcosRhel2: OK
(ref=0426e6e1-cfda-4475-b32d-8f7bce17027b) by client crmd.1084
Jul 17 13:33:50 ArcosRhel2 crmd[1084]:  notice: Initiating start operation
ClusterIP_start_0 locally on ArcosRhel2
Jul 17 13:33:50 ArcosRhel2 IPaddr2(ClusterIP)[2342]: INFO: Adding inet
address 172.16.10.243/32 with broadcast address 172.16.10.255 to device
ens192
Jul 17 13:33:51 ArcosRhel2 IPaddr2(ClusterIP)[2342]: INFO: Bringing device
ens192 up
Jul 17 13:33:51 ArcosRhel2 IPaddr2(ClusterIP)[2342]: INFO:
/usr/libexec/heartbeat/send_arp -i 200 -c 5 -p
/var/run/resource-agents/send_arp-172.16.10.243 -I ens192 -m auto
172.16.10.243
Jul 17 13:33:52 ArcosRhel2 ntpd[1821]: Listen normally on 8 ens192
172.16.10.243 UDP 123
Jul 17 13:33:55 ArcosRhel2 crmd[1084]:  notice: Result of start operation
for ClusterIP on ArcosRhel2: 0 (ok)
Jul 17 13:33:58 ArcosRhel2 crmd[1084]:  notice: Initiating monitor
operation ClusterIP_monitor_30000 locally on ArcosRhel2
Jul 17 13:33:58 ArcosRhel2 crmd[1084]:  notice: Transition 0 (Complete=9,
Pending=0, Fired=0, Skipped=0, Incomplete=0,
Source=/var/lib/pacemaker/pengine/pe-warn-20.bz2): Complete
Jul 17 13:33:58 ArcosRhel2 crmd[1084]:  notice: State transition
S_TRANSITION_ENGINE -> S_IDLE
Jul 17 13:34:43 ArcosRhel2 ntpd[1821]: 0.0.0.0 0612 02 freq_set kernel
-40.734 PPM
Jul 17 13:34:43 ArcosRhel2 ntpd[1821]: 0.0.0.0 0615 05 clock_sync

*LOGS from Node1:*

Jul 17 13:33:26 ArcoSRhel1 corosync[1464]: [TOTEM ] A processor failed,
forming new configuration.
Jul 17 13:33:28 ArcoSRhel1 corosync[1464]: [TOTEM ] A new membership (
172.16.10.241:220) was formed. Members left: 2
Jul 17 13:33:28 ArcoSRhel1 corosync[1464]: [TOTEM ] Failed to receive the
leave message. failed: 2
Jul 17 13:33:28 ArcoSRhel1 corosync[1464]: [QUORUM] Members[1]: 1
Jul 17 13:33:28 ArcoSRhel1 corosync[1464]: [MAIN  ] Completed service
synchronization, ready to provide service.
Jul 17 13:33:28 ArcoSRhel1 stonith-ng[1473]:  notice: Node ArcosRhel2 state
is now lost
Jul 17 13:33:28 ArcoSRhel1 stonith-ng[1473]:  notice: Purged 1 peers with
id=2 and/or uname=ArcosRhel2 from the membership cache
Jul 17 13:33:28 ArcoSRhel1 attrd[1475]:  notice: Node ArcosRhel2 state is
now lost
Jul 17 13:33:28 ArcoSRhel1 attrd[1475]:  notice: Removing all ArcosRhel2
attributes for peer loss
Jul 17 13:33:28 ArcoSRhel1 attrd[1475]:  notice: Purged 1 peers with id=2
and/or uname=ArcosRhel2 from the membership cache
Jul 17 13:33:28 ArcoSRhel1 cib[1472]:  notice: Node ArcosRhel2 state is now
lost
Jul 17 13:33:28 ArcoSRhel1 cib[1472]:  notice: Purged 1 peers with id=2
and/or uname=ArcosRhel2 from the membership cache
Jul 17 13:33:28 ArcoSRhel1 crmd[1477]:  notice: Node ArcosRhel2 state is
now lost
Jul 17 13:33:28 ArcoSRhel1 crmd[1477]: warning: No reason to expect node 2
to be down
Jul 17 13:33:28 ArcoSRhel1 crmd[1477]:  notice: Stonith/shutdown of
ArcosRhel2 not matched
Jul 17 13:33:28 ArcoSRhel1 pacemakerd[1471]:  notice: Node ArcosRhel2 state
is now lost
Jul 17 13:33:28 ArcoSRhel1 crmd[1477]:  notice: State transition S_IDLE ->
S_POLICY_ENGINE
Jul 17 13:33:28 ArcoSRhel1 crmd[1477]: warning: No reason to expect node 2
to be down
Jul 17 13:33:28 ArcoSRhel1 crmd[1477]:  notice: Stonith/shutdown of
ArcosRhel2 not matched
Jul 17 13:33:28 ArcoSRhel1 pengine[1476]: warning: Node ArcosRhel2 will be
fenced because the node is no longer part of the cluster
Jul 17 13:33:28 ArcoSRhel1 pengine[1476]: warning: Node ArcosRhel2 is
unclean
Jul 17 13:33:28 ArcoSRhel1 pengine[1476]: warning: Action Fence1_stop_0 on
ArcosRhel2 is unrunnable (offline)
Jul 17 13:33:28 ArcoSRhel1 pengine[1476]: warning: Scheduling Node
ArcosRhel2 for STONITH
Jul 17 13:33:28 ArcoSRhel1 pengine[1476]:  notice: Move
 Fence1#011(Started ArcosRhel2 -> ArcosRhel1)
Jul 17 13:33:28 ArcoSRhel1 pengine[1476]: warning: Calculated transition 4
(with warnings), saving inputs in /var/lib/pacemaker/pengine/pe-warn-8.bz2
Jul 17 13:33:28 ArcoSRhel1 crmd[1477]:  notice: Requesting fencing (reboot)
of node ArcosRhel2
Jul 17 13:33:28 ArcoSRhel1 crmd[1477]:  notice: Initiating start operation
Fence1_start_0 locally on ArcosRhel1
Jul 17 13:33:28 ArcoSRhel1 stonith-ng[1473]:  notice: Client
crmd.1477.6d888347 wants to fence (reboot) 'ArcosRhel2' with device '(any)'
Jul 17 13:33:28 ArcoSRhel1 stonith-ng[1473]:  notice: Requesting peer
fencing (reboot) of ArcosRhel2
Jul 17 13:33:28 ArcoSRhel1 stonith-ng[1473]: warning: Fence1 has 'action'
parameter, which should never be specified in configuration
Jul 17 13:33:28 ArcoSRhel1 stonith-ng[1473]: warning: Mapping action='off'
to pcmk_reboot_action='off'
Jul 17 13:33:28 ArcoSRhel1 stonith-ng[1473]:  notice: Fence1 can not fence
(reboot) ArcosRhel2: static-list
Jul 17 13:33:28 ArcoSRhel1 stonith-ng[1473]:  notice: fence2 can fence
(reboot) ArcosRhel2: static-list
Jul 17 13:33:28 ArcoSRhel1 stonith-ng[1473]:  notice: Fence1 can not fence
(reboot) ArcosRhel2: static-list
Jul 17 13:33:28 ArcoSRhel1 stonith-ng[1473]:  notice: fence2 can fence
(reboot) ArcosRhel2: static-list
Jul 17 13:33:46 ArcoSRhel1 fence_vmware_soap: Unable to connect/login to
fencing device
Jul 17 13:33:46 ArcoSRhel1 stonith-ng[1473]: warning:
fence_vmware_soap[7157] stderr: [ Unable to connect/login to fencing device
]
Jul 17 13:33:46 ArcoSRhel1 stonith-ng[1473]: warning:
fence_vmware_soap[7157] stderr: [  ]
Jul 17 13:33:46 ArcoSRhel1 stonith-ng[1473]: warning:
fence_vmware_soap[7157] stderr: [  ]
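
The two logs above can be merged into a single timeline. The following is a rough Python sketch that does this for six lines copied from the logs (unwrapped). The year is an assumption taken from the message date, since syslog lines carry none, and the helper names are made up for illustration.

```python
from datetime import datetime

YEAR = 2018  # assumption: syslog timestamps have no year; taken from the message date

# Key lines copied from the Node1 and Node2 logs, with the email line wraps removed.
NODE1_LOG = [
    "Jul 17 13:33:26 ArcoSRhel1 corosync[1464]: [TOTEM ] A processor failed, forming new configuration.",
    "Jul 17 13:33:28 ArcoSRhel1 crmd[1477]:  notice: Requesting fencing (reboot) of node ArcosRhel2",
    "Jul 17 13:33:46 ArcoSRhel1 fence_vmware_soap: Unable to connect/login to fencing device",
]
NODE2_LOG = [
    "Jul 17 13:33:27 ArcosRhel2 corosync[1048]: [TOTEM ] A processor failed, forming new configuration.",
    "Jul 17 13:33:30 ArcosRhel2 crmd[1084]:  notice: Requesting fencing (reboot) of node ArcosRhel1",
    "Jul 17 13:33:50 ArcosRhel2 stonith-ng[1080]:  notice: Operation reboot of ArcosRhel1 by ArcosRhel2 for crmd.1084 at ArcosRhel2.0426e6e1: OK",
]


def parse(line):
    """Split one syslog line into (timestamp, host, message)."""
    # The first 15 characters are the "Mon DD HH:MM:SS" timestamp.
    stamp = datetime.strptime(f"{YEAR} {line[:15]}", "%Y %b %d %H:%M:%S")
    host, message = line[16:].split(None, 1)
    return stamp, host, message


def build_timeline(lines):
    """Return (seconds since first event, host, message), sorted by time."""
    events = sorted((parse(line) for line in lines), key=lambda e: e[0])
    start = events[0][0]
    return [(int((t - start).total_seconds()), h, m) for t, h, m in events]


if __name__ == "__main__":
    for offset, host, message in build_timeline(NODE1_LOG + NODE2_LOG):
        print(f"+{offset:>2}s  {host}  {message}")
```

Read this way, Node1 requested fencing first (13:33:28), but its fence_vmware_soap call reported "Unable to connect/login to fencing device" at 13:33:46, while Node2's request at 13:33:30 returned OK at 13:33:50. The logs alone do not say why Node1's agent could not connect.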

> > See my config below:
> >
> > [root@ArcosRhel2 cluster]# pcs config
> > Cluster Name: ARCOSCLUSTER
> > Corosync Nodes:
> >  ArcosRhel1 ArcosRhel2
> > Pacemaker Nodes:
> >  ArcosRhel1 ArcosRhel2
> >
> > Resources:
> >  Resource: ClusterIP (class=ocf provider=heartbeat type=IPaddr2)
> >   Attributes: cidr_netmask=32 ip=172.16.10.243
> >   Operations: monitor interval=30s (ClusterIP-monitor-interval-30s)
> >               start interval=0s timeout=20s (ClusterIP-start-interval-0s)
> >               stop interval=0s timeout=20s (ClusterIP-stop-interval-0s)
> >
> > Stonith Devices:
> >  Resource: Fence1 (class=stonith type=fence_vmware_soap)
> >   Attributes: action=off ipaddr=172.16.10.151 login=admin passwd=123pass pcmk_host_list=ArcosRhel1 pcmk_monitor_timeout=60s port=ArcosRhel1(Joniel) ssl_insecure=1 pcmk_delay_max=0s
> >   Operations: monitor interval=60s (Fence1-monitor-interval-60s)
> >  Resource: fence2 (class=stonith type=fence_vmware_soap)
> >   Attributes: action=off ipaddr=172.16.10.152 login=admin passwd=123pass pcmk_delay_max=0s pcmk_host_list=ArcosRhel2 pcmk_monitor_timeout=60s port=ArcosRhel2(Ben) ssl_insecure=1
> >   Operations: monitor interval=60s (fence2-monitor-interval-60s)
> > Fencing Levels:
> >
> > Location Constraints:
> >   Resource: Fence1
> >     Enabled on: ArcosRhel2 (score:INFINITY) (id:location-Fence1-ArcosRhel2-INFINITY)
> >   Resource: fence2
> >     Enabled on: ArcosRhel1 (score:INFINITY) (id:location-fence2-ArcosRhel1-INFINITY)
> > Ordering Constraints:
> > Colocation Constraints:
> > Ticket Constraints:
> >
> > Alerts:
> >  No alerts defined
> >
> > Resources Defaults:
> >  No defaults set
> > Operations Defaults:
> >  No defaults set
> >
> > Cluster Properties:
> >  cluster-infrastructure: corosync
> >  cluster-name: ARCOSCLUSTER
> >  dc-version: 1.1.16-12.el7-94ff4df
> >  have-watchdog: false
> >  last-lrm-refresh: 1531810841
> >  stonith-enabled: true
> >
> > Quorum:
> >   Options: