[ClusterLabs] Weird Fencing Behavior

Confidential Company sgurovosa at gmail.com
Wed Jul 18 08:34:42 EDT 2018


>>>> Hi,
>>>>
>>>> On my two-node active/passive setup, I configured fencing via
>>>> fence_vmware_soap. I configured pcmk_delay=0 on both nodes, so I
>>>> expected that both nodes would be stonithed simultaneously.
>>>>
>>>> In my test scenario, Node1 has the ClusterIP resource. When I
>>>> physically disconnected the service/corosync link, Node1 was fenced
>>>> and Node2 stayed alive, even with pcmk_delay=0 on both nodes.
>>>>
>>>> Can you explain the behavior above?
>>>>
>>> #node1 could not connect to ESX because links were disconnected. As
>>> #the most obvious explanation.
>>>
>>> #You have logs, you are the only one who can answer this question
>>> #with some certainty. Others can only guess.
>>>
>>>
>>> Oops, my bad. I forgot to mention: I have two interfaces on each
>>> virtual machine (node). The second interface is used for the ESX link,
>>> so fencing can still be executed even when the corosync link is
>>> disconnected. Looking forward to your response. Thanks
>> #Having no fence delay means a death match (each node killing the other)
>> #is possible, but it doesn't guarantee that it will happen. Some of the
>> #time, one node will detect the outage and fence the other one before
>> #the other one can react.
>>
>> #It's basically an Old West shoot-out -- they may reach for their guns
>> #at the same time, but one may be quicker.
>>
>> #As Andrei suggested, the logs from both nodes could give you a timeline
>> #of what happened when.
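
For reference, if I wanted to break that tie by favoring one node, my
understanding is that a fence delay could be put on just one device,
something like this (a sketch based on the device names in my config
further below; pcmk_delay_max adds a random delay up to the given value,
and newer Pacemaker versions also offer a static pcmk_delay_base):

    # delay fencing of ArcosRhel1 so that, in a tie, ArcosRhel1 gets to
    # fence ArcosRhel2 first (the 10s value is just an example)
    pcs stonith update Fence1 pcmk_delay_max=10s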
>>
>>
>> Hi Andrei, kindly see the logs below. Based on the log timestamps, Node1
>> should have fenced Node2 first, but in the actual test, Node1 was
>> fenced/shut down by Node2.
>>
> Node1 tried to fence but failed. It could be connectivity, it could be
> credentials.
>

Maybe this is the reason, but it's still weird. I have run many tests and
they all follow the same pattern: the node that was physically
disconnected is the one that gets fenced. It's not random.

See diagram on this link:
https://drive.google.com/open?id=1pbJef_wJdQelJSv1L72c4H6NAvUqV_p-
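
To rule out connectivity or credentials, my next step would be to run the
fence agent by hand from Node1 over the second (ESX-facing) interface
while the corosync link is pulled, roughly like this (a sketch using the
fence2 address and credentials from my config below):

    # can Node1 still reach and log in to the fencing host at all?
    fence_vmware_soap -a 172.16.10.152 -l admin -p 123pass \
        --ssl-insecure -o list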

Also, based on my tests, if Node1 gets fenced it does not automatically
run the cluster again after reboot. Node2 is different: even after a
reboot, it automatically runs and rejoins the cluster.
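
I suspect that difference just comes down to whether the cluster services
are enabled at boot on each node, which I still need to verify, e.g.
(a sketch):

    systemctl is-enabled corosync pacemaker   # check autostart on each node
    pcs cluster enable                        # enable autostart on this node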


>> Is it possible to have a two-node active/passive setup in
>> pacemaker/corosync where the node that gets disconnected / has its
>> interface go down is the only one that gets fenced?
>>
> If you could determine which node was disconnected you would not need
> any fencing at all.

#True, but there is still good reason to take connectivity into account.
#Of course the intended survivor can't know directly that its peer got
#disconnected.
#But what you can do is: if you see that you are disconnected yourself
#(e.g. ping connection to routers, test access to some web servers, ...)
#you can decide to shoot with a delay, or not to shoot at all, because
#starting services locally would be no good anyway.
#That is the basic idea behind the fence_heuristics_ping fence agent.
#There was some discussion just recently on the list about approaches
#like that.

#Regards,
#Klaus

fence_heuristics_ping does not seem to be available in my RHEL 7 release.
I wonder if it is deprecated.
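
If it does turn out to be available (it ships with newer fence-agents
packages), my understanding is that it would be layered in front of the
real fence device via a fencing topology, roughly like this (a sketch;
the device name ping-check and the router address 172.16.10.1 are just
placeholders, and the parameter names should be checked against the
agent's metadata):

    pcs stonith create ping-check fence_heuristics_ping \
        ping_targets="172.16.10.1" pcmk_host_list="ArcosRhel1 ArcosRhel2"
    # require the ping heuristic to pass before the real fencing is attempted
    pcs stonith level add 1 ArcosRhel1 ping-check,Fence1
    pcs stonith level add 1 ArcosRhel2 ping-check,fence2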


>> Thanks guys
>>
>> *LOGS from Node2:*
>>
>> Jul 17 13:33:27 ArcosRhel2 corosync[1048]: [TOTEM ] A processor failed,
>> forming new configuration.
> ...
>> Jul 17 13:33:30 ArcosRhel2 pengine[1083]: warning: Node ArcosRhel1 will be
>> fenced because the node is no longer part of the cluster
> ...
>> Jul 17 13:33:50 ArcosRhel2 stonith-ng[1080]:  notice: Operation 'reboot'
>> [2323] (call 2 from crmd.1084) for host 'ArcosRhel1' with device 'Fence1'
>> returned: 0 (OK)
>> Jul 17 13:33:50 ArcosRhel2 stonith-ng[1080]:  notice: Operation reboot of
>> ArcosRhel1 by ArcosRhel2 for crmd.1084 at ArcosRhel2.0426e6e1: OK
>> Jul 17 13:33:50 ArcosRhel2 crmd[1084]:  notice: Stonith operation
>> 2/12:0:0:f9418e1f-1f13-4033-9eaa-aec705f807ef: OK (0)
>> Jul 17 13:33:50 ArcosRhel2 crmd[1084]:  notice: Peer ArcosRhel1 was
>> terminated (reboot) by ArcosRhel2 for ArcosRhel2: OK
> ...
>>
>>
>> *LOGS from NODE1*
>> Jul 17 13:33:26 ArcoSRhel1 corosync[1464]: [TOTEM ] A processor failed,
>> forming new configuration....
>> Jul 17 13:33:28 ArcoSRhel1 pengine[1476]: warning: Node ArcosRhel2 will be
>> fenced because the node is no longer part of the cluster
> ...
>> Jul 17 13:33:28 ArcoSRhel1 stonith-ng[1473]: warning: Mapping action='off'
>> to pcmk_reboot_action='off'
>> Jul 17 13:33:28 ArcoSRhel1 stonith-ng[1473]:  notice: Fence1 can not fence
>> (reboot) ArcosRhel2: static-list
>> Jul 17 13:33:28 ArcoSRhel1 stonith-ng[1473]:  notice: fence2 can fence
>> (reboot) ArcosRhel2: static-list
>> Jul 17 13:33:28 ArcoSRhel1 stonith-ng[1473]:  notice: Fence1 can not fence
>> (reboot) ArcosRhel2: static-list
>> Jul 17 13:33:28 ArcoSRhel1 stonith-ng[1473]:  notice: fence2 can fence
>> (reboot) ArcosRhel2: static-list
>> Jul 17 13:33:46 ArcoSRhel1 fence_vmware_soap: Unable to connect/login to
>> fencing device
>> Jul 17 13:33:46 ArcoSRhel1 stonith-ng[1473]: warning:
>> fence_vmware_soap[7157] stderr: [ Unable to connect/login to fencing
>> device ]
>> Jul 17 13:33:46 ArcoSRhel1 stonith-ng[1473]: warning:
>> fence_vmware_soap[7157] stderr: [  ]
>> Jul 17 13:33:46 ArcoSRhel1 stonith-ng[1473]: warning:
>> fence_vmware_soap[7157] stderr: [  ]
>>
>>
>>
>>
>>
>>
>>>> See my config below:
>>>>
>>>> [root at ArcosRhel2 cluster]# pcs config
>>>> Cluster Name: ARCOSCLUSTER
>>>> Corosync Nodes:
>>>>  ArcosRhel1 ArcosRhel2
>>>> Pacemaker Nodes:
>>>>  ArcosRhel1 ArcosRhel2
>>>>
>>>> Resources:
>>>>  Resource: ClusterIP (class=ocf provider=heartbeat type=IPaddr2)
>>>>   Attributes: cidr_netmask=32 ip=172.16.10.243
>>>>   Operations: monitor interval=30s (ClusterIP-monitor-interval-30s)
>>>>               start interval=0s timeout=20s (ClusterIP-start-interval-0s)
>>>>               stop interval=0s timeout=20s (ClusterIP-stop-interval-0s)
>>>> Stonith Devices:
>>>>  Resource: Fence1 (class=stonith type=fence_vmware_soap)
>>>>   Attributes: action=off ipaddr=172.16.10.151 login=admin passwd=123pass
>>>>    pcmk_host_list=ArcosRhel1 pcmk_monitor_timeout=60s
>>>>    port=ArcosRhel1(Joniel) ssl_insecure=1 pcmk_delay_max=0s
>>>>   Operations: monitor interval=60s (Fence1-monitor-interval-60s)
>>>>  Resource: fence2 (class=stonith type=fence_vmware_soap)
>>>>   Attributes: action=off ipaddr=172.16.10.152 login=admin passwd=123pass
>>>>    pcmk_delay_max=0s pcmk_host_list=ArcosRhel2 pcmk_monitor_timeout=60s
>>>>    port=ArcosRhel2(Ben) ssl_insecure=1
>>>>   Operations: monitor interval=60s (fence2-monitor-interval-60s)
>>>> Fencing Levels:
>>>>
>>>> Location Constraints:
>>>>   Resource: Fence1
>>>>     Enabled on: ArcosRhel2 (score:INFINITY)
>>>>                 (id:location-Fence1-ArcosRhel2-INFINITY)
>>>>   Resource: fence2
>>>>     Enabled on: ArcosRhel1 (score:INFINITY)
>>>>                 (id:location-fence2-ArcosRhel1-INFINITY)
>>>> Ordering Constraints:
>>>> Colocation Constraints:
>>>> Ticket Constraints:
>>>>
>>>> Alerts:
>>>>  No alerts defined
>>>>
>>>> Resources Defaults:
>>>>  No defaults set
>>>> Operations Defaults:
>>>>  No defaults set
>>>>
>>>> Cluster Properties:
>>>>  cluster-infrastructure: corosync
>>>>  cluster-name: ARCOSCLUSTER
>>>>  dc-version: 1.1.16-12.el7-94ff4df
>>>>  have-watchdog: false
>>>>  last-lrm-refresh: 1531810841
>>>>  stonith-enabled: true
>>>>
>>>> Quorum:
>>>>   Options:
