[ClusterLabs] Weird Fencing Behavior

Klaus Wenninger kwenning at redhat.com
Wed Jul 18 06:25:16 EDT 2018


On 07/18/2018 06:22 AM, Andrei Borzenkov wrote:
> 18.07.2018 04:21, Confidential Company wrote:
>>>> Hi,
>>>>
>>>> On my two-node active/passive setup, I configured fencing via
>>>> fence_vmware_soap. I configured pcmk_delay=0 on both nodes, so I expected
>>>> that both nodes would be stonithed simultaneously.
>>>>
>>>> In my test scenario, Node1 held the ClusterIP resource. When I physically
>>>> disconnected the service/corosync link, Node1 was fenced and Node2 stayed
>>>> alive, despite pcmk_delay=0 being set on both nodes.
>>>>
>>>> Can you explain the behavior above?
>>>>
>>> #Node1 could not connect to ESX because the links were disconnected.
>>> #That is the most obvious explanation.
>>>
>>> #You have the logs; you are the only one who can answer this question
>>> #with some certainty. Others can only guess.
>>>
>>>
>>> Oops, my bad. I forgot to mention: I have two interfaces on each virtual
>>> machine (node). The second interface is used for the ESX link, so fencing
>>> can be executed even when the corosync link is disconnected. Looking
>>> forward to your response. Thanks
>> #Having no fence delay means a death match (each node killing the other)
>> #is possible, but it doesn't guarantee that it will happen. Some of the
>> #time, one node will detect the outage and fence the other one before
>> #the other one can react.
>>
>> #It's basically an Old West shoot-out -- they may reach for their guns
>> #at the same time, but one may be quicker.
>>
>> #As Andrei suggested, the logs from both nodes could give you a timeline
>> #of what happened when.
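
A common way to break such ties is to give one fence device a delay, so the
preferred survivor always gets the first shot. A minimal sketch with pcs,
assuming ArcosRhel1 (the node holding ClusterIP) should win the race; the 10s
value is only illustrative, and pcmk_delay_base needs a newer Pacemaker than
the 1.1.16 shown further down (older builds only offer the random
pcmk_delay_max):

  # Delay the device that fences ArcosRhel1, so ArcosRhel2 waits 10s
  # before shooting while ArcosRhel1 can fence immediately.
  pcs stonith update Fence1 pcmk_delay_base=10s

  # On older Pacemaker, a random delay of up to 10s on the same device:
  pcs stonith update Fence1 pcmk_delay_max=10s
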
>>
>>
>> Hi Andrei, kindly see the logs below. Based on the timestamps, Node1 should
>> have fenced Node2 first, but in the actual test Node1 was fenced/shut down
>> by Node2.
>>
> Node1 tried to fence but failed. It could be connectivity, it could be
> credentials.
>
>> Is it possible to have a two-node active/passive setup in pacemaker/corosync
>> where the node that gets disconnected (interface down) is the only one that
>> gets fenced?
>>
> If you could determine which node was disconnected you would not need
> any fencing at all.

True, but there is still good reason to take connectivity into account.
Of course the intended survivor cannot know directly that its peer got
disconnected.
But what it can do is check whether it is disconnected itself
(e.g. ping connectivity to routers, test access to some web servers, ...)
and then decide to shoot with a delay, or not to shoot at all, because
starting services locally would be no good anyway.
That is the basic idea behind the fence_heuristics_ping fence agent.
There was some discussion on the list just recently about approaches
like that.
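
A rough sketch of that idea with pcs, assuming the fence_heuristics_ping
agent from fence-agents is installed; the gateway address 172.16.10.1 is
just a placeholder for whatever the nodes should be able to reach:

  # Heuristic "device" that only checks connectivity; it never powers
  # anything off.
  pcs stonith create ping-heuristic fence_heuristics_ping \
      ping_targets="172.16.10.1"

  # Put the heuristic first on the same fencing level as the real device,
  # so a node that has lost its own uplink will not shoot its peer.
  pcs stonith level add 1 ArcosRhel1 ping-heuristic,Fence1
  pcs stonith level add 1 ArcosRhel2 ping-heuristic,fence2
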

Regards,
Klaus
 
>> Thanks guys
>>
>> *LOGS from Node2:*
>>
>> Jul 17 13:33:27 ArcosRhel2 corosync[1048]: [TOTEM ] A processor failed,
>> forming new configuration.
> ...
>> Jul 17 13:33:30 ArcosRhel2 pengine[1083]: warning: Node ArcosRhel1 will be
>> fenced because the node is no longer part of the cluster
> ...
>> Jul 17 13:33:50 ArcosRhel2 stonith-ng[1080]:  notice: Operation 'reboot'
>> [2323] (call 2 from crmd.1084) for host 'ArcosRhel1' with device 'Fence1'
>> returned: 0 (OK)
>> Jul 17 13:33:50 ArcosRhel2 stonith-ng[1080]:  notice: Operation reboot of
>> ArcosRhel1 by ArcosRhel2 for crmd.1084 at ArcosRhel2.0426e6e1: OK
>> Jul 17 13:33:50 ArcosRhel2 crmd[1084]:  notice: Stonith operation
>> 2/12:0:0:f9418e1f-1f13-4033-9eaa-aec705f807ef: OK (0)
>> Jul 17 13:33:50 ArcosRhel2 crmd[1084]:  notice: Peer ArcosRhel1 was
>> terminated (reboot) by ArcosRhel2 for ArcosRhel2: OK
> ...
>>
>>
>> *LOGS from NODE1*
>> Jul 17 13:33:26 ArcoSRhel1 corosync[1464]: [TOTEM ] A processor failed,
>> forming new configuration....
>> Jul 17 13:33:28 ArcoSRhel1 pengine[1476]: warning: Node ArcosRhel2 will be
>> fenced because the node is no longer part of the cluster
> ...
>> Jul 17 13:33:28 ArcoSRhel1 stonith-ng[1473]: warning: Mapping action='off'
>> to pcmk_reboot_action='off'
>> Jul 17 13:33:28 ArcoSRhel1 stonith-ng[1473]:  notice: Fence1 can not fence
>> (reboot) ArcosRhel2: static-list
>> Jul 17 13:33:28 ArcoSRhel1 stonith-ng[1473]:  notice: fence2 can fence
>> (reboot) ArcosRhel2: static-list
>> Jul 17 13:33:28 ArcoSRhel1 stonith-ng[1473]:  notice: Fence1 can not fence
>> (reboot) ArcosRhel2: static-list
>> Jul 17 13:33:28 ArcoSRhel1 stonith-ng[1473]:  notice: fence2 can fence
>> (reboot) ArcosRhel2: static-list
>> Jul 17 13:33:46 ArcoSRhel1 fence_vmware_soap: Unable to connect/login to
>> fencing device
>> Jul 17 13:33:46 ArcoSRhel1 stonith-ng[1473]: warning:
>> fence_vmware_soap[7157] stderr: [ Unable to connect/login to fencing device
>> ]
>> Jul 17 13:33:46 ArcoSRhel1 stonith-ng[1473]: warning:
>> fence_vmware_soap[7157] stderr: [  ]
>> Jul 17 13:33:46 ArcoSRhel1 stonith-ng[1473]: warning:
>> fence_vmware_soap[7157] stderr: [  ]
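
The "Unable to connect/login to fencing device" above can usually be
reproduced outside the cluster by running the agent by hand from ArcosRhel1
with the same parameters configured for fence2 (a sketch using the values
from the config below):

  # Check connectivity and credentials to the ESX host used by fence2.
  fence_vmware_soap --ip 172.16.10.152 --username admin \
      --password 123pass --ssl-insecure --action list

  # If that works, check the status of the specific VM/plug:
  fence_vmware_soap --ip 172.16.10.152 --username admin \
      --password 123pass --ssl-insecure \
      --plug "ArcosRhel2(Ben)" --action status
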
>>
>>
>>
>>
>>
>>
>>>> See my config below:
>>>>
>>>> [root@ArcosRhel2 cluster]# pcs config
>>>> Cluster Name: ARCOSCLUSTER
>>>> Corosync Nodes:
>>>>  ArcosRhel1 ArcosRhel2
>>>> Pacemaker Nodes:
>>>>  ArcosRhel1 ArcosRhel2
>>>>
>>>> Resources:
>>>>  Resource: ClusterIP (class=ocf provider=heartbeat type=IPaddr2)
>>>>   Attributes: cidr_netmask=32 ip=172.16.10.243
>>>>   Operations: monitor interval=30s (ClusterIP-monitor-interval-30s)
>>>>               start interval=0s timeout=20s (ClusterIP-start-interval-0s)
>>>>               stop interval=0s timeout=20s (ClusterIP-stop-interval-0s)
>>>> Stonith Devices:
>>>>  Resource: Fence1 (class=stonith type=fence_vmware_soap)
>>>>   Attributes: action=off ipaddr=172.16.10.151 login=admin passwd=123pass
>>>>     pcmk_host_list=ArcosRhel1 pcmk_monitor_timeout=60s
>>>>     port=ArcosRhel1(Joniel) ssl_insecure=1 pcmk_delay_max=0s
>>>>   Operations: monitor interval=60s (Fence1-monitor-interval-60s)
>>>>  Resource: fence2 (class=stonith type=fence_vmware_soap)
>>>>   Attributes: action=off ipaddr=172.16.10.152 login=admin passwd=123pass
>>>>     pcmk_delay_max=0s pcmk_host_list=ArcosRhel2 pcmk_monitor_timeout=60s
>>>>     port=ArcosRhel2(Ben) ssl_insecure=1
>>>>   Operations: monitor interval=60s (fence2-monitor-interval-60s)
>>>> Fencing Levels:
>>>>
>>>> Location Constraints:
>>>>   Resource: Fence1
>>>>     Enabled on: ArcosRhel2 (score:INFINITY) (id:location-Fence1-ArcosRhel2-INFINITY)
>>>>   Resource: fence2
>>>>     Enabled on: ArcosRhel1 (score:INFINITY) (id:location-fence2-ArcosRhel1-INFINITY)
>>>> Ordering Constraints:
>>>> Colocation Constraints:
>>>> Ticket Constraints:
>>>>
>>>> Alerts:
>>>>  No alerts defined
>>>>
>>>> Resources Defaults:
>>>>  No defaults set
>>>> Operations Defaults:
>>>>  No defaults set
>>>>
>>>> Cluster Properties:
>>>>  cluster-infrastructure: corosync
>>>>  cluster-name: ARCOSCLUSTER
>>>>  dc-version: 1.1.16-12.el7-94ff4df
>>>>  have-watchdog: false
>>>>  last-lrm-refresh: 1531810841
>>>>  stonith-enabled: true
>>>>
>>>> Quorum:
>>>>   Options:
>>
>>



