[ClusterLabs] Weird Fencing Behavior

Andrei Borzenkov arvidjaar at gmail.com
Wed Jul 18 00:22:25 EDT 2018


On 18.07.2018 04:21, Confidential Company wrote:
>>> Hi,
>>>
>>> On my two-node active/passive setup, I configured fencing via
>>> fence_vmware_soap. I configured pcmk_delay=0 on both nodes, so I
>>> expected that both nodes would be stonithed simultaneously.
>>>
>>> In my test scenario, Node1 holds the ClusterIP resource. When I
>>> physically disconnected the service/corosync link, Node1 was fenced
>>> and Node2 stayed alive, even with pcmk_delay=0 on both nodes.
>>>
>>> Can you explain the behavior above?
>>>
>>
>> #node1 could not connect to ESX because the links were disconnected;
>> #that is the most obvious explanation.
>>
>> #You have the logs; you are the only one who can answer this question
>> #with some certainty. Others can only guess.
>>
>>
>> Oops, my bad, I forgot to mention: I have two interfaces on each virtual
>> machine (node). The second interface is used for the ESX link, so the
>> fence can be executed even though the corosync link is disconnected.
>> Looking forward to your response. Thanks
> 
> #Having no fence delay means a death match (each node killing the other)
> #is possible, but it doesn't guarantee that it will happen. Some of the
> #time, one node will detect the outage and fence the other one before
> #the other one can react.
> 
> #It's basically an Old West shoot-out -- they may reach for their guns
> #at the same time, but one may be quicker.
> 
> #As Andrei suggested, the logs from both nodes could give you a timeline
> #of what happened when.
> 
> 
> Hi Andrei, kindly see the logs below. Based on the timestamps, Node1
> should have fenced Node2 first, but in the actual test, Node1 was the one
> fenced/shut down by Node2.
> 

Node1 tried to fence Node2 but failed. It could be connectivity, it could
be credentials.
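
You can check that from Node1 by running the agent by hand with the same
parameters your fence2 device uses (a quick manual check, using the values
from the config quoted below; adjust them if they are not current):

  # run on ArcosRhel1; these values come from the fence2 definition below
  fence_vmware_soap --ip 172.16.10.152 --username admin --password 123pass \
      --ssl-insecure --action status --plug "ArcosRhel2(Ben)"

Running the same command with "--action list" shows which VM names the ESX
host actually exposes, in case the port name no longer matches.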

> Is it possible to have a two-node active/passive setup in
> pacemaker/corosync where the node that gets disconnected (interface down)
> is the only one that gets fenced?
> 

If you could determine which node was disconnected, you would not need
any fencing at all.
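
The cluster cannot tell "the node that lost its link" from "the node that
is still healthy"; each side only sees that the other has gone away. What
you can do is bias the shoot-out Ken described so that one side is much
more likely to win. A sketch, assuming you want ArcosRhel1 (the node
holding ClusterIP) to survive a split:

  # delay fencing of ArcosRhel1 so it gets to fence ArcosRhel2 first
  pcs stonith update Fence1 pcmk_delay_max=15

pcmk_delay_max adds a random delay up to that value before the device
fires; newer Pacemaker releases also have pcmk_delay_base for a fixed,
deterministic delay on one device.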

> Thanks guys
> 
> *LOGS from Node2:*
> 
> Jul 17 13:33:27 ArcosRhel2 corosync[1048]: [TOTEM ] A processor failed,
> forming new configuration.
...
> Jul 17 13:33:30 ArcosRhel2 pengine[1083]: warning: Node ArcosRhel1 will be
> fenced because the node is no longer part of the cluster
...
> Jul 17 13:33:50 ArcosRhel2 stonith-ng[1080]:  notice: Operation 'reboot'
> [2323] (call 2 from crmd.1084) for host 'ArcosRhel1' with device 'Fence1'
> returned: 0 (OK)
> Jul 17 13:33:50 ArcosRhel2 stonith-ng[1080]:  notice: Operation reboot of
> ArcosRhel1 by ArcosRhel2 for crmd.1084@ArcosRhel2.0426e6e1: OK
> Jul 17 13:33:50 ArcosRhel2 crmd[1084]:  notice: Stonith operation
> 2/12:0:0:f9418e1f-1f13-4033-9eaa-aec705f807ef: OK (0)
> Jul 17 13:33:50 ArcosRhel2 crmd[1084]:  notice: Peer ArcosRhel1 was
> terminated (reboot) by ArcosRhel2 for ArcosRhel2: OK
...
> 
> 
> 
> *LOGS from NODE1*
> Jul 17 13:33:26 ArcoSRhel1 corosync[1464]: [TOTEM ] A processor failed,
> forming new configuration.
...
> Jul 17 13:33:28 ArcoSRhel1 pengine[1476]: warning: Node ArcosRhel2 will be
> fenced because the node is no longer part of the cluster
...
> Jul 17 13:33:28 ArcoSRhel1 stonith-ng[1473]: warning: Mapping action='off'
> to pcmk_reboot_action='off'
> Jul 17 13:33:28 ArcoSRhel1 stonith-ng[1473]:  notice: Fence1 can not fence
> (reboot) ArcosRhel2: static-list
> Jul 17 13:33:28 ArcoSRhel1 stonith-ng[1473]:  notice: fence2 can fence
> (reboot) ArcosRhel2: static-list
> Jul 17 13:33:28 ArcoSRhel1 stonith-ng[1473]:  notice: Fence1 can not fence
> (reboot) ArcosRhel2: static-list
> Jul 17 13:33:28 ArcoSRhel1 stonith-ng[1473]:  notice: fence2 can fence
> (reboot) ArcosRhel2: static-list
> Jul 17 13:33:46 ArcoSRhel1 fence_vmware_soap: Unable to connect/login to
> fencing device
> Jul 17 13:33:46 ArcoSRhel1 stonith-ng[1473]: warning:
> fence_vmware_soap[7157] stderr: [ Unable to connect/login to fencing device
> ]
> Jul 17 13:33:46 ArcoSRhel1 stonith-ng[1473]: warning:
> fence_vmware_soap[7157] stderr: [  ]
> Jul 17 13:33:46 ArcoSRhel1 stonith-ng[1473]: warning:
> fence_vmware_soap[7157] stderr: [  ]
> 
> 
> 
> 
> 
> 
>>> See my config below:
>>>
>>> [root@ArcosRhel2 cluster]# pcs config
>>> Cluster Name: ARCOSCLUSTER
>>> Corosync Nodes:
>>>  ArcosRhel1 ArcosRhel2
>>> Pacemaker Nodes:
>>>  ArcosRhel1 ArcosRhel2
>>>
>>> Resources:
>>>  Resource: ClusterIP (class=ocf provider=heartbeat type=IPaddr2)
>>>   Attributes: cidr_netmask=32 ip=172.16.10.243
>>>   Operations: monitor interval=30s (ClusterIP-monitor-interval-30s)
>>>               start interval=0s timeout=20s (ClusterIP-start-interval-0s)
>>>               stop interval=0s timeout=20s (ClusterIP-stop-interval-0s)
>>>
>>> Stonith Devices:
>>>  Resource: Fence1 (class=stonith type=fence_vmware_soap)
>>>   Attributes: action=off ipaddr=172.16.10.151 login=admin passwd=123pass
>>>     pcmk_host_list=ArcosRhel1 pcmk_monitor_timeout=60s
>>>     port=ArcosRhel1(Joniel) ssl_insecure=1 pcmk_delay_max=0s
>>>   Operations: monitor interval=60s (Fence1-monitor-interval-60s)
>>>  Resource: fence2 (class=stonith type=fence_vmware_soap)
>>>   Attributes: action=off ipaddr=172.16.10.152 login=admin passwd=123pass
>>>     pcmk_delay_max=0s pcmk_host_list=ArcosRhel2 pcmk_monitor_timeout=60s
>>>     port=ArcosRhel2(Ben) ssl_insecure=1
>>>   Operations: monitor interval=60s (fence2-monitor-interval-60s)
>>> Fencing Levels:
>>>
>>> Location Constraints:
>>>   Resource: Fence1
>>>     Enabled on: ArcosRhel2 (score:INFINITY)
>>>       (id:location-Fence1-ArcosRhel2-INFINITY)
>>>   Resource: fence2
>>>     Enabled on: ArcosRhel1 (score:INFINITY)
>>>       (id:location-fence2-ArcosRhel1-INFINITY)
>>> Ordering Constraints:
>>> Colocation Constraints:
>>> Ticket Constraints:
>>>
>>> Alerts:
>>>  No alerts defined
>>>
>>> Resources Defaults:
>>>  No defaults set
>>> Operations Defaults:
>>>  No defaults set
>>>
>>> Cluster Properties:
>>>  cluster-infrastructure: corosync
>>>  cluster-name: ARCOSCLUSTER
>>>  dc-version: 1.1.16-12.el7-94ff4df
>>>  have-watchdog: false
>>>  last-lrm-refresh: 1531810841
>>>  stonith-enabled: true
>>>
>>> Quorum:
>>>   Options:
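
One more thing worth noting in the config above: setting action=off
directly on the devices is what triggers the "Mapping action='off' to
pcmk_reboot_action='off'" warning in the Node1 log. If you really want the
victims powered off rather than rebooted, the usual approach (a sketch,
untested here) is to drop the per-device action attribute and set the
behavior cluster-wide instead:

  # clear the deprecated per-device attribute (empty value removes it)
  pcs stonith update Fence1 action=
  pcs stonith update fence2 action=
  # make all fencing power the victim off instead of rebooting it
  pcs property set stonith-action=off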
> 
> 
> 
> _______________________________________________
> Users mailing list: Users at clusterlabs.org
> https://lists.clusterlabs.org/mailman/listinfo/users
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
> 



