[ClusterLabs] Antw: Re: What is the logic when two nodes are down at the same time and need to be fenced

Ulrich Windl Ulrich.Windl at rz.uni-regensburg.de
Tue Nov 8 09:08:16 UTC 2016


>>> Niu Sibo <niusibo at linux.vnet.ibm.com> wrote on 07.11.2016 at 16:59 in
message <5820A4CC.9030001 at linux.vnet.ibm.com>:
> Hi Ken,
> 
> Thanks for the clarification. Now I have another real problem that needs 
> your advice.
> 
> The cluster consists of 5 nodes, and one of the nodes had a 1-second 
> network failure, which resulted in one of the VirtualDomain resources 
> starting on two nodes at the same time. The cluster property 
> no_quorum_policy is set to stop.
> 
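As a sanity check, it is worth confirming what the CIB really has configured
(note that the CIB spells the property with dashes). A minimal sketch, assuming
crm_attribute is available and, if you manage the cluster with pcs, the pcs
property command:

  crm_attribute --type crm_config --name no-quorum-policy --query
  pcs property show no-quorum-policy
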
> At 16:13:34, this happened:
> 16:13:34 zs95kj attrd[133000]:  notice: crm_update_peer_proc: Node 
> zs93KLpcs1[5] - state is now lost (was member)
> 16:13:34 zs95kj corosync[132974]:  [CPG   ] left_list[0] 
> group:pacemakerd\x00, ip:r(0) ip(10.20.93.13) , pid:28721
> 16:13:34 zs95kj crmd[133002]: warning: No match for shutdown action on 5

Usually the node would be fenced now. In the meantime the node might _try_ to stop the resources.
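
If fencing of zs93KLpcs1 was actually requested, the fencer should remember it;
something along these lines should show that (stonith_admin is standard, the
node name is taken from your logs, and the syslog path is an assumption):

  stonith_admin --history zs93KLpcs1
  grep stonith-ng /var/log/messages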

> 16:13:34 zs95kj attrd[133000]:  notice: Removing all zs93KLpcs1 
> attributes for attrd_peer_change_cb
> 16:13:34 zs95kj corosync[132974]:  [CPG   ] left_list_entries:1
> 16:13:34 zs95kj crmd[133002]:  notice: Stonith/shutdown of zs93KLpcs1 
> not matched
> ...
> 16:13:35 zs95kj attrd[133000]:  notice: crm_update_peer_proc: Node 
> zs93KLpcs1[5] - state is now member (was (null))

Where are the logs from the other node? I don't see where resources are _started_.
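
Something like the following, run on zs93KLpcs1 and zs90kppcs1, would show every
start/stop the local resource daemon performed for that guest (the syslog path
is an assumption, adjust to your logging setup):

  grep -E 'zs95kjg110187_res.*(start|stop)' /var/log/messages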


> 
>  From the DC:
> [root at zs95kj ~]# crm_simulate --xml-file 
> /var/lib/pacemaker/pengine/pe-input-3288.bz2 |grep 110187
>   zs95kjg110187_res      (ocf::heartbeat:VirtualDomain): Started 
> zs93KLpcs1     <---------- This is the baseline where everything works normally
> 
> [root at zs95kj ~]# crm_simulate --xml-file 
> /var/lib/pacemaker/pengine/pe-input-3289.bz2 |grep 110187
>   zs95kjg110187_res      (ocf::heartbeat:VirtualDomain): Stopped 
> <----------- Here the node zs93KLpcs1 lost its network for 1 second, which 
> resulted in this state.
> 
> [root at zs95kj ~]# crm_simulate --xml-file 
> /var/lib/pacemaker/pengine/pe-input-3290.bz2 |grep 110187
>   zs95kjg110187_res      (ocf::heartbeat:VirtualDomain): Stopped
> 
> [root at zs95kj ~]# crm_simulate --xml-file 
> /var/lib/pacemaker/pengine/pe-input-3291.bz2 |grep 110187
>   zs95kjg110187_res      (ocf::heartbeat:VirtualDomain): Stopped
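
crm_simulate can also replay what the policy engine intended to do with each of
those files, not just the resulting resource state; a sketch, reusing one of the
paths you quoted above:

  crm_simulate --xml-file /var/lib/pacemaker/pengine/pe-input-3289.bz2 --simulate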
> 
> 
>  From the DC's pengine log, it has:
> 16:05:01 zs95kj pengine[133001]:  notice: Calculated Transition 238: 
> /var/lib/pacemaker/pengine/pe-input-3288.bz2
> ...
> 16:13:41 zs95kj pengine[133001]:  notice: Start 
> zs95kjg110187_res#011(zs90kppcs1)
> ...
> 16:13:41 zs95kj pengine[133001]:  notice: Calculated Transition 239: 
> /var/lib/pacemaker/pengine/pe-input-3289.bz2
> 
>  From the DC's CRMD log, it has:
> Sep  9 16:05:25 zs95kj crmd[133002]:  notice: Transition 238 
> (Complete=48, Pending=0, Fired=0, Skipped=0, Incomplete=0, 
> Source=/var/lib/pacemaker/pengine/pe-input-3288.bz2): Complete
> ...
> Sep  9 16:13:42 zs95kj crmd[133002]:  notice: Initiating action 752: 
> start zs95kjg110187_res_start_0 on zs90kppcs1
> ...
> Sep  9 16:13:56 zs95kj crmd[133002]:  notice: Transition 241 
> (Complete=81, Pending=0, Fired=0, Skipped=172, Incomplete=341, 
> Source=/var/lib/pacemaker/pengine/pe-input-3291.bz2): Stopped
> 
> Here I do not see any log about pe-input-3289.bz2 and pe-input-3290.bz2. 
> Why is this?
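
One possibility (hedged, from the quoted logs alone): a transition that is
aborted almost immediately, because new membership events arrive while it is
still being set up, can be superseded by the next calculation before much of it
runs. Grepping the DC's log for the missing transition numbers and for abort
messages should confirm or rule that out (the syslog path is an assumption):

  grep -E 'Transition (239|240)' /var/log/messages
  grep 'Transition aborted' /var/log/messages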
> 
>  From the log on zs93KLpcs1 where guest 110187 was running, I do not see 
> any message regarding stopping this resource after it lost its 
> connection to the cluster.
> 
> Any ideas on where to look for the possible cause?
> 
> On 11/3/2016 1:02 AM, Ken Gaillot wrote:
>> On 11/02/2016 11:17 AM, Niu Sibo wrote:
>>> Hi all,
>>>
>>> I have a general question regarding the fencing logic in Pacemaker.
>>>
>>> I have set up a three-node cluster with Pacemaker 1.1.13 and the cluster
>>> property no_quorum_policy set to ignore. When two nodes lose the NIC that
>>> corosync is running on at the same time, it looks like the two nodes are
>>> getting fenced one by one, even though I have three fence devices defined,
>>> one for each node.
>>>
>>> What should I be expecting in this case?
>> It's probably coincidence that the fencing happens serially; there is
>> nothing enforcing that for separate fence devices. There are many steps
>> in a fencing request, so they can easily take different times to complete.
>>
>>> I noticed that if the node rejoins the cluster before the cluster starts
>>> the fence actions, some resources will get activated on 2 nodes at the
>>> same time. This is really not good if the resource happens to be a
>>> virtual guest. Thanks for any suggestions.
>> Since you're ignoring quorum, there's nothing stopping the disconnected
>> node from starting all resources on its own. It can even fence the other
>> nodes, unless the downed NIC is used for fencing. From that node's point
>> of view, it's the other two nodes that are lost.
>>
>> Quorum is the only solution I know of to prevent that. Fencing will
>> correct the situation, but it won't prevent it.
>>
>> See the votequorum(5) man page for various options that can affect how
>> quorum is calculated. Also, the very latest version of corosync supports
>> qdevice (a lightweight daemon that runs on a host outside the cluster
>> strictly for the purposes of quorum).
>>
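
For reference, both the votequorum tunables and qdevice live in the quorum
section of corosync.conf. A minimal sketch only; the host name is a placeholder,
and the qdevice part additionally needs the corosync-qdevice package on the
cluster nodes plus corosync-qnetd on the external host:

quorum {
    provider: corosync_votequorum
    # cluster only becomes quorate the first time after all nodes have been seen
    wait_for_all: 1
    # optional arbitrator outside the cluster
    device {
        model: net
        votes: 1
        net {
            host: qnetd.example.com
            algorithm: ffsplit
        }
    }
}
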
> _______________________________________________
> Users mailing list: Users at clusterlabs.org 
> http://clusterlabs.org/mailman/listinfo/users 
> 
> Project Home: http://www.clusterlabs.org 
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf 
> Bugs: http://bugs.clusterlabs.org 