[Pacemaker] node1 fencing itself after node2 being fenced
Andrew Beekhof
andrew at beekhof.net
Tue Feb 18 00:30:05 UTC 2014
On 18 Feb 2014, at 5:52 am, Asgaroth <lists at blueface.com> wrote:
>> -----Original Message-----
>> From: Andrew Beekhof [mailto:andrew at beekhof.net]
>> Sent: 17 February 2014 00:55
>> To: lists at blueface.com; The Pacemaker cluster resource manager
>> Subject: Re: [Pacemaker] node1 fencing itself after node2 being fenced
>>
>>
>> If you have configured cman to use fence_pcmk, then all cman/dlm/clvmd
>> fencing operations are sent to Pacemaker.
>> If you aren't running pacemaker, then you have a big problem as no-one can
>> perform fencing.
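(For illustration, the usual way to wire that up is a cluster.conf fencing
stanza that points every node at fence_pcmk; a minimal sketch, using the
node names from this thread:

  <clusternode name="test01" nodeid="1">
    <fence>
      <method name="pcmk-redirect">
        <device name="pcmk" port="test01"/>
      </method>
    </fence>
  </clusternode>
  ...
  <fencedevices>
    <fencedevice name="pcmk" agent="fence_pcmk"/>
  </fencedevices>

fence_pcmk then forwards any fence request originating from cman/dlm to
Pacemaker, which carries it out with its own stonith devices,
fence_vmware_soap in this setup.)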
>
> I have configured pacemaker as the resource manager and I have it enabled to
> start on boot-up too as follows:
>
> chkconfig cman on
> chkconfig clvmd on
> chkconfig pacemaker on
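(A quick way to confirm all three are enabled for the default runlevels
on a RHEL 6-style init:

  chkconfig --list | egrep 'cman|clvmd|pacemaker')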
>
>>
>> I don't know if you are testing without pacemaker running, but if so you
>> would need to configure cman with real fencing devices.
>>
>
> I have been testing with pacemaker running, and fencing appears to be
> operating fine. The issue I seem to have is that clvmd is unable to
> re-acquire its locks when rejoining the cluster after a fence operation,
> so it just hangs when the startup script fires it off on boot-up. While
> the 3rd node is in this state (hung clvmd), the other 2 nodes are unable
> to obtain locks from it. As an example, this is what happens when the
> 3rd node is hung at the clvmd startup phase after pacemaker has issued a
> fence operation (running pvs on node1):
The 3rd node should be (and needs to be) fenced at this point to allow the cluster to continue.
Is this not happening?
Did you specify on-fail=fence for the clvmd agent?
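For reference, a sketch of how that could look with pcs, assuming clvmd is
managed as a cloned Pacemaker resource (the resource name and lsb agent
below are illustrative, not taken from your configuration):

  pcs resource create clvmd lsb:clvmd \
      op monitor interval=30s on-fail=fence
  pcs resource clone clvmd interleave=true

Note that if Pacemaker manages clvmd this way, you would normally not also
"chkconfig clvmd on": a node whose init-script clvmd hangs never gets far
enough for Pacemaker to monitor it, let alone fence it.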
>
> [root at test01 ~]# pvs
> Error locking on node test03: Command timed out
> Unable to obtain global lock.
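(As an aside, a wedged clvmd at boot can at least be bounded: clvmd has a
startup timeout option, -T, and the stock RHEL 6 init script passes one
that can be overridden; a sketch, assuming the usual sysconfig file and
the CLVMDOPTS variable name used by that script:

  # /etc/sysconfig/clvmd
  CLVMDOPTS="-T30"

This only makes the hang fail fast; it does not fix whatever is stopping
clvmd from re-acquiring its locks.)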
>
> The dlm elements look fine to me here too:
>
> [root at test01 ~]# dlm_tool ls
> dlm lockspaces
> name cdr
> id 0xa8054052
> flags 0x00000008 fs_reg
> change member 2 joined 0 remove 1 failed 1 seq 2,2
> members 1 2
>
> name clvmd
> id 0x4104eefa
> flags 0x00000000
> change member 3 joined 1 remove 0 failed 0 seq 3,3
> members 1 2 3
>
> So it looks like cman/dlm are operating properly; however, clvmd hangs
> and never exits, so pacemaker never starts on the 3rd node. The 3rd node
> therefore sits in "pending" state while clvmd is hung:
>
> [root at test02 ~]# crm_mon -Afr -1
> Last updated: Mon Feb 17 15:52:28 2014
> Last change: Mon Feb 17 15:43:16 2014 via cibadmin on test01
> Stack: cman
> Current DC: test02 - partition with quorum
> Version: 1.1.10-14.el6_5.2-368c726
> 3 Nodes configured
> 15 Resources configured
>
>
> Node test03: pending
> Online: [ test01 test02 ]
>
> Full list of resources:
>
> fence_test01 (stonith:fence_vmware_soap): Started test01
> fence_test02 (stonith:fence_vmware_soap): Started test02
> fence_test03 (stonith:fence_vmware_soap): Started test01
> Clone Set: fs_cdr-clone [fs_cdr]
>     Started: [ test01 test02 ]
>     Stopped: [ test03 ]
> Resource Group: sftp01-vip
>     vip-001 (ocf::heartbeat:IPaddr2): Started test01
>     vip-002 (ocf::heartbeat:IPaddr2): Started test01
> Resource Group: sftp02-vip
>     vip-003 (ocf::heartbeat:IPaddr2): Started test02
>     vip-004 (ocf::heartbeat:IPaddr2): Started test02
> Resource Group: sftp03-vip
>     vip-005 (ocf::heartbeat:IPaddr2): Started test02
>     vip-006 (ocf::heartbeat:IPaddr2): Started test02
> sftp01 (lsb:sftp01): Started test01
> sftp02 (lsb:sftp02): Started test02
> sftp03 (lsb:sftp03): Started test02
>
> Node Attributes:
> * Node test01:
> * Node test02:
> * Node test03:
>
> Migration summary:
> * Node test03:
> * Node test02:
> * Node test01:
>
>