[Pacemaker] node1 fencing itself after node2 being fenced
Andrew Beekhof
andrew at beekhof.net
Tue Feb 18 00:30:05 UTC 2014
On 18 Feb 2014, at 5:52 am, Asgaroth <lists at blueface.com> wrote:
>> -----Original Message-----
>> From: Andrew Beekhof [mailto:andrew at beekhof.net]
>> Sent: 17 February 2014 00:55
>> To: lists at blueface.com; The Pacemaker cluster resource manager
>> Subject: Re: [Pacemaker] node1 fencing itself after node2 being fenced
>>
>>
>> If you have configured cman to use fence_pcmk, then all cman/dlm/clvmd
>> fencing operations are sent to Pacemaker.
>> If you aren't running pacemaker, then you have a big problem as no-one can
>> perform fencing.
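(For illustration, the usual way to wire that up is a cluster.conf fencing
stanza that points every node at fence_pcmk; a minimal sketch, using the
node names from this thread:

  <clusternode name="test01" nodeid="1">
    <fence>
      <method name="pcmk-redirect">
        <device name="pcmk" port="test01"/>
      </method>
    </fence>
  </clusternode>
  ...
  <fencedevices>
    <fencedevice name="pcmk" agent="fence_pcmk"/>
  </fencedevices>

fence_pcmk then forwards any fence request originating from cman/dlm to
Pacemaker, which carries it out with its own stonith devices,
fence_vmware_soap in this setup.)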
>
> I have configured pacemaker as the resource manager and I have it enabled to
> start on boot-up too as follows:
>
> chkconfig cman on
> chkconfig clvmd on
> chkconfig pacemaker on
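(A quick way to confirm all three are enabled for the default runlevels
on a RHEL 6-style init:

  chkconfig --list | egrep 'cman|clvmd|pacemaker')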
>
>>
>> I don't know if you are testing without pacemaker running, but if so you
>> would need to configure cman with real fencing devices.
>>
>
> I have been testing with pacemaker running, and fencing appears to be
> operating fine. The issue I seem to have is that clvmd is unable to
> re-acquire its locks when rejoining the cluster after a fence operation,
> so it just hangs when the startup script fires it off on boot-up. While
> the 3rd node is in this state (hung clvmd), the other 2 nodes are unable
> to obtain locks from it. As an example, this is what happens when the
> 3rd node is hung at the clvmd startup phase after pacemaker has issued a
> fence operation (running pvs on node1):
The 3rd node should be (and needs to be) fenced at this point to allow the cluster to continue.
Is this not happening?
Did you specify on-fail=fence for the clvmd agent?
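For reference, a sketch of how that could look with pcs, assuming clvmd is
managed as a cloned Pacemaker resource (the resource name and lsb agent
below are illustrative, not taken from your configuration):

  pcs resource create clvmd lsb:clvmd \
      op monitor interval=30s on-fail=fence
  pcs resource clone clvmd interleave=true

Note that if Pacemaker manages clvmd this way, you would normally not also
"chkconfig clvmd on": a node whose init-script clvmd hangs never gets far
enough for Pacemaker to monitor it, let alone fence it.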
>
> [root at test01 ~]# pvs
> Error locking on node test03: Command timed out
> Unable to obtain global lock.
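(As an aside, a wedged clvmd at boot can at least be bounded: clvmd has a
startup timeout option, -T, and the stock RHEL 6 init script passes one
that can be overridden; a sketch, assuming the usual sysconfig file and
the CLVMDOPTS variable name used by that script:

  # /etc/sysconfig/clvmd
  CLVMDOPTS="-T30"

This only makes the hang fail fast; it does not fix whatever is stopping
clvmd from re-acquiring its locks.)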
>
> The dlm elements look fine to me here too:
>
> [root at test01 ~]# dlm_tool ls
> dlm lockspaces
> name cdr
> id 0xa8054052
> flags 0x00000008 fs_reg
> change member 2 joined 0 remove 1 failed 1 seq 2,2
> members 1 2
>
> name clvmd
> id 0x4104eefa
> flags 0x00000000
> change member 3 joined 1 remove 0 failed 0 seq 3,3
> members 1 2 3
>
> So it looks like cman/dlm are operating properly; however, clvmd hangs
> and never exits, so pacemaker never starts on the 3rd node. The 3rd node
> therefore sits in "pending" state while clvmd is hung:
>
> [root at test02 ~]# crm_mon -Afr -1
> Last updated: Mon Feb 17 15:52:28 2014
> Last change: Mon Feb 17 15:43:16 2014 via cibadmin on test01
> Stack: cman
> Current DC: test02 - partition with quorum
> Version: 1.1.10-14.el6_5.2-368c726
> 3 Nodes configured
> 15 Resources configured
>
>
> Node test03: pending
> Online: [ test01 test02 ]
>
> Full list of resources:
>
> fence_test01 (stonith:fence_vmware_soap): Started test01
> fence_test02 (stonith:fence_vmware_soap): Started test02
> fence_test03 (stonith:fence_vmware_soap): Started test01
> Clone Set: fs_cdr-clone [fs_cdr]
>     Started: [ test01 test02 ]
>     Stopped: [ test03 ]
> Resource Group: sftp01-vip
>     vip-001 (ocf::heartbeat:IPaddr2): Started test01
>     vip-002 (ocf::heartbeat:IPaddr2): Started test01
> Resource Group: sftp02-vip
>     vip-003 (ocf::heartbeat:IPaddr2): Started test02
>     vip-004 (ocf::heartbeat:IPaddr2): Started test02
> Resource Group: sftp03-vip
>     vip-005 (ocf::heartbeat:IPaddr2): Started test02
>     vip-006 (ocf::heartbeat:IPaddr2): Started test02
> sftp01 (lsb:sftp01): Started test01
> sftp02 (lsb:sftp02): Started test02
> sftp03 (lsb:sftp03): Started test02
>
> Node Attributes:
> * Node test01:
> * Node test02:
> * Node test03:
>
> Migration summary:
> * Node test03:
> * Node test02:
> * Node test01:
>
>