[Pacemaker] "stonith_admin -F node" results in a pair of reboots

Wed Jan 1 01:04:57 EST 2014

Did you hook DRBD into pacemaker's fencing using 'crm-fence-peer.sh' and 
set the fencing policy to 'resource-and-stonith;'? If not, do so! It 
will protect against split-brains.
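
For reference, a minimal sketch of the two settings mentioned above, 
assuming DRBD 8.4 syntax. The resource name "r0" is hypothetical, and 
the handler paths are the usual packaged locations, which may differ on 
your system. The unfence handler is the usual companion to the fence 
handler; it removes the constraint once the peer has resynced.

```
# Sketch only; "r0" is a placeholder resource name.
resource r0 {
  disk {
    # Freeze I/O and call the fence-peer handler when the peer is lost.
    fencing resource-and-stonith;
  }
  handlers {
    # Assumed install paths; check where your DRBD package puts them.
    fence-peer          "/usr/lib/drbd/crm-fence-peer.sh";
    after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
  }
}
```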

digimer

On 01/01/14 01:03 AM, Bob Haxo wrote:
> Digimer,
>
> Ok, sounds reasonable and I will investigate this further on Jan 2.  WRT
> DRBD ... geeee, I don't recall multiple fencings.  I'll check that also
> on Jan 2.
>
> Emmanuel,
>
> I have not seen pending fencing operations with "dlm_tool ls" ... but I
> have seen the word "pending" elsewhere (crm_mon?) without considering
> that it might be fencing that is pending. Interesting.
>
> Thanks & my best wishes for a healthy new year.
> Bob Haxo
>
>
> On Wed, 2014-01-01 at 00:19 -0500, Digimer wrote:
>> This is probably because cman (which is its own cluster stack and used
>> to provide DLM and quorum to pacemaker on EL6) detected the node failed
>> after the initial fence and called its own fence. You see a similar
>> behaviour when using DRBD. It will also call a fence when the peer dies
>> (even when it died because of a controlled fence call). In theory,
>> pacemaker using cman's dlm with DRBD would trigger three fences per
>> failure. :)
>>
>> digimer
>>
>> On 01/01/14 12:04 AM, emmanuel segura wrote:
>> > Maybe you are missing the logs from when you fenced the node? I think
>> > clvmd hung because your node is in an unclean state. Use dlm_tool ls to
>> > see if you have any pending fencing operations.
>> >
>> >
>> > 2014/1/1 Bob Haxo <bhaxo at sgi.com>
>> >
>> >     Greetings ... Happy New Year!
>> >
>> >     I am testing a configuration that is created from the example in
>> >     "Chapter 6. Configuring a GFS2 File System in a Cluster" of the "Red
>> >     Hat Enterprise Linux 7.0 Beta Global File System 2" document.  The only
>> >     addition is stonith:fence_ipmilan.  After encountering this issue
>> >     when I configured with "crm", I re-configured using "pcs". I've
>> >     included the configuration below.
>> >
>> >     I'm thinking that, in a 2-node cluster, if I run "stonith_admin -F
>> >     <peer-node>", then <peer-node> should reboot and cleanly rejoin the
>> >     cluster.  This is not happening.
>> >
>> >     What ultimately happens is that after the initially fenced node
>> >     reboots, the system from which the stonith_admin -F command was run
>> >     is fenced and reboots. The fencing stops there, leaving the cluster
>> >     in an appropriate state.
>> >
>> >     The issue seems to reside with clvmd/lvm.  With the reboot of the
>> >     initially fenced node, the clvmd resource fails on the surviving
>> >     node, with a maximum of errors.  I hypothesize there is an issue
>> >     with locks, but have insufficient knowledge of clvmd/lvm locks to
>> >     prove or disprove this hypothesis.
>> >
>> >     Have I missed something ...
>> >
>> >     1) Is this expected behavior, so that the fencing node is always
>> >     rebooted as well?
>> >
>> >     2) Or, maybe I didn't correctly duplicate the Chapter 6 example?
>> >
>> >     3) Or, perhaps something is wrong or omitted from the Chapter 6 example?
>> >
>> >     Suggestions will be much appreciated.
>> >
>> >     Thanks,
>> >     Bob Haxo
>> >
>> >     RHEL6.5
>> >     pacemaker-cli-1.1.10-14.el6_5.1.x86_64
>> >     crmsh-1.2.5-55.1sgi709r3.rhel6.x86_64
>> >     pacemaker-libs-1.1.10-14.el6_5.1.x86_64
>> >     cman-3.0.12.1-59.el6_5.1.x86_64
>> >     pacemaker-1.1.10-14.el6_5.1.x86_64
>> >     corosynclib-1.4.1-17.el6.x86_64
>> >     corosync-1.4.1-17.el6.x86_64
>> >     pacemaker-cluster-libs-1.1.10-14.el6_5.1.x86_64
>> >
>> >     Cluster Name: mici
>> >     Corosync Nodes:
>> >
>> >     Pacemaker Nodes:
>> >     mici-admin mici-admin2
>> >
>> >     Resources:
>> >     Clone: clusterfs-clone
>> >        Meta Attrs: interleave=true target-role=Started
>> >        Resource: clusterfs (class=ocf provider=heartbeat type=Filesystem)
>> >         Attributes: device=/dev/vgha2/lv_clust2 directory=/images fstype=gfs2 options=defaults,noatime,nodiratime
>> >         Operations: monitor on-fail=fence interval=30s (clusterfs-monitor-interval-30s)
>> >     Clone: clvmd-clone
>> >        Meta Attrs: interleave=true ordered=true target-role=Started
>> >        Resource: clvmd (class=lsb type=clvmd)
>> >         Operations: monitor on-fail=fence interval=30s (clvmd-monitor-interval-30s)
>> >     Clone: dlm-clone
>> >        Meta Attrs: interleave=true ordered=true
>> >        Resource: dlm (class=ocf provider=pacemaker type=controld)
>> >         Operations: monitor on-fail=fence interval=30s (dlm-monitor-interval-30s)
>> >
>> >     Stonith Devices:
>> >     Resource: p_ipmi_fencing_1 (class=stonith type=fence_ipmilan)
>> >        Attributes: ipaddr=128.##.##.78 login=XXXXX passwd=XXXXX lanplus=1 action=reboot pcmk_host_check=static-list pcmk_host_list=mici-admin
>> >        Meta Attrs: target-role=Started
>> >        Operations: monitor start-delay=30 interval=60s timeout=30 (p_ipmi_fencing_1-monitor-60s)
>> >     Resource: p_ipmi_fencing_2 (class=stonith type=fence_ipmilan)
>> >        Attributes: ipaddr=128.##.##.220 login=XXXXX passwd=XXXXX lanplus=1 action=reboot pcmk_host_check=static-list pcmk_host_list=mici-admin2
>> >        Meta Attrs: target-role=Started
>> >        Operations: monitor start-delay=30 interval=60s timeout=30 (p_ipmi_fencing_2-monitor-60s)
>> >     Fencing Levels:
>> >
>> >     Location Constraints:
>> >        Resource: p_ipmi_fencing_1
>> >          Disabled on: mici-admin (score:-INFINITY) (id:location-p_ipmi_fencing_1-mici-admin--INFINITY)
>> >        Resource: p_ipmi_fencing_2
>> >          Disabled on: mici-admin2 (score:-INFINITY) (id:location-p_ipmi_fencing_2-mici-admin2--INFINITY)
>> >     Ordering Constraints:
>> >        start dlm-clone then start clvmd-clone (Mandatory) (id:order-dlm-clone-clvmd-clone-mandatory)
>> >        start clvmd-clone then start clusterfs-clone (Mandatory) (id:order-clvmd-clone-clusterfs-clone-mandatory)
>> >     Colocation Constraints:
>> >        clusterfs-clone with clvmd-clone (INFINITY) (id:colocation-clusterfs-clone-clvmd-clone-INFINITY)
>> >        clvmd-clone with dlm-clone (INFINITY) (id:colocation-clvmd-clone-dlm-clone-INFINITY)
>> >
>> >     Cluster Properties:
>> >     cluster-infrastructure: cman
>> >     dc-version: 1.1.10-14.el6_5.1-368c726
>> >     last-lrm-refresh: 1388530552
>> >     no-quorum-policy: ignore
>> >     stonith-enabled: true
>> >     Node Attributes:
>> >     mici-admin: standby=off
>> >     mici-admin2: standby=off
>> >
>> >
>> >     Last updated: Tue Dec 31 17:15:55 2013
>> >     Last change: Tue Dec 31 16:57:37 2013 via cibadmin on mici-admin
>> >     Stack: cman
>> >     Current DC: mici-admin2 - partition with quorum
>> >     Version: 1.1.10-14.el6_5.1-368c726
>> >     2 Nodes configured
>> >     8 Resources configured
>> >
>> >     Online: [ mici-admin mici-admin2 ]
>> >
>> >     Full list of resources:
>> >
>> >     p_ipmi_fencing_1        (stonith:fence_ipmilan):        Started mici-admin2
>> >     p_ipmi_fencing_2        (stonith:fence_ipmilan):        Started mici-admin
>> >     Clone Set: clusterfs-clone [clusterfs]
>> >           Started: [ mici-admin mici-admin2 ]
>> >     Clone Set: clvmd-clone [clvmd]
>> >           Started: [ mici-admin mici-admin2 ]
>> >     Clone Set: dlm-clone [dlm]
>> >           Started: [ mici-admin mici-admin2 ]
>> >
>> >     Migration summary:
>> >     * Node mici-admin:
>> >     * Node mici-admin2:
>> >
>> >     =====================================================
>> >     crm_mon output after the fenced node reboots.  It shows the clvmd
>> >     failure that then occurs, which in turn triggers a fencing of that node.
>> >
>> >     Last updated: Tue Dec 31 17:06:55 2013
>> >     Last change: Tue Dec 31 16:57:37 2013 via cibadmin on mici-admin
>> >     Stack: cman
>> >     Current DC: mici-admin - partition with quorum
>> >     Version: 1.1.10-14.el6_5.1-368c726
>> >     2 Nodes configured
>> >     8 Resources configured
>> >
>> >     Node mici-admin: UNCLEAN (online)
>> >     Online: [ mici-admin2 ]
>> >
>> >     Full list of resources:
>> >
>> >     p_ipmi_fencing_1        (stonith:fence_ipmilan):        Stopped
>> >     p_ipmi_fencing_2        (stonith:fence_ipmilan):        Started mici-admin
>> >     Clone Set: clusterfs-clone [clusterfs]
>> >           Started: [ mici-admin ]
>> >           Stopped: [ mici-admin2 ]
>> >     Clone Set: clvmd-clone [clvmd]
>> >           clvmd      (lsb:clvmd):    FAILED mici-admin
>> >           Stopped: [ mici-admin2 ]
>> >     Clone Set: dlm-clone [dlm]
>> >           Started: [ mici-admin mici-admin2 ]
>> >
>> >     Migration summary:
>> >     * Node mici-admin:
>> >         clvmd: migration-threshold=1000000 fail-count=1 last-failure='Tue Dec 31 17:04:29 2013'
>> >     * Node mici-admin2:
>> >
>> >     Failed actions:
>> >          clvmd_monitor_30000 on mici-admin 'unknown error' (1): call=60, status=Timed Out, last-rc-change='Tue Dec 31 17:04:29 2013', queued=0ms, exec=0ms
>> >
>> >
>> >     _______________________________________________
>> >     Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
>> >     http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>> >
>> >     Project Home: http://www.clusterlabs.org
>> >     Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>> >     Bugs: http://bugs.clusterlabs.org
>> >
>> >
>> >
>> >
>> > --
>> > esta es mi vida e me la vivo hasta que dios quiera
>> >
>> >
>> >
>>
>>
>
>
>

-- 
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without 
access to education?