[Pacemaker] "stonith_admin -F node" results in a pair of reboots

Digimer lists at alteeve.ca
Wed Jan 1 00:19:43 EST 2014


This is probably because cman (which is its own cluster stack, used to 
provide DLM and quorum to pacemaker on EL6) detected the node failure 
after the initial fence and called its own fence. You see similar 
behaviour with DRBD: it will also call a fence when the peer dies, even 
when the peer died because of a controlled fence call. In theory, 
pacemaker using cman's dlm together with DRBD would trigger three 
fences per failure. :)
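
If cman's fenced is indeed what fired the second fence, the usual way 
to avoid the duplicate on EL6 is to redirect cman's fencing to 
pacemaker with fence_pcmk, so only one stack ever calls the real fence 
agent. Roughly like this (a sketch only; node names taken from your 
config below, the "pcmk"/"pcmk-redirect" names are arbitrary, and 
cluster.conf then needs to be copied to both nodes):

   ccs -f /etc/cluster/cluster.conf --addfencedev pcmk agent=fence_pcmk
   ccs -f /etc/cluster/cluster.conf --addmethod pcmk-redirect mici-admin
   ccs -f /etc/cluster/cluster.conf --addfenceinst pcmk mici-admin pcmk-redirect port=mici-admin
   ccs -f /etc/cluster/cluster.conf --addmethod pcmk-redirect mici-admin2
   ccs -f /etc/cluster/cluster.conf --addfenceinst pcmk mici-admin2 pcmk-redirect port=mici-admin2

With that in place, cman's fenced hands fence requests over to 
pacemaker instead of firing a second, independent fence of its own.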

digimer

On 01/01/14 12:04 AM, emmanuel segura wrote:
> Maybe you are missing the logs from when you fenced the node? I think
> clvmd hung because your node is in an unclean state; use dlm_tool ls
> to see if you have any pending fencing operations.
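>
> For example (just a sketch):
>
>     dlm_tool ls                    # dlm lockspaces (clvmd, the gfs2 mount) and their state
>     dlm_tool dump | grep -i fenc   # dlm_controld log: look for pending fence requests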
>
>
> 2014/1/1 Bob Haxo <bhaxo at sgi.com>
>
>     Greetings ... Happy New Year!
>
>     I am testing a configuration created from the example in "Chapter 6.
>     Configuring a GFS2 File System in a Cluster" of the "Red Hat
>     Enterprise Linux 7.0 Beta Global File System 2" document.  The only
>     addition is stonith:fence_ipmilan.  After encountering this issue
>     when I configured with "crm", I re-configured using "pcs". I've
>     included the configuration below.
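>
>     (For reference, the pcs commands were roughly the following;
>     reconstructed from the configuration dump below rather than a
>     verbatim transcript:)
>
>       pcs resource create dlm ocf:pacemaker:controld \
>           op monitor interval=30s on-fail=fence \
>           clone interleave=true ordered=true
>       pcs resource create clvmd lsb:clvmd \
>           op monitor interval=30s on-fail=fence \
>           clone interleave=true ordered=true
>       pcs resource create clusterfs ocf:heartbeat:Filesystem \
>           device=/dev/vgha2/lv_clust2 directory=/images fstype=gfs2 \
>           options=defaults,noatime,nodiratime \
>           op monitor interval=30s on-fail=fence clone interleave=true
>       pcs constraint order start dlm-clone then start clvmd-clone
>       pcs constraint order start clvmd-clone then start clusterfs-clone
>       pcs constraint colocation add clvmd-clone with dlm-clone
>       pcs constraint colocation add clusterfs-clone with clvmd-clone
>       pcs stonith create p_ipmi_fencing_1 fence_ipmilan \
>           ipaddr=128.##.##.78 login=XXXXX passwd=XXXXX lanplus=1 \
>           action=reboot pcmk_host_check=static-list \
>           pcmk_host_list=mici-admin op monitor interval=60s
>       pcs constraint location p_ipmi_fencing_1 avoids mici-admin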
>
>     I'm thinking that, in a 2-node cluster, if I run "stonith_admin -F
>     <peer-node>", then <peer-node> should reboot and cleanly rejoin the
>     cluster.  This is not happening.
>
>     What ultimately happens is that after the initially fenced node
>     reboots, the system from which the stonith_admin -F command was run
>     is fenced and reboots. The fencing stops there, leaving the cluster
>     in an appropriate state.
>
>     The issue seems to reside with clvmd/lvm.  When the initially fenced
>     node reboots, the clvmd resource fails on the surviving node (its
>     monitor operation times out, as shown in the crm_mon output below).
>     I hypothesize there is an issue with locks, but I have insufficient
>     knowledge of clvmd/lvm locking to prove or disprove this hypothesis.
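>
>     (For anyone trying to reproduce this, the clustered-locking setup
>     can be inspected with something along these lines; a sketch only:)
>
>       grep locking_type /etc/lvm/lvm.conf   # 3 = clustered locking via clvmd
>       vgs -o vg_name,vg_attr                # 'c' in the attr column = clustered VG
>       vgdisplay vgha2 | grep -i clustered   # vgha2 holds the gfs2 LV lv_clust2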
>
>     Have I missed something ...
>
>     1) Is this expected behavior, i.e., does the node that initiated the
>     fencing always get rebooted in turn?
>
>     2) Or, maybe I didn't correctly duplicate the Chapter 6 example?
>
>     3) Or, perhaps something is wrong or omitted from the Chapter 6 example?
>
>     Suggestions will be much appreciated.
>
>     Thanks,
>     Bob Haxo
>
>     RHEL6.5
>     pacemaker-cli-1.1.10-14.el6_5.1.x86_64
>     crmsh-1.2.5-55.1sgi709r3.rhel6.x86_64
>     pacemaker-libs-1.1.10-14.el6_5.1.x86_64
>     cman-3.0.12.1-59.el6_5.1.x86_64
>     pacemaker-1.1.10-14.el6_5.1.x86_64
>     corosynclib-1.4.1-17.el6.x86_64
>     corosync-1.4.1-17.el6.x86_64
>     pacemaker-cluster-libs-1.1.10-14.el6_5.1.x86_64
>
>     Cluster Name: mici
>     Corosync Nodes:
>
>     Pacemaker Nodes:
>     mici-admin mici-admin2
>
>     Resources:
>     Clone: clusterfs-clone
>        Meta Attrs: interleave=true target-role=Started
>        Resource: clusterfs (class=ocf provider=heartbeat type=Filesystem)
>         Attributes: device=/dev/vgha2/lv_clust2 directory=/images
>     fstype=gfs2 options=defaults,noatime,nodiratime
>         Operations: monitor on-fail=fence interval=30s
>     (clusterfs-monitor-interval-30s)
>     Clone: clvmd-clone
>        Meta Attrs: interleave=true ordered=true target-role=Started
>        Resource: clvmd (class=lsb type=clvmd)
>         Operations: monitor on-fail=fence interval=30s
>     (clvmd-monitor-interval-30s)
>     Clone: dlm-clone
>        Meta Attrs: interleave=true ordered=true
>        Resource: dlm (class=ocf provider=pacemaker type=controld)
>         Operations: monitor on-fail=fence interval=30s
>     (dlm-monitor-interval-30s)
>
>     Stonith Devices:
>     Resource: p_ipmi_fencing_1 (class=stonith type=fence_ipmilan)
>        Attributes: ipaddr=128.##.##.78 login=XXXXX passwd=XXXXX
>     lanplus=1 action=reboot pcmk_host_check=static-list
>     pcmk_host_list=mici-admin
>        Meta Attrs: target-role=Started
>        Operations: monitor start-delay=30 interval=60s timeout=30
>     (p_ipmi_fencing_1-monitor-60s)
>     Resource: p_ipmi_fencing_2 (class=stonith type=fence_ipmilan)
>        Attributes: ipaddr=128.##.##.220 login=XXXXX passwd=XXXXX
>     lanplus=1 action=reboot pcmk_host_check=static-list
>     pcmk_host_list=mici-admin2
>        Meta Attrs: target-role=Started
>        Operations: monitor start-delay=30 interval=60s timeout=30
>     (p_ipmi_fencing_2-monitor-60s)
>     Fencing Levels:
>
>     Location Constraints:
>        Resource: p_ipmi_fencing_1
>          Disabled on: mici-admin (score:-INFINITY)
>     (id:location-p_ipmi_fencing_1-mici-admin--INFINITY)
>        Resource: p_ipmi_fencing_2
>          Disabled on: mici-admin2 (score:-INFINITY)
>     (id:location-p_ipmi_fencing_2-mici-admin2--INFINITY)
>     Ordering Constraints:
>        start dlm-clone then start clvmd-clone (Mandatory)
>     (id:order-dlm-clone-clvmd-clone-mandatory)
>        start clvmd-clone then start clusterfs-clone (Mandatory)
>     (id:order-clvmd-clone-clusterfs-clone-mandatory)
>     Colocation Constraints:
>        clusterfs-clone with clvmd-clone (INFINITY)
>     (id:colocation-clusterfs-clone-clvmd-clone-INFINITY)
>        clvmd-clone with dlm-clone (INFINITY)
>     (id:colocation-clvmd-clone-dlm-clone-INFINITY)
>
>     Cluster Properties:
>     cluster-infrastructure: cman
>     dc-version: 1.1.10-14.el6_5.1-368c726
>     last-lrm-refresh: 1388530552
>     no-quorum-policy: ignore
>     stonith-enabled: true
>     Node Attributes:
>     mici-admin: standby=off
>     mici-admin2: standby=off
>
>
>     Last updated: Tue Dec 31 17:15:55 2013
>     Last change: Tue Dec 31 16:57:37 2013 via cibadmin on mici-admin
>     Stack: cman
>     Current DC: mici-admin2 - partition with quorum
>     Version: 1.1.10-14.el6_5.1-368c726
>     2 Nodes configured
>     8 Resources configured
>
>     Online: [ mici-admin mici-admin2 ]
>
>     Full list of resources:
>
>     p_ipmi_fencing_1        (stonith:fence_ipmilan):        Started
>     mici-admin2
>     p_ipmi_fencing_2        (stonith:fence_ipmilan):        Started
>     mici-admin
>     Clone Set: clusterfs-clone [clusterfs]
>           Started: [ mici-admin mici-admin2 ]
>     Clone Set: clvmd-clone [clvmd]
>           Started: [ mici-admin mici-admin2 ]
>     Clone Set: dlm-clone [dlm]
>           Started: [ mici-admin mici-admin2 ]
>
>     Migration summary:
>     * Node mici-admin:
>     * Node mici-admin2:
>
>     =====================================================
>     crm_mon after the fenced node reboots.  It shows the clvmd failure
>     that then occurs, which in turn triggers the fencing of that node.
>
>     Last updated: Tue Dec 31 17:06:55 2013
>     Last change: Tue Dec 31 16:57:37 2013 via cibadmin on mici-admin
>     Stack: cman
>     Current DC: mici-admin - partition with quorum
>     Version: 1.1.10-14.el6_5.1-368c726
>     2 Nodes configured
>     8 Resources configured
>
>     Node mici-admin: UNCLEAN (online)
>     Online: [ mici-admin2 ]
>
>     Full list of resources:
>
>     p_ipmi_fencing_1        (stonith:fence_ipmilan):        Stopped
>     p_ipmi_fencing_2        (stonith:fence_ipmilan):        Started
>     mici-admin
>     Clone Set: clusterfs-clone [clusterfs]
>           Started: [ mici-admin ]
>           Stopped: [ mici-admin2 ]
>     Clone Set: clvmd-clone [clvmd]
>           clvmd      (lsb:clvmd):    FAILED mici-admin
>           Stopped: [ mici-admin2 ]
>     Clone Set: dlm-clone [dlm]
>           Started: [ mici-admin mici-admin2 ]
>
>     Migration summary:
>     * Node mici-admin:
>         clvmd: migration-threshold=1000000 fail-count=1
>     last-failure='Tue Dec 31 17:04:29 2013'
>     * Node mici-admin2:
>
>     Failed actions:
>          clvmd_monitor_30000 on mici-admin 'unknown error' (1): call=60,
>     status=Timed Out, last-rc-change='Tue Dec 31 17:04:29 2013',
>     queued=0ms, exec=0ms
>
> --
> this is my life and I live it as long as God wills
>
>
> _______________________________________________
> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
>


-- 
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without 
access to education?