[Pacemaker] "stonith_admin -F node" results in a pair of reboots

Bob Haxo bhaxo at sgi.com
Tue Jan 7 11:18:19 EST 2014


Hi Fabio,


> the gfs2 example was not ....


You are forgiven ... and you are light years ahead of me.  I have folks
asking for my docs, and have not had time to convert my notes into docs.
That is at the top of my to-do list after this task.

I have been fumbling with finding a combination of "chkconfig blah on" and
Pacemaker resources that works.  I'll go back to the "chkconfig clvmd on"
combination for another try ... nothing like learning that this works to
get me to try it again!
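
Roughly what that retry looks like in my notes, based on Fabio's steps quoted
below (the resource names are from my current config, and the pcs syntax is
from memory, so this is a sketch rather than something I have verified):

    # let the init scripts bring up the stack, with dlm/clvmd outside Pacemaker
    chkconfig cman on
    chkconfig clvmd on
    chkconfig pacemaker on

    # drop the Pacemaker-managed clones from my existing configuration,
    # leaving only the gfs2 Filesystem clone for Pacemaker to manage
    pcs resource delete clvmd-clone
    pcs resource delete dlm-clone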

Yep, running the latest RHEL6.5 updates.  I learned (years ago) to
compulsively check for updates when working with HA software.

I'll let you know how this goes.
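
In the meantime, here is roughly what I plan to check on the surviving node
the next time clvmd hangs after a fence, to test the lock hypothesis from my
original mail (commands from memory, so a sketch rather than a recipe):

    dlm_tool ls                           # are the clvmd and gfs2 lockspaces still present?
    dlm_tool lockdebug clvmd              # dump the locks held in the clvmd lockspace
    service clvmd status                  # is the daemon itself still running?
    vgs; lvs                              # do LVM commands return, or hang waiting on locks?
    grep locking_type /etc/lvm/lvm.conf   # should be 3 (clustered) on both nodes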

Thanks,
Bob Haxo


On Tue, 2014-01-07 at 09:21 +0100, Fabio M. Di Nitto wrote:

> On 1/6/2014 6:24 PM, Bob Haxo wrote:
> > Hi Fabio,
> > 
> >>> There is an example on how to configure gfs2 also in the rhel6.5
> >>> pacemaker documentation, using pcs.
> > 
> > Super!  Please share the link to this documentation.  I only discovered
> > the gfs2+pcs example with the rhel7 beta docs.
> 
> You are right, the gfs2 example was not published in Rev 1 of the
> pacemaker documentation for RHEL6.5. It's entirely possible I missed it
> during doc review, sorry about that!
> 
> https://access.redhat.com/site/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Configuring_the_Red_Hat_High_Availability_Add-On_with_Pacemaker/index.html
> 
> Short version is:
> 
> chkconfig cman on
> chkconfig clvmd on
> chkconfig pacemaker on
> 
> Use the above doc to set up / start the cluster (stop after the stonith config)
> 
> Set up your clvmd storage (note that neither dlm nor clvmd is managed by
> pacemaker in RHEL6.5, unlike RHEL7 where it's all managed by pacemaker).
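
For my own notes, I read this step as roughly the following. The shared device
path below is a placeholder, and the VG/LV/cluster names are from my test
setup, so treat it as a sketch rather than anything lifted from the official doc:

    lvmconf --enable-cluster                 # sets locking_type = 3 in lvm.conf
    service clvmd start                      # cman/dlm are already up via init
    pvcreate /dev/sdb                        # placeholder shared device
    vgcreate -cy vgha2 /dev/sdb              # -cy marks the VG as clustered
    lvcreate -n lv_clust2 -l 100%FREE vgha2
    mkfs.gfs2 -p lock_dlm -t mici:images -j 2 /dev/vgha2/lv_clust2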
> 
> Start adding your resources/services here etc...
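
Again for my notes: the gfs2 filesystem itself would then be the first resource
to add back, something like the command below. The options mirror my existing
clusterfs resource, and the exact pcs syntax is from memory, so a sketch only:

    pcs resource create clusterfs ocf:heartbeat:Filesystem \
        device="/dev/vgha2/lv_clust2" directory="/images" fstype="gfs2" \
        options="defaults,noatime,nodiratime" \
        op monitor interval=30s on-fail=fence --clone

    # then set interleave on the resulting clone, as in my current config
    pcs resource meta clusterfs-clone interleave=true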
> 
> Also, make absolutely sure you have all the latest updates from the 6.5
> errata installed.
> 
> Fabio
> 
> 
> > 
> > Bob Haxo
> > 
> > 
> > 
> > On Sat, 2014-01-04 at 16:56 +0100, Fabio M. Di Nitto wrote:
> >> On 01/01/2014 01:57 AM, Bob Haxo wrote:
> >> > Greetings ... Happy New Year!
> >> > 
> >> > I am testing a configuration created from the example in "Chapter 6.
> >> > Configuring a GFS2 File System in a Cluster" of the "Red Hat Enterprise
> >> > Linux 7.0 Beta Global File System 2" document.  The only addition is
> >> > stonith:fence_ipmilan.  After encountering this issue when I configured
> >> > with "crm", I re-configured using "pcs". I've included the configuration
> >> > below.
> >>
> >> Hold on a second here ... why are you using RHEL7 documentation to
> >> configure RHEL6.5? Please don't mix :) There are some differences and we
> >> definitely never tested mixing those up.
> >>
> >> There is an example on how to configure gfs2 also in the rhel6.5
> >> pacemaker documentation, using pcs.
> >>
> >> I personally never saw this behaviour, so it's entirely possible that
> >> mixing things up will result in unpredictable status.
> >>
> >> Fabio
> >>
> >> > 
> >> > I'm thinking that, in a 2-node cluster, if I run "stonith_admin -F
> >> > <peer-node>", then <peer-node> should reboot and cleanly rejoin the
> >> > cluster.  This is not happening. 
> >> > 
> >> > What ultimately happens is that after the initially fenced node reboots,
> >> > the system from which the stonith_admin -F command was run is fenced and
> >> > reboots. The fencing stops there, leaving the cluster in an appropriate
> >> > state.
> >> > 
> >> > The issue seems to reside with clvmd/lvm.  When the initially fenced node
> >> > reboots, the clvmd resource fails on the surviving node (its monitor times
> >> > out; see the crm_mon output below).  I hypothesize there is an issue with
> >> > locks, but have insufficient knowledge of clvmd/lvm locking to prove or
> >> > disprove this hypothesis.
> >> > 
> >> > Have I missed something ...
> >> > 
> >> > 1) Is this expected behavior, i.e., does the fencing node always get
> >> > rebooted as well?
> >> > 
> >> > 2) Or, maybe I didn't correctly duplicate the Chapter 6 example?
> >> > 
> >> > 3) Or, perhaps something is wrong or omitted from the Chapter 6 example?
> >> > 
> >> > Suggestions will be much appreciated.
> >> > 
> >> > Thanks,
> >> > Bob Haxo
> >> > 
> >> > RHEL6.5
> >> > pacemaker-cli-1.1.10-14.el6_5.1.x86_64
> >> > crmsh-1.2.5-55.1sgi709r3.rhel6.x86_64
> >> > pacemaker-libs-1.1.10-14.el6_5.1.x86_64
> >> > cman-3.0.12.1-59.el6_5.1.x86_64
> >> > pacemaker-1.1.10-14.el6_5.1.x86_64
> >> > corosynclib-1.4.1-17.el6.x86_64
> >> > corosync-1.4.1-17.el6.x86_64
> >> > pacemaker-cluster-libs-1.1.10-14.el6_5.1.x86_64
> >> > 
> >> > Cluster Name: mici
> >> > Corosync Nodes:
> >> > 
> >> > Pacemaker Nodes:
> >> > mici-admin mici-admin2
> >> > 
> >> > Resources:
> >> > Clone: clusterfs-clone
> >> >   Meta Attrs: interleave=true target-role=Started
> >> >   Resource: clusterfs (class=ocf provider=heartbeat type=Filesystem)
> >> >    Attributes: device=/dev/vgha2/lv_clust2 directory=/images fstype=gfs2
> >> > options=defaults,noatime,nodiratime
> >> >    Operations: monitor on-fail=fence interval=30s
> >> > (clusterfs-monitor-interval-30s)
> >> > Clone: clvmd-clone
> >> >   Meta Attrs: interleave=true ordered=true target-role=Started
> >> >   Resource: clvmd (class=lsb type=clvmd)
> >> >    Operations: monitor on-fail=fence interval=30s
> >> > (clvmd-monitor-interval-30s)
> >> > Clone: dlm-clone
> >> >   Meta Attrs: interleave=true ordered=true
> >> >   Resource: dlm (class=ocf provider=pacemaker type=controld)
> >> >    Operations: monitor on-fail=fence interval=30s (dlm-monitor-interval-30s)
> >> > 
> >> > Stonith Devices:
> >> > Resource: p_ipmi_fencing_1 (class=stonith type=fence_ipmilan)
> >> >   Attributes: ipaddr=128.##.##.78 login=XXXXX passwd=XXXXX lanplus=1
> >> > action=reboot pcmk_host_check=static-list pcmk_host_list=mici-admin
> >> >   Meta Attrs: target-role=Started
> >> >   Operations: monitor start-delay=30 interval=60s timeout=30
> >> > (p_ipmi_fencing_1-monitor-60s)
> >> > Resource: p_ipmi_fencing_2 (class=stonith type=fence_ipmilan)
> >> >   Attributes: ipaddr=128.##.##.220 login=XXXXX passwd=XXXXX lanplus=1
> >> > action=reboot pcmk_host_check=static-list pcmk_host_list=mici-admin2
> >> >   Meta Attrs: target-role=Started
> >> >   Operations: monitor start-delay=30 interval=60s timeout=30
> >> > (p_ipmi_fencing_2-monitor-60s)
> >> > Fencing Levels:
> >> > 
> >> > Location Constraints:
> >> >   Resource: p_ipmi_fencing_1
> >> >     Disabled on: mici-admin (score:-INFINITY)
> >> > (id:location-p_ipmi_fencing_1-mici-admin--INFINITY)
> >> >   Resource: p_ipmi_fencing_2
> >> >     Disabled on: mici-admin2 (score:-INFINITY)
> >> > (id:location-p_ipmi_fencing_2-mici-admin2--INFINITY)
> >> > Ordering Constraints:
> >> >   start dlm-clone then start clvmd-clone (Mandatory)
> >> > (id:order-dlm-clone-clvmd-clone-mandatory)
> >> >   start clvmd-clone then start clusterfs-clone (Mandatory)
> >> > (id:order-clvmd-clone-clusterfs-clone-mandatory)
> >> > Colocation Constraints:
> >> >   clusterfs-clone with clvmd-clone (INFINITY)
> >> > (id:colocation-clusterfs-clone-clvmd-clone-INFINITY)
> >> >   clvmd-clone with dlm-clone (INFINITY)
> >> > (id:colocation-clvmd-clone-dlm-clone-INFINITY)
> >> > 
> >> > Cluster Properties:
> >> > cluster-infrastructure: cman
> >> > dc-version: 1.1.10-14.el6_5.1-368c726
> >> > last-lrm-refresh: 1388530552
> >> > no-quorum-policy: ignore
> >> > stonith-enabled: true
> >> > Node Attributes:
> >> > mici-admin: standby=off
> >> > mici-admin2: standby=off
> >> > 
> >> > 
> >> > Last updated: Tue Dec 31 17:15:55 2013
> >> > Last change: Tue Dec 31 16:57:37 2013 via cibadmin on mici-admin
> >> > Stack: cman
> >> > Current DC: mici-admin2 - partition with quorum
> >> > Version: 1.1.10-14.el6_5.1-368c726
> >> > 2 Nodes configured
> >> > 8 Resources configured
> >> > 
> >> > Online: [ mici-admin mici-admin2 ]
> >> > 
> >> > Full list of resources:
> >> > 
> >> > p_ipmi_fencing_1        (stonith:fence_ipmilan):        Started mici-admin2
> >> > p_ipmi_fencing_2        (stonith:fence_ipmilan):        Started mici-admin
> >> > Clone Set: clusterfs-clone [clusterfs]
> >> >      Started: [ mici-admin mici-admin2 ]
> >> > Clone Set: clvmd-clone [clvmd]
> >> >      Started: [ mici-admin mici-admin2 ]
> >> > Clone Set: dlm-clone [dlm]
> >> >      Started: [ mici-admin mici-admin2 ]
> >> > 
> >> > Migration summary:
> >> > * Node mici-admin:
> >> > * Node mici-admin2:
> >> > 
> >> > =====================================================
> >> > crm_mon after the fenced node reboots.  Shows the failure of clvmd that
> >> > then occurs, which in turn triggers a fencing of that node
> >> > 
> >> > Last updated: Tue Dec 31 17:06:55 2013
> >> > Last change: Tue Dec 31 16:57:37 2013 via cibadmin on mici-admin
> >> > Stack: cman
> >> > Current DC: mici-admin - partition with quorum
> >> > Version: 1.1.10-14.el6_5.1-368c726
> >> > 2 Nodes configured
> >> > 8 Resources configured
> >> > 
> >> > Node mici-admin: UNCLEAN (online)
> >> > Online: [ mici-admin2 ]
> >> > 
> >> > Full list of resources:
> >> > 
> >> > p_ipmi_fencing_1        (stonith:fence_ipmilan):        Stopped
> >> > p_ipmi_fencing_2        (stonith:fence_ipmilan):        Started mici-admin
> >> > Clone Set: clusterfs-clone [clusterfs]
> >> >      Started: [ mici-admin ]
> >> >      Stopped: [ mici-admin2 ]
> >> > Clone Set: clvmd-clone [clvmd]
> >> >      clvmd      (lsb:clvmd):    FAILED mici-admin
> >> >      Stopped: [ mici-admin2 ]
> >> > Clone Set: dlm-clone [dlm]
> >> >      Started: [ mici-admin mici-admin2 ]
> >> > 
> >> > Migration summary:
> >> > * Node mici-admin:
> >> >    clvmd: migration-threshold=1000000 fail-count=1 last-failure='Tue Dec
> >> > 31 17:04:29 2013'
> >> > * Node mici-admin2:
> >> > 
> >> > Failed actions:
> >> >     clvmd_monitor_30000 on mici-admin 'unknown error' (1): call=60,
> >> > status=Timed Out, last-rc-change='Tue Dec 31 17:04:29 2013', queued=0ms, exec=0ms
> >> > 