[Pacemaker] GFS2 with Pacemaker on RHEL6.3 restarts with reboot

Bob Haxo bhaxo at sgi.com
Sun Aug 12 21:27:57 EDT 2012


On Fri, 2012-08-10 at 12:21 +1000, Andrew Beekhof wrote:
> On Thu, Aug 9, 2012 at 12:14 PM, Bob Haxo <bhaxo at sgi.com> wrote:
> > Greetings.
> >
> > I have followed the setup instructions of Clusters from Scratch:
> > Creating Active/Passive and Active/Active Clusters on Fedora, Edition 5,
> > including locating the new cman pages that do not seem to be linked into
> > the main document, for example:
> >
> > http://www.clusterlabs.org/doc/en-US/Pacemaker/1.1/html/Clusters_from_Scratch/ch08s02s02.html
> 
> The 1.1 document was updated for corosync 2.x
> I kept the cman/plugin version around but moved it to:
> 
> http://www.clusterlabs.org/doc/en-US/Pacemaker/1.1-plugin/html/Clusters_from_Scratch/index.html
> 
> Look for "Version: 1.1-plugin" on the main docs page.

Andrew, much thanks for the response ... I had not connected the dots
that the cman-based instructions are an *earlier* version of the docs
(and software stack).

> 
> >
> > The stack that I'm implementing includes RHEL6.3, drbd, dlm, gfs2,
> > Pacemaker (RHEL6.3 build), cman, kvm ... hopefully I didn't leave
> > anybody off the party list.
> >
> > I have these all working together to support "live" migration of the
> > virt client between the two phys hosts, so at that level, all is good.
> >
> > Questions: Is there a document that fully covers such an
> > installation, meaning one that extends Clusters from Scratch (replacing
> > the Apache example) to implement an HA virtual client? For
> > instance, should libvirtd be handled as a Pacemaker resource, or should
> > it be started as a system service at boot?  What should be done with
> > "libvirt-guests"?
> 
> These things I do not know, sorry.
> 
> >  Should cman be started as a system service at boot?
> 
> I prefer not to, but it's just a personal preference.
> I run potentially broken versions of the cluster and have been hit
> hard before by processes running amok and putting machines into
> reboot cycles.

Ah, right.  In my testing I also start cman and pacemaker manually.  I
was thinking more of when moving from testing to production, but I think
you have answered that.
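For production on RHEL 6, enabling the stack at boot would presumably be the usual chkconfig route. A minimal sketch, assuming the stock init scripts shipped with the cman and pacemaker packages:

```shell
# Enable the cluster stack at boot (RHEL 6 / SysV init).
# cman must come up before pacemaker; the packaged init scripts
# carry chkconfig priorities that enforce that ordering.
chkconfig cman on
chkconfig pacemaker on

# Start them in order for the current session as well.
service cman start
service pacemaker start
```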

> 
> >
> > Problem: When the non-VM host is rebooted, then as Pacemaker
> > restarts, the gfs2 filesystem gets restarted on the VM host, which
> > causes a stop and start of the VirtualDomain. The gfs2 filesystem also
> > gets restarted even when the VirtualDomain resource is not included.
> 
> This sounds like the "starting a clone on A causes a restart of the
> clone on B" bug.
> I think we've squashed that one now but not in a released version...
> how confident are you at creating rpms?

:-)  Well, "how confident" depends upon the precise meaning of "creating
rpms" ... if it means building an rpm given a working spec file, that I
can do. If it is a matter of making mods to an almost-working spec file,
that I can do too. If it involves creating the spec file from scratch
for a large project, that would be a challenge.
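For reference, the "working spec file" case above typically comes down to the standard rpmbuild workflow. A sketch, assuming a Pacemaker source rpm is at hand and the rpmdevtools/yum-utils helper packages are installed:

```shell
# Rebuild from a source rpm -- the "building an rpm given a
# working spec file" case.  Paths follow the default ~/rpmbuild
# layout created by rpmdev-setuptree.
rpmdev-setuptree                               # create ~/rpmbuild/{SPECS,SOURCES,...}
rpm -ivh pacemaker-*.src.rpm                   # unpack spec + tarballs into ~/rpmbuild
yum-builddep ~/rpmbuild/SPECS/pacemaker.spec   # pull in the build dependencies
rpmbuild -ba ~/rpmbuild/SPECS/pacemaker.spec   # build binary and source rpms
```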

FYI, I'm trying to get Pacemaker accepted for use in a product rather
than rgmanager.

Thanks, Andrew.
Bob Haxo
bhaxo at sgi.com

> 
> > This behavior does not seem correct ... I think I would have flagged
> > it in my memory had I encountered it when working with the SLES HAE
> > product.  I've done a lot of fumbling this past week trying to get
> > the colocation and order statements right, but nothing has changed
> > this behavior.
> >
> > What am I missing?
> >
> > Here are the first indications of this restart issue, seen as
> > Pacemaker and friends restart at boot.  I have attached more messages.
> >
> > Aug  8 20:00:57 hikari crmd[2734]:     info: abort_transition_graph: te_update_diff:176 - Triggered transition abort (complete=1, tag=nvpair, id=status-hikari2-master-drbd_r0.1, name=master-drbd_r0:1, value=5, magic=NA, cib=0.474.170) : Transient attribute: update
> > Aug  8 20:00:57 hikari crmd[2734]:   notice: do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph ]
> > Aug  8 20:00:57 hikari pengine[2733]:   notice: unpack_config: On loss of CCM Quorum: Ignore
> > Aug  8 20:00:57 hikari pengine[2733]:   notice: LogActions: Promote drbd_r0:1#011(Slave -> Master hikari2)
> > Aug  8 20:00:57 hikari pengine[2733]:   notice: LogActions: Restart virt#011(Started hikari) <<<<<<<<<<<<<<<<<<
> > Aug  8 20:00:57 hikari pengine[2733]:   notice: LogActions: Restart shared-gfs2:0#011(Started hikari)  <<<<<<<<
> > Aug  8 20:00:57 hikari pengine[2733]:   notice: LogActions: Start   shared-gfs2:1#011(hikari2)
> > Aug  8 20:00:57 hikari crmd[2734]:     info: abort_transition_graph: te_update_diff:176 - Triggered transition abort (complete=1, tag=nvpair, id=status-hikari2-master-drbd_r1.1, name=master-drbd_r1:1, value=5, magic=NA, cib=0.474.171) : Transient attribute: update
> >
> > Here are the current constraints resulting from that fumbling
> > (actually, from trying to make sense of all of the information
> > obtained from Google searches):
> >
> > colocation co-gfs-on-drbd inf: c_shared-gfs2 drbd_r0_clone:Master
> > order o-drbd_r0-then-gfs inf: drbd_r0_clone:promote c_shared-gfs2:start
> > order o-drbd_r1_clone-then-virt inf: drbd_r1_clone virt
> > order o-gfs-then-virt inf: c_shared-gfs2 virt
> >
> > Full config file attached.
> >
> > For reference, here is "service blah status" for the set of services:
> >
> > [root at hikari2 ~]# ha-status
> > ------- service corosync status -------
> > corosync (pid  1996) is running...
> > ------- service cman status -------
> > cluster is running.
> > ------- service drbd status -------
> > drbd driver loaded OK; device status:
> > version: 8.4.1 (api:1/proto:86-100)
> > GIT-hash: 91b4c048c1a0e06777b5f65d312b38d47abaea80 build by
> > phil at Build64R6, 2012-04-17 11:28:08
> > m:res  cs         ro               ds                 p  mounted  fstype
> > 1:r0   Connected  Primary/Primary  UpToDate/UpToDate  C  /shared  gfs2
> > 2:r1   Connected  Primary/Primary  UpToDate/UpToDate  C
> > 3:r2   Connected  Primary/Primary  UpToDate/UpToDate  C
> > ------- service pacemaker status -------
> > pacemakerd (pid  8912) is running...
> > ------- service gfs2 status -------
> > Configured GFS2 mountpoints:
> > /shared
> > Active GFS2 mountpoints:
> > /shared
> > ------- service libvirtd status -------
> > libvirtd (pid  2510) is running...
> >
> > [root at hikari ~]# crm_mon -1ro
> > ============
> > Last updated: Wed Aug  8 21:01:47 2012
> > Last change: Wed Aug  8 20:48:49 2012 via cibadmin on hikari
> > Stack: cman
> > Current DC: hikari - partition with quorum
> > Version: 1.1.7-6.el6-148fccfd5985c5590cc601123c6c16e966b85d14
> > 2 Nodes configured, 2 expected votes
> > 11 Resources configured.
> > ============
> >
> > Online: [ hikari hikari2 ]
> >
> > Full list of resources:
> >
> >  Master/Slave Set: drbd_r0_clone [drbd_r0]
> >      Masters: [ hikari hikari2 ]
> >  Master/Slave Set: drbd_r1_clone [drbd_r1]
> >      Masters: [ hikari hikari2 ]
> >  Master/Slave Set: drbd_r2_clone [drbd_r2]
> >      Masters: [ hikari hikari2 ]
> >  ipmi-fencing-1 (stonith:fence_ipmilan):        Started hikari
> >  ipmi-fencing-2 (stonith:fence_ipmilan):        Started hikari2
> >  virt   (ocf::heartbeat:VirtualDomain): Started hikari
> >  Clone Set: c_shared-gfs2 [shared-gfs2]
> >      Started: [ hikari hikari2 ]
> >
> > Operations:
> > * Node hikari2:
> >    drbd_r1:1: migration-threshold=1000000
> >     + (17) monitor: interval=60000ms rc=0 (ok)
> >     + (26) promote: rc=0 (ok)
> >    drbd_r0:1: migration-threshold=1000000
> >     + (21) promote: rc=0 (ok)
> >    drbd_r2:1: migration-threshold=1000000
> >     + (19) monitor: interval=60000ms rc=0 (ok)
> >     + (27) promote: rc=0 (ok)
> >    ipmi-fencing-2: migration-threshold=1000000
> >     + (12) start: rc=0 (ok)
> >     + (13) monitor: interval=240000ms rc=0 (ok)
> >    shared-gfs2:1: migration-threshold=1000000
> >     + (25) start: rc=0 (ok)
> > * Node hikari:
> >    drbd_r1:0: migration-threshold=1000000
> >     + (24) promote: rc=0 (ok)
> >    drbd_r2:0: migration-threshold=1000000
> >     + (25) promote: rc=0 (ok)
> >    shared-gfs2:0: migration-threshold=1000000
> >     + (92) start: rc=0 (ok)
> >    drbd_r0:0: migration-threshold=1000000
> >     + (23) promote: rc=0 (ok)
> >    ipmi-fencing-1: migration-threshold=1000000
> >     + (12) start: rc=0 (ok)
> >     + (13) monitor: interval=240000ms rc=0 (ok)
> >    virt: migration-threshold=1000000
> >     + (120) start: rc=0 (ok)
> >     + (121) monitor: interval=10000ms rc=0 (ok)
> >
> > Thanks for reading ...
> > Bob Haxo
> > bhaxo @ sgi.com
> >
> > _______________________________________________
> > Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
> > http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> >
> > Project Home: http://www.clusterlabs.org
> > Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> > Bugs: http://bugs.clusterlabs.org
> >
> 
