[Pacemaker] bug in ordering syntax?

Andrew Beekhof andrew at beekhof.net
Thu Dec 3 03:41:57 EST 2009


On Wed, Dec 2, 2009 at 3:22 PM, Frank DiMeo <Frank.DiMeo at bigbandnet.com> wrote:
> Ask and ye shall receive. :)
>
> I'm enclosing my openais init script, which I'm running on my two node cluster made up of identical Ubuntu (9.04) machines called ubuntu_2 and ubuntu_1.

If the node takes more than 30s to shut down, then your init script kills openais.
In that case, it's no surprise that the lrmd and pengine are still
around - because the cluster didn't have time to shut down cleanly.

> Running pacemaker 1.0.6 from the tip as of a month ago or so.
>
> I'm also enclosing two sets of files which may help you see what's happening.
>
> The "working" set:
>
> 4rsc_worlds_coloc_ordered.xml - this is my initial configuration file.  When I use this to initialize my cluster, the 4 resources all start up in order, on the right node, and move together when I put nodes in and out of standby.
>
> goodconfig_debug.txt - the log file (from ubuntu_1) showing what happens when the resources are running on node "ubuntu_2" and I put that node into standby.  All resources are moved to "ubuntu_1".  If I stop openais, everything shuts down quickly and cleanly, and no processes (like lrmd, pengine, etc) are left running.
>
> The "not working" set:

Can you attach /var/lib/pengine/pe-input-12434.bz2 from ubuntu_1 please?

>
> 4rsc_worlds_coloc_ordered_alt1.xml - this is identical to the xml file in the working set, except I use the compact syntax for ordering.
>
> badconfig_debug.txt - the log file (from ubuntu_1) showing what happens when the resources are running on node "ubuntu_2" and I put that node into standby.  The pe wants to move them to ubuntu_1, but the pe only seems to generate "pseudo actions" and never really moves anything.  The resources continue to run on node ubuntu_2 even when the node is in standby!  Further, if I try to shut down openais on ubuntu_2 at this point (using the /etc/init.d/openais script enclosed), after a long time, corosync stops, but lrmd and pengine keep running, and become children of the init process.  Again, the resources keep running even at this point, which is because they are never commanded to stop.
>
> I can send you my RAs and the resources themselves (which are just bash scripts) if you'd like.
>
> I'll apply the patch you pointed to and let you know what happens.
>
> Thanks very much,
> -Frank
>
>
>> -----Original Message-----
>> From: Andrew Beekhof [mailto:andrew at beekhof.net]
>> Sent: Wednesday, December 02, 2009 6:00 AM
>> To: pacemaker at oss.clusterlabs.org
>> Subject: Re: [Pacemaker] bug in ordering syntax?
>>
>> On Mon, Nov 30, 2009 at 9:19 PM, Frank DiMeo
>> <Frank.DiMeo at bigbandnet.com> wrote:
>> > I'm experimenting with startup sequence and co-location control, and think I
>> > may have stumbled across a bug.
>> >
>> >
>> >
>> > I have two xml files that I use in my testing as my initial configuration of
>> > a two node cluster.  I start each node with no configuration, and then use
>> > cibadmin to "source in" the xml file.  Each file defines two resources as
>> > well as a startup order and colocation definition.  The only difference
>> > between the two files is the syntax I use to specify the startup order.
>> >
>> >
>> >
>> > When I use the syntax:
>> >
>> >
>> >
>> > <rsc_order id="order-1" first="world1" then="world2" score="INFINITY" />
>> >
>> >
>> >
>> > Everything works fine.  I can put either of the two nodes into standby while
>> > resources are running there, and the resources move to the other node as
>> > expected.
>> >
>> >
>> >
>> > However, when I use the syntax:
>> >
>> >
>> >
>> > <rsc_order id="order-1">
>>
>> You're missing a score.  Without one it defaults to 0 (which means
>> optional).
>> However, IIRC, the 1.0.6 schema won't allow you to set a score there
>> so you'll need to apply the following patch:
>>    http://hg.clusterlabs.org/pacemaker/stable-1.0/rev/c8585629629c
>>
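A rough sketch of what the set-based constraint could look like once that patch is in. This assumes the patched schema accepts a score attribute directly on the rsc_order element (the schema change itself is not reproduced here, so check it against your build); the ids are the ones from your example.

```xml
<!-- Assumption: a score on rsc_order alongside a resource_set is only valid with the patch applied -->
<rsc_order id="order-1" score="INFINITY">
  <resource_set id="order-1-set-1" sequential="true">
    <resource_ref id="world1" />
    <resource_ref id="world2" />
  </resource_set>
</rsc_order>
```

With an explicit score of INFINITY the ordering is no longer optional, so this should behave like the first/then form that already works for you.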
>> >
>> >   <resource_set id="order-1-set-1" sequential="true">
>> >     <resource_ref id="world1" />
>> >     <resource_ref id="world2" />
>> >   </resource_set>
>> > </rsc_order>
>> >
>> >
>> >
>> >
>> >
>> > Several bad things happen.  First, the resources don't move off the node
>> > that is put into standby, even though the alternate node is running and able
>> > to run the resources.
>>
>> Did you remove the other ordering constraint first?
>>
>> > Second, attempting to shut down openais on the node
>> > running the resources after attempting a forced move (by putting the node
>> > into standby) leaves both the lrmd and pengine processes running (but as
>> > children of process 1, init), and the resources continue to run on that
>> > node even after openais is stopped.
>>
>> I suspect you've a faulty init script there.  See other email.
>>
>> > I turned debug on in crmd and in the logs and recorded what happens when I
>> > force standby, and I notice that using the first syntax causes
>> > te_rsc_command to be executed to send a shut down message to the node where
>> > the resources are running (which seems to work), while using the second
>> > syntax causes te_pseudo_action to be called in approximately the same place
>> > in the log, but no shutdown of resources happens (I can't really tell what
>> > this is supposed to be doing).
>>
>> Neither can I - you didn't attach the logs :-)
>>
>> _______________________________________________
>> Pacemaker mailing list
>> Pacemaker at oss.clusterlabs.org
>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>



