[Pacemaker] Question regarding starting of master/slave resources and ELECTIONs

Bob Schatz bschatz at yahoo.com
Thu Apr 14 23:58:11 EDT 2011


Andrew,

Thanks for the help.

Comments are inline, marked with <BS>.



________________________________
From: Andrew Beekhof <andrew at beekhof.net>
To: Bob Schatz <bschatz at yahoo.com>
Cc: The Pacemaker cluster resource manager <pacemaker at oss.clusterlabs.org>
Sent: Thu, April 14, 2011 2:14:40 AM
Subject: Re: [Pacemaker] Question regarding starting of master/slave resources 
and ELECTIONs

On Thu, Apr 14, 2011 at 10:49 AM, Andrew Beekhof <andrew at beekhof.net> wrote:

>>> I noticed that 4 of the master/slave resources will start right away, but
>>> the 5th master/slave resource seems to take a minute or so, and I am only
>>> running with one node.
>>> Is this expected?
>>
>> Probably, if the other 4 take around a minute each to start.
>> There is an lrmd config variable that controls how much parallelism it
>> allows (but I forget the name).
>> <Bob> It's max-children and I set it to 40 for this test to see if it would
>> change the behavior.  (/sbin/lrmadmin -p max-children 40)
>
> That's surprising.  I'll have a look at the logs.

Looking at the logs, I see a couple of things:


This is very bad:
Apr 12 19:33:42 mgraid-S000030311-1 crmd: [17529]: WARN: get_uuid:
Could not calculate UUID for mgraid-s000030311-0
Apr 12 19:33:42 mgraid-S000030311-1 crmd: [17529]: WARN:
populate_cib_nodes_ha: Node mgraid-s000030311-0: no uuid found

For some reason Pacemaker can't get the node's UUID from heartbeat.


<BS> I create the UUID when the node comes up.
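A quick way to see what the CIB currently has for the node entries is to dump
just the <nodes> section; the output sketched below is illustrative, not taken
from this cluster:

    # query only the nodes section of the CIB
    cibadmin -Q -o nodes
    # expected to show an id (the UUID) for each uname, e.g.
    #   <nodes>
    #     <node id="..." uname="mgraid-s000030311-0" type="normal"/>
    #     <node id="..." uname="mgraid-s000030311-1" type="normal"/>
    #   </nodes>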

So we start a few things:

Apr 12 19:33:41 mgraid-S000030311-1 crmd: [17529]: info:
do_lrm_rsc_op: Performing
key=23:3:0:48aac631-8177-4cda-94ea-48dfa9b1a90f
op=SSS000030311:0_start_0 )
Apr 12 19:33:41 mgraid-S000030311-1 crmd: [17529]: info:
do_lrm_rsc_op: Performing
key=49:3:0:48aac631-8177-4cda-94ea-48dfa9b1a90f
op=SSJ000030312:0_start_0 )
Apr 12 19:33:41 mgraid-S000030311-1 crmd: [17529]: info:
do_lrm_rsc_op: Performing
key=75:3:0:48aac631-8177-4cda-94ea-48dfa9b1a90f
op=SSJ000030313:0_start_0 )
Apr 12 19:33:41 mgraid-S000030311-1 crmd: [17529]: info:
do_lrm_rsc_op: Performing
key=101:3:0:48aac631-8177-4cda-94ea-48dfa9b1a90f
op=SSJ000030314:0_start_0 )

But then another change comes in:

Apr 12 19:33:41 mgraid-S000030311-1 crmd: [17529]: info:
abort_transition_graph: need_abort:59 - Triggered transition abort
(complete=0) : Non-status change

Normally we'd recompute and keep going, but it was a(nother) replace
operation, so:

Apr 12 19:33:42 mgraid-S000030311-1 crmd: [17529]: info:
do_state_transition: State transition S_TRANSITION_ENGINE ->
S_ELECTION [ input=I_ELECTION cause=C_FSA_INTERNAL
origin=do_cib_replaced ]

All the time goes here:

Apr 12 19:35:31 mgraid-S000030311-1 crmd: [17529]: WARN:
action_timer_callback: Timer popped (timeout=20000,
abort_level=1000000, complete=true)
Apr 12 19:35:31 mgraid-S000030311-1 crmd: [17529]: WARN:
action_timer_callback: Ignoring timeout while not in transition
Apr 12 19:35:31 mgraid-S000030311-1 crmd: [17529]: WARN:
action_timer_callback: Timer popped (timeout=20000,
abort_level=1000000, complete=true)
Apr 12 19:35:31 mgraid-S000030311-1 crmd: [17529]: WARN:
action_timer_callback: Ignoring timeout while not in transition
Apr 12 19:35:31 mgraid-S000030311-1 crmd: [17529]: WARN:
action_timer_callback: Timer popped (timeout=20000,
abort_level=1000000, complete=true)
Apr 12 19:35:31 mgraid-S000030311-1 crmd: [17529]: WARN:
action_timer_callback: Ignoring timeout while not in transition
Apr 12 19:35:31 mgraid-S000030311-1 crmd: [17529]: WARN:
action_timer_callback: Timer popped (timeout=20000,
abort_level=1000000, complete=true)
Apr 12 19:35:31 mgraid-S000030311-1 crmd: [17529]: WARN:
action_timer_callback: Ignoring timeout while not in transition
Apr 12 19:37:00 mgraid-S000030311-1 crmd: [17529]: ERROR:
crm_timer_popped: Integration Timer (I_INTEGRATED) just popped!

but it's not at all clear to me why - although certainly avoiding the
election would help.
Is there any chance to load all the changes at once?


<BS> Yes.  That worked.  I created the configuration in a file and then did a
"crm configure load update <filename>" to avoid the election.

Possibly the delay is related to the UUID issue above, or it might
be related to one of these two patches that went in after 1.0.9:

andrew (stable-1.0)  High: crmd: Make sure we always poke the FSA after a
    transition to clear any TE_HALT actions  (CS: 9187c0506fd3, 2010-07-07)
andrew (stable-1.0)  High: crmd: Reschedule the PE_START action if its not
    already running when we try to use it  (CS: e44dfe49e448, 2010-11-11)

Could you try turning on debug and/or a more recent version?


<BS> I turned on debug and grabbed the logs, configuration, and
/var/lib/pengine directory.  They are attached.
Unfortunately I cannot try a new version on this hardware at this time. :(
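(For anyone reproducing this: debug was raised via heartbeat's ha.cf, roughly
the lines below, and something like hb_report can collect an equivalent bundle
of logs, CIB and pengine files.  The exact directives and command here are
illustrative rather than a record of what was run on these nodes.)

    # /etc/ha.d/ha.cf (heartbeat stack)
    debug 1
    debugfile /var/log/ha-debug

    # collect logs, CIB and /var/lib/pengine inputs from around the incident
    hb_report -f "2011/04/12 19:30" /tmp/mgraid-debug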


Thanks,

Bob
-------------- next part --------------
A non-text attachment was scrubbed...
Name: debug.tar.gz
Type: application/x-gzip
Size: 174320 bytes
Desc: not available
URL: <https://lists.clusterlabs.org/pipermail/pacemaker/attachments/20110414/75a202b9/attachment-0001.bin>

