[Pacemaker] Re: [PATCH] election trigger

Andrew Beekhof beekhof at gmail.com
Wed Nov 5 08:46:14 EST 2008

On Nov 5, 2008, at 2:26 PM, Bernd Schubert wrote:

> Hello Andrew,
> sorry for my late response.
> On Sunday 02 November 2008 20:32:14 Andrew Beekhof wrote:
>> On Oct 30, 2008, at 6:08 PM, Bernd Schubert wrote:
>>> Heartbeat calls crmd only if all nodes are already online.
>> Not everyone uses it on heartbeat anymore ;-)
> I grepped the sources of openais and corosync for "KEY_INITDEAD", but can't
> find anything.

Correct.  For those two the sanity logic doesn't achieve anything.

> Are there any further solutions pacemaker supports?

Just those three.

>>> So introducing another possibly huge deadtime here will at least delay
>>> the DC selection, and so resource startup, by heartbeat's initial
>>> deadtime. If one node doesn't come up at all, e.g. after a global power
>>> failure, the DC selection is even delayed by 2 x the initial hb
>>> deadtime. Simply remove the usage of heartbeat's initial deadtime and
>>> only use our own.
>> I don't understand.
>> The logic below is only triggered for people who haven't set a value
>> for dc_deadtime... why not just set a value in the cib?
> Well firstly, the logs didn't tell me:
> "Look here, you didn't set dc_deadtime, so crm is going to use a huge
> useless timeout".

Yeah, but eventually you looked at the code and proposed the patch :-)

> But instead, on each startup of heartbeat I get hundreds of lines in syslog,
> and none of these look as if they are meant for the common admin; IMHO 99%
> of it is developer information.

yeah :-(
the logging is a _lot_ better in 1.0, but could still be improved.

at the cluster summit in prague we also agreed on a "black box" recorder
that should help too.
this way we can log tracing details there and only dump it into the logs
(or recover it from core files) when needed.
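To make the "black box" idea concrete, here is a minimal sketch, assuming a fixed-size in-memory ring buffer. The class and method names are hypothetical and are not corosync's API; the sketch only shows trace messages being kept cheaply in memory and written out on request.

```python
from collections import deque


class BlackBox:
    """Hypothetical in-memory trace recorder (illustration only)."""

    def __init__(self, capacity=1000):
        # A bounded deque silently discards the oldest entry once full.
        self._entries = deque(maxlen=capacity)

    def trace(self, message):
        # Recording is cheap: nothing is written to the logs here.
        self._entries.append(message)

    def dump(self):
        # Hand back everything recorded so far, oldest first, then reset.
        entries = list(self._entries)
        self._entries.clear()
        return entries


box = BlackBox(capacity=3)
for i in range(5):
    box.trace(f"event {i}")
recent = box.dump()  # only the three most recent entries survive
```

The bounded buffer is the design choice that matters: tracing can stay enabled all the time because memory use is capped, and the detail only reaches the logs when someone asks for it.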

but this will live in corosync, so it won't help people running on  

> Then, after I found the code in pacemaker, I tested setting dc_deadtime,
> but during my initial test that didn't change anything. While we need a
> heartbeat deadtime > 10min for Lustre installations, I set it to 180s on my
> test systems.
> Now, after your suggestion, I tested it again with deadtime=20min but
> dc_deadtime=10s and, oddly, crm still needs about 3min to set the nodes
> online (syslog attached). With the code removed it is only 10s.

Hmmm - that's odd - I'll take a look.

> Since openais doesn't seem to support the code below at all, and since it
> is wrong when used together with heartbeat, I still think removing these
> lines is right. Please correct me if I'm wrong.

I'd prefer to fix the logic (if it's broken) since it's likely that we'd
add an equivalent default mechanism for CoroSync eventually.
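For reference, the intended behaviour as described in this thread could be sketched roughly as follows. This is not the crmd code: the function name, parameter names and the 60s built-in default are assumptions made purely for illustration. It only captures the rule that an explicitly configured dc_deadtime wins and that the cluster layer's initial deadtime is consulted only when nothing is configured.

```python
def effective_dc_deadtime(configured_s, cluster_initdead_s=None,
                          builtin_default_s=60):
    """Pick the election trigger timeout in seconds (hypothetical sketch).

    configured_s       -- dc_deadtime from the cib, or None if unset
    cluster_initdead_s -- initial deadtime reported by the cluster layer,
                          or None where no such value exists
                          (e.g. openais/corosync have no KEY_INITDEAD)
    builtin_default_s  -- assumed fallback value, not the real default
    """
    if configured_s is not None:
        # An explicit setting must always take precedence.
        return configured_s
    if cluster_initdead_s is not None:
        # The sanity fallback applies only when nothing was configured.
        return cluster_initdead_s
    return builtin_default_s


# deadtime=20min with dc_deadtime=10s should give a 10s trigger
timeout = effective_dc_deadtime(10, cluster_initdead_s=1200)
```

Under this rule the ~3min delay reported above with dc_deadtime=10s would not be expected, which is why the logic deserves a closer look before it is removed.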

> PS: Sorry, the attached syslog is still with heartbeat-2.1.4. If you think
> you fixed it in pacemaker already, please point me to the commit.

No, this area doesn't get updated much (because it mostly works)
