[Pacemaker] Re: [PATCH] election trigger

Bernd Schubert bs at q-leap.de
Wed Nov 5 12:48:25 EST 2008


Great, thanks!

On Wednesday 05 November 2008 16:46:59 Andrew Beekhof wrote:
> On second thoughts - patch applied.
> Having two levels of defaults makes no sense.
>
> On Wed, Nov 5, 2008 at 16:36, Andrew Beekhof <beekhof at gmail.com> wrote:
> > On Nov 5, 2008, at 3:11 PM, Bernd Schubert wrote:
> >>> at the cluster summit in prague we also agreed on a "black box"
> >>> recorder that should help too.
> >>> this way we can log tracing details there and only dump it into the
> >>> logs (or recover it from core files) when needed.
> >>>
> >>> but this will live in corosync, so it won't help people running on
> >>> heartbeat.
> >>
> >> Well, if openais + corosync are better, we can try to switch to them.
> >
> > note the future tense there though... it's not implemented yet.
> >
> >>>> Then, after I found the code in pacemaker, I tested setting
> >>>> dc_deadtime, but during my initial test that didn't change anything.
> >>>> While we need a heartbeat deadtime > 10min for Lustre installations,
> >>>> I set it to 180s on my test systems.
> >>>> Now, after your suggestion, I tested it again with deadtime=20min but
> >>>> dc_deadtime=10s and, quite oddly, crm still needs about 3min to set
> >>>> the nodes online (syslog attached). With the code removed it is
> >>>> only 10s.
> >>>
> >>> Hmmm - that's odd - I'll take a look.
> >>
> >> Thanks, I will also try to find some time to look at it again.
> >>
> >>>> Since openais doesn't seem to support the code below at all, and
> >>>> since it is wrong when used together with heartbeat, I still think
> >>>> removing these lines is right. Please correct me if I'm wrong.
> >>>
> >>> I'd prefer to fix the logic (if it's broken) since it's likely that
> >>> we'd add an equivalent default mechanism for CoroSync eventually.
> >>
> >> I just don't understand why we need that mechanism at all. I mean if
> >> heartbeat/corosync/openais detect everything
> >
> > Especially with autojoin, it doesn't know that "everything" is online.
> > There could be some extra nodes about to start/join the cluster.
> >
> > Remember, this is only supposed to supply a default value.
> > Advanced users are free to set it as low as they like.
> >
> > Of course they need to know they can - that's a documentation issue which
> > can be easily rectified.
> >
> >> is online, why does pacemaker need its own start timeout again?
> >
> > because it needs to give any existing DC a chance to contact it rather
> > than needlessly causing another DC election.
> >
> >> Shouldn't it try to online everything as
> >> soon as it is started? Well, ok, it needs a timeout to detect if other
> >> nodes already have a DC.
> >
> > exactly.  so any value should only be used by the first node to come up.
> > is that what you're seeing?
> >
> >> But then the DC detection timeout is not related at all to
> >> node deadtime detection, is it?
> >
> > at the time it was felt that they were related enough that it made the
> > basis of a good default.
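
The "default only" behaviour discussed above can be sketched roughly as follows. This is an illustration of the intended semantics as described in the thread, not Pacemaker's actual code: the function names, the parameters, and the idea of falling back directly to the membership layer's deadtime are assumptions made for the example.

```python
from typing import Optional


def effective_dc_deadtime(configured_s: Optional[float],
                          membership_deadtime_s: float) -> float:
    """Pick how long a freshly started node waits to hear from an existing DC.

    configured_s: value explicitly set by the admin, or None if unset
                  (hypothetical parameter).
    membership_deadtime_s: deadtime of the membership layer, used only
                           as a fallback default (assumption).
    """
    if configured_s is not None:
        # An explicit setting always wins, however low it is.
        return configured_s
    return membership_deadtime_s


def should_trigger_election(seconds_waited: float,
                            dc_seen: bool,
                            wait_limit_s: float) -> bool:
    """Start an election only if no DC made contact within the wait limit."""
    return (not dc_seen) and seconds_waited >= wait_limit_s


if __name__ == "__main__":
    # deadtime=20min, dc_deadtime=10s: the explicit 10s should apply.
    limit = effective_dc_deadtime(10, 20 * 60)
    print(limit, should_trigger_election(10, False, limit))
```

With this reading, a node configured with deadtime=20min and dc_deadtime=10s would wait 10s, which matches what was expected in the test described above rather than the roughly 3min that was observed.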



-- 
Bernd Schubert
Q-Leap Networks GmbH



