[Pacemaker] Re: [PATCH] election trigger

Wed Nov 5 09:11:49 EST 2008

On Wednesday 05 November 2008 14:46:14 Andrew Beekhof wrote:
> On Nov 5, 2008, at 2:26 PM, Bernd Schubert wrote:
> > Hello Andrew,
> >
> > sorry for my late response.
> >
> > "Look here, you didn't set dc_deadtime, so crm is going to use a huge
> > useless timeout".
>
> Yeah, but eventually you looked at the code and proposed the patch :-)

Well yes, but don't ask how much time it took to look at the right code. I was
first checking heartbeat and didn't understand what the problem was, since
additional logs showed that actually everything was online. Until I found out
crmd is entirely independent...

>
> > But instead on each startup of heartbeat I get hundreds of lines into
> > syslog, and none of these look as if they are meant for the common admin;
> > IMHO 99% of it is developer information.
>
> yeah :-(
> the logging is a _lot_ better in 1.0, but could still be improved.

Sounds good. I really need to test it.

>
> at the cluster summit in prague we also agreed on a "black box"
> recorder that should help too.
> this way we can log tracing details there and only dump it into the
> logs (or recover it from core files) when needed.
>
> but this will live in corosync, so it won't help people running on
> heartbeat.

Well, if openais + corosync are better, we can try to switch to them.

>
> > Then after I found the code in pacemaker, I already tested setting
> > dc_deadtime, but during my initial test that didn't change anything. While
> > we need a heartbeat deadtime > 10min for Lustre installations, I set it on
> > my test systems to 180s.
> > Now after your suggestion I tested it again, with deadtime=20min but
> > dc_deadtime=10s, and quite oddly, crm still needs about 3min to set the
> > nodes online (syslog attached). With the code removed it is only 10s.
>
> Hmmm - that's odd - i'll take a look.

Thanks, I will also try to find some time to look at it again.

>
> > Since openais doesn't seem to support the code below at all and since it
> > is wrong when used together with heartbeat, I still think removing these
> > lines is right. Please correct me if I'm wrong.
>
> I'd prefer to fix the logic (if it's broken) since it's likely that
> we'd add an equivalent default mechanism for CoroSync eventually.

I just don't understand why we need that mechanism at all. I mean, if
heartbeat/corosync/openais detect that everything is online, why does pacemaker
need its own start timeout again? Shouldn't it try to bring everything online
as soon as it is started? Well, OK, it needs a timeout to detect whether other
nodes already have a DC. But then the DC detection timeout is not related at
all to node deadtime detection, is it?

>
> > PS: Sorry, the attached syslog is still with heartbeat-2.1.4. If you think
> > you fixed it in pacemaker already, please point me to the commit.
>
> No, this area doesn't get updated much (because it mostly works)

Ok, thanks. So I can concentrate on finding the real issue.

Thanks a lot for your help!

-- 
Bernd Schubert
Q-Leap Networks GmbH