[Pacemaker] Re: [PATCH] election trigger

Wed Nov 5 10:46:59 EST 2008

On second thoughts - patch applied.
Having two levels of defaults makes no sense.

On Wed, Nov 5, 2008 at 16:36, Andrew Beekhof <beekhof at gmail.com> wrote:
>
> On Nov 5, 2008, at 3:11 PM, Bernd Schubert wrote:
>
>>>
>>> at the cluster summit in prague we also agreed on a "black box"
>>> recorder that should help too.
>>> this way we can log tracing details there and only dump it into the
>>> logs (or recover it from core files) when needed.
>>>
>>> but this will live in corosync, so it won't help people running on
>>> heartbeat.
>>
>> Well, if openais + corosync are better, we can try to switch to it.
>
> note the future tense there though... it's not implemented yet.
>
>>
>>
>>>
>>>> Then after I found the code in pacemaker, I already tested setting
>>>> dc_deadtime, but during my initial test that didn't change anything.
>>>> While we need a heartbeat deadtime > 10min for Lustre installations,
>>>> I set it to 180s on my test systems.
>>>> Now after your suggestion I tested it again, with deadtime=20min but
>>>> dc_deadtime=10s, and quite oddly, crm still needs about 3min to set
>>>> the nodes online (syslog attached). With the code removed it is only 10s.
>>>
>>> Hmmm - that's odd - i'll take a look.
>>
>> Thanks, I will also try to find some time to look at it again.
>>
>>>
>>>> Since openais doesn't seem to support the code below at all, and since
>>>> it is wrong when used together with heartbeat, I still think removing
>>>> these lines is right. Please correct me if I'm wrong.
>>>
>>> I'd prefer to fix the logic (if it's broken) since it's likely that
>>> we'd add an equivalent default mechanism for CoroSync eventually.
>>
>> I just don't understand why we need that mechanism at all. I mean if
>> heartbeat/corosync/openais detect everything
>
> Especially with autojoin, it doesn't know that "everything" is online.
> There could be some extra nodes about to start/join the cluster.
>
> Remember, this is only supposed to supply a default value.
> Advanced users are free to set it as low as they like.
>
> Of course they need to know they can - that's a documentation issue which can
> be easily rectified.
>
>> is online, why does pacemaker need its own start timeout again?
>
> because it needs to give any existing DC a chance to contact it rather than
> needlessly causing another DC election.
>
>> Shouldn't it try to online everything as soon as it is started? Well, ok,
>> it needs a timeout to detect if other nodes already have a DC.
>
> exactly.  so any value should only be used by the first node to come up.
> is that what you're seeing?
>
>> But then the DC detection timeout is not related at all to
>> node deadtime detection, is it?
>
> at the time it was felt that they were related enough that it made the basis
> of a good default.
>
>
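
For anyone following along, the intent discussed in the quoted thread could
be sketched roughly as below. This is only an illustration, not the actual
crmd code: the function names and parameters are hypothetical, and it assumes
the "default" simply falls back to the stack's deadtime when no explicit DC
timeout is configured.

```python
from typing import Optional


def effective_dc_wait(configured_s: Optional[float], stack_deadtime_s: float) -> float:
    """How long a starting node waits to hear from an existing DC.

    Hypothetical helper: an explicitly configured value always wins;
    the messaging-layer deadtime is only used as a fallback default.
    """
    if configured_s is not None:
        return configured_s
    return stack_deadtime_s


def should_start_election(elapsed_s: float, heard_from_dc: bool, wait_s: float) -> bool:
    """Trigger an election only if no existing DC made contact in time."""
    if heard_from_dc:
        # An existing DC contacted us, so a new election would be needless.
        return False
    return elapsed_s >= wait_s


# Example with the values from the report: deadtime=20min, DC timeout=10s.
wait = effective_dc_wait(10, 20 * 60)
print(wait, should_start_election(10, False, wait))
```

Under these assumptions the first node to come up would wait only for the
explicitly configured value, which is the behaviour Bernd expected to see.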