[Pacemaker] Pacemaker on OpenAIS, RRP, and link failure

Andrew Beekhof andrew at beekhof.net
Mon May 25 12:16:02 EDT 2009


On Mon, May 25, 2009 at 6:10 PM, Florian Haas <florian.haas at linbit.com> wrote:
> On 2009-05-25 17:45, Andrew Beekhof wrote:
>> SUSE is currently recommending NIC bonding.
>> We've not been able to get satisfactory behavior from clusters using RRP.
>
> I've repeatedly told customers that NIC bonding is not a valid
> substitute for redundant Heartbeat links, and I will stubbornly insist
> it isn't one for OpenAIS RRP links either.
>
> Some reasons:
> - You're not protected against bugs, currently known or unknown, in the
> bonding driver. If bonding itself breaks, you're screwed.
> - Most people actually run bonding over interfaces of the same make,
> model, and chipset. That's not necessarily optimal, but it's a reality.
> Thus, if your driver breaks, you're screwed again. Granted, this is
> probably true if you ran two RRP links in that same configuration too.
> - Finally, you can't bond between a switched and a direct back-to-back
> connection, which makes bonding entirely unsuitable for the redundant
> links use case I described earlier.
>
>>> 1. Set rrp_problem_count_timeout and/or rrp_problem_count_threshold
>>> ridiculously high so the ring status never goes to faulty. (It seems
>>> that RRP "problem counting" can't be disabled altogether).
>>>
>>> 2. Have package maintainers include some magic that does
>>> "openais-cfgtool -r" every time a network link changes its status to UP
>>> (where the network management subsystem permits this).
>>>
>>> 3. Instruct users to install cron jobs that do "openais-cfgtool -r" in
>>> specified intervals, causing OpenAIS to re-check the link status
>>> periodically.
>>
>> You could add it to the drbd monitor action I guess.
>> But it does seem sub-optimal.
>
> I already made my point with regard to Juha's suggestion that it seems
> odd for Pacemaker to fiddle with its own communication infrastructure.

Agreed so far.

> To instead defer that task to a Pacemaker resource agent seems
> positively disturbing.

No more disturbing than #2, and what are the recurring monitor
operations if not a "cron" job?
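
For illustration, option 3 and the recurring-monitor idea both reduce to
running "openais-cfgtool -r" on a timer. A minimal Python sketch, assuming
the tool is on PATH and that a non-zero exit status means the reset failed
(an assumption, not checked against the OpenAIS sources). The function
names, the interval, and the iteration count are hypothetical:

```python
import subprocess
import time

# The command itself is the one quoted in this thread; everything else
# in this sketch is an illustrative assumption.
RESET_CMD = ["openais-cfgtool", "-r"]


def reset_rings(run=subprocess.run):
    """Ask OpenAIS to re-enable faulty rings; True if the tool exits 0."""
    try:
        result = run(RESET_CMD, check=False)
    except OSError:
        # Tool not installed or not executable on this node.
        return False
    return result.returncode == 0


def reset_periodically(interval_s, iterations, run=subprocess.run,
                       sleep=time.sleep):
    """Call reset_rings() `iterations` times, sleeping between calls."""
    outcomes = []
    for i in range(iterations):
        outcomes.append(reset_rings(run))
        if i < iterations - 1:
            sleep(interval_s)
    return outcomes
```

Whether this runs from cron, a link-up hook, or a resource agent's monitor
action, it is the same polling workaround, which is why none of the three
looks better than the others to me.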

>> I think the best solution is to work with upstream to get the feature
>> working properly.
>
> That I fully agree with. The question is what "working properly" means
> in this case -- should it be capable of auto-recovery, or should it not?

Absolutely.  It's both pointless and useless if it doesn't.



