[Pacemaker] Pacemaker on OpenAIS, RRP, and link failure

Mon May 25 16:11:19 UTC 2009

On 2009-05-25 17:45, Andrew Beekhof wrote:
> SUSE is currently recommending NIC bonding.
> We've not been able to get satisfactory behavior from clusters using RRP.

I've repeatedly told customers that NIC bonding is not a valid
substitute for redundant Heartbeat links, and I will stubbornly insist
it isn't one for OpenAIS RRP links either.

Some reasons:
- You're not protected against bugs, currently known or unknown, in the
bonding driver. If bonding itself breaks, you're screwed.
- Most people actually run bonding over interfaces of the same make,
model, and chipset. That's not necessarily optimal, but it's a reality.
Thus, if your driver breaks, you're screwed again. Granted, this is
probably true if you ran two RRP links in that same configuration too.
- Finally, you can't bond between a switched and a direct back-to-back
connection, which makes bonding entirely unsuitable for the redundant
links use case I described earlier.

>> 1. Set rrp_problem_count_timeout and/or rrp_problem_count_threshold
>> ridiculously high so the ring status never goes to faulty. (It seems
>> that RRP "problem counting" can't be disabled altogether).
>>
>> 2. Have package maintainers include some magic that does
>> "openais-cfgtool -r" every time a network link changes its status to UP
>> (where the network management subsystem permits this).
>>
>> 3. Instruct users to install cron jobs that do "openais-cfgtool -r" in
>> specified intervals, causing OpenAIS to re-check the link status
>> periodically.
> 
> You could add it to the drbd monitor action I guess.
> But it does seem sub-optimal.
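
For illustration, option 3 in the quoted list (periodically re-running
"openais-cfgtool -r") could be sketched roughly as below. This is only a
sketch under assumptions: the command itself is the one named in the
thread, but the function names, the interval, and the injectable
run/sleep parameters are hypothetical and not part of OpenAIS or
Pacemaker.

```python
import subprocess
import time

# Command as quoted in the thread; everything else here is a hypothetical helper.
REENABLE_CMD = ["openais-cfgtool", "-r"]


def reenable_rings(run=subprocess.run):
    """Run the ring re-enable command once; return True on exit status 0."""
    try:
        result = run(REENABLE_CMD, check=False)
    except OSError:
        # Tool missing or not executable on this node.
        return False
    return result.returncode == 0


def reenable_periodically(interval_s, iterations, run=subprocess.run,
                          sleep=time.sleep):
    """Call reenable_rings() `iterations` times, sleeping between calls.

    Returns the number of calls that succeeded.
    """
    succeeded = 0
    for i in range(iterations):
        if reenable_rings(run):
            succeeded += 1
        if i + 1 < iterations:
            # No sleep after the final attempt.
            sleep(interval_s)
    return succeeded
```

A cron job would only need the single call, reenable_rings(); the loop
is there for running the same thing as a small standalone daemon. Either
way it re-enables the rings blindly, without knowing whether the link
has actually come back.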

As I already said in response to Juha's suggestion, it seems odd for
Pacemaker to fiddle with its own communication infrastructure.
Deferring that task to a Pacemaker resource agent instead seems
positively disturbing.

> I think the best solution is to work with upstream to get the feature
> working properly.

That I fully agree with. The question is what "working properly" means
in this case -- should it be capable of auto-recovery, or should it not?

Cheers,
Florian