[Pacemaker] how to get pacemaker:ping recheck before promoting drbd resources on a node

Wed Apr 27 05:03:09 EDT 2011

On Mon, Apr 18, 2011 at 08:57:11PM +0200, Jelle de Jong wrote:
> Hello everybody,
> 
> I need to be able to bring down my network interface (network failure
> test) and a few seconds later bring it up again, without my drbd cluster
> going nuts and creating split brains.
> 
> I was advised to use ocf:pacemaker:ping, so I started to integrate this
> in my configuration: http://pastebin.com/raw.php?i=iyp3URkP
> 
> Now the problem is that it kind of works, but not the way I need it to be.
> 
> The ping status is not rechecked right _before_ it tries to promote the
> drbd resources. It should do a fast ping check and continue if
> successful, but _not_ promote any drbd resources when it stalls or fails.
> 
> The problem is that the ping has been returning good values up until
> the network failure, and when the failure occurs it still thinks
> the ping status is good and promotes the disk. Only a few seconds
> later does the ping status change to indicate the network failure, but
> by then all the damage is already done...
> 
> I must be doing something _terribly_ wrong, since I can't believe a
> pacemaker/corosync cluster shouldn't be able to survive a network glitch
> (short network failures) without all kinds of split brains and losing the
> node.

You only have one communication channel.

Try with redundant rings.
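
As a rough sketch only (corosync 1.x syntax; the networks, multicast
addresses and ports below are made-up placeholders, not taken from your
setup), a totem section in corosync.conf with two rings might look
something like this:

```
totem {
    version: 2

    # "passive" alternates between the available rings;
    # "active" sends every message on all of them.
    rrp_mode: passive

    # First ring, e.g. the regular LAN (placeholder network).
    interface {
        ringnumber: 0
        bindnetaddr: 192.168.1.0
        mcastaddr: 226.94.1.1
        mcastport: 5405
    }

    # Second ring, e.g. a dedicated back-to-back link (placeholder network).
    interface {
        ringnumber: 1
        bindnetaddr: 10.0.0.0
        mcastaddr: 226.94.1.2
        mcastport: 5407
    }
}
```

The two rings should use physically separate networks, otherwise a single
cable or switch failure still takes both down. Check the corosync.conf man
page of your version for the exact options it supports.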

If corosync redundant rings do not work for you,
try heartbeat with multiple communication links.

That way, if just one communication channel is down, the cluster won't
need to do its full "recovery after node failure".

BTW, when using a shared disk (as opposed to DRBD), you'd really have to
use stonith, and any "felt" or real node failure would necessarily lead
to stonith. Just think about that.

Then reconsider your expectations about what the cluster should do if
the communication layer declares a node dead.
And see how you can improve your setup to avoid spurious "node dead"
events.  As already mentioned, usually the probability of false
positives (or is it false negatives here?) can be reduced by using
redundant communication channels.

-- 
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.