[Pacemaker] how to get pacemaker:ping recheck before promoting drbd resources on a node

Tue Apr 19 05:31:54 EDT 2011

On Mon, Apr 18, 2011 at 8:57 PM, Jelle de Jong
<jelledejong at powercraft.nl> wrote:
> Hello everybody,
>
> I need to be able to bring down my network interface (network failure
> test) and a few seconds later bring it up again, without my drbd cluster
> going nuts and creating split brains.
>
> I was advised to use ocf:pacemaker:ping, so I started to integrate this
> in my configuration: http://pastebin.com/raw.php?i=iyp3URkP

If the underlying messaging/membership layer goes into spasms -
there's not much ping can do to help you.
What version of corosync have you got?  Some versions have been better
than others.

> Now the problem is that it kind of works, but not the way I need it to be.
>
> The ping status is not rechecked right _before_ it tries to promote the
> drbd resources.

Correct, it's checked periodically.

> It should do a fast ping check and continue if
> successful but _don’t_ promote any drbd resources when it stalls or fails.

That is something that would need to be added to the drbd agent.
Alternatively, configure the ping resource to update more frequently.
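
As a rough sketch of what "more frequently" could look like in crm shell
syntax (the resource names p_ping and cl_ping, the host address and all
of the values are assumptions for illustration, not taken from your
pastebin):

```
# Hypothetical names and values; adjust to your own setup.
primitive p_ping ocf:pacemaker:ping \
    params host_list="192.168.1.1" multiplier="1000" \
           dampen="2s" attempts="2" timeout="1" \
    op monitor interval="2s" timeout="10s"
clone cl_ping p_ping
```

A shorter monitor interval and a smaller dampen value make the pingd
attribute react sooner, but the check is still periodic, so a window
remains in which the attribute is stale.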

>
> The problem is that the ping has been returning good values until
> the network failure, and when the failure occurs it still thinks
> the ping status is good and promotes the disk. A few seconds
> later the ping status changes to indicate the network failure, but by then
> all the damage is already done...
>
> I must be doing something _terribly_ wrong, since I can't believe a
> pacemaker/corosync cluster shouldn't be able to survive a network glitch
> (short network failures) without all kinds of split brains and losing the
> node.

But you did lose the node.
The cluster can't see into the future to know that it will come back in a bit.

What token timeouts are you using?
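
For reference, the token timeout is set in the totem section of
corosync.conf. The fragment below is only a sketch of where to look; the
10000 ms value is an assumption chosen for illustration, not a
recommendation for your cluster:

```
totem {
    version: 2
    # Illustrative value: milliseconds without receiving the token
    # before the node is declared lost.
    token: 10000
}
```

If the interface is down for longer than this timeout, the node will
still be declared lost.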