[Pacemaker] how to get pacemaker:ping recheck before promoting drbd resources on a node

Tue Apr 19 08:52:18 EDT 2011

On Tue, Apr 19, 2011 at 11:54 AM, Jelle de Jong
<jelledejong at powercraft.nl> wrote:
> On 19-04-11 11:31, Andrew Beekhof wrote:
>> If the underlying messaging/membership layer goes into spasms -
>> there's not much ping can do to help you. What version of corosync
>> have you got?  Some versions have been better than others.
>
> corosync 1.2.1-4
> pacemaker 1.0.9.1+hg15626-1
> /etc/debian_version 6.0.1 (stable)
>
>> Correct, it's checked periodically.
>
> Can I change the config so that a ping check is done before promoting drbd?

No. As I said, you'd need to add this to the agent itself.
We just make sure things are in a certain state before
starting/promoting other resources - we don't call specific actions.
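
To illustrate what "a certain state" means here: the ping agent only
publishes a node attribute, and constraints are evaluated against its
last value. A rough, untested sketch in crm shell syntax, assuming a
master/slave resource called ms_drbd (a hypothetical name, not taken
from the linked config) and the agent's default attribute name pingd:

```
# ms_drbd and the constraint id are placeholder names (assumptions).
# Forbid the Master role wherever the pingd attribute is missing or 0.
location loc_drbd_master_needs_ping ms_drbd \
    rule $role="Master" -inf: not_defined pingd or pingd lte 0
```

This still reacts only to the value the last ping monitor wrote, so it
does not force a fresh ping at promote time.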

>
> I tried adding a separate ping0: http://pastebin.com/raw.php?i=2WD1HKnC
> I thought it worked, but ping0 starts and drbd is still promoted, probably
> because ping0 returns a successful start and does not return an error
> when the actual ping fails. So I tried adding additional location
> rules for ping0, but then the resource is not started at all anymore:
> http://pastebin.com/raw.php?i=DXqRzMNs
>
>> That is something that would need to be added to the drbd
>> agent. Alternatively, configure the ping resource to update more
>> frequently.
>
> How can this be done? crm ra info ocf:ping doesn't show much info. I
> tried using attempts="1" dampen="1" timeout="1" and monitor
> interval="1". An example how to do frequent fast ping would be welcome.

A monitor with interval=1, timeout=1 and dampen=0 should give the
closest behavior to what you're after.
Make sure interval is set on the monitor operation and not as a
resource parameter, though.
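
A minimal, untested sketch of such a resource in crm shell syntax. The
resource names and the host_list address are placeholders, and
attempts=1 together with the agent's own timeout=1 are carried over from
the question rather than recommended values:

```
# Placeholder names and address (assumptions); adjust to your setup.
# interval appears only on the monitor op, not under params.
primitive ping0 ocf:pacemaker:ping \
    params host_list="192.0.2.1" dampen="0" attempts="1" timeout="1" \
    op monitor interval="1s" timeout="1s"
clone cl_ping0 ping0
```

The operation timeout has to cover the agent's whole ping run across all
hosts in host_list, so 1s may turn out to be too tight and cause monitor
failures; raise it if that happens.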

>
> If I can make the ping check fast enough to detect network failures
> before corosync tells pacemaker the other node disappeared/failed, this may
> provide a workaround.
>
>> But you did lose the node. The cluster can't see into the future to
>> know that it will come back in a bit. What token timeouts are you
>> using?
>
> True, but the node should see that its own network is down, recognize
> that it is the one that failed, and wait until its network is back and
> check its situation again before doing things with its resources.

The cluster does not understand the network topology in the way you do.

> My corosync.conf with token 3000: http://pastebin.com/Y5Lkf4Ch

Increasing that will tell the cluster to wait a bit longer before
declaring a node dead.
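
For reference, the setting lives in the totem section of corosync.conf
and is given in milliseconds. A fragment only, with 10000 picked purely
as an illustrative value; the rest of the section stays as it is:

```
totem {
    version: 2
    # token timeout in ms; 10000 is an arbitrary example value (assumption),
    # the config linked above uses 3000
    token: 10000
}
```

If consensus is set explicitly in the same section, it needs to remain
at least 1.2 times the token value.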

>
> Thanks in advance,
>
> Any help is much appreciated,
>
> Kind regards,
>
> Jelle de Jong
>