[ClusterLabs] SBD stonith in 2 node cluster - how to make it prefer one side of cluster?

Klaus Wenninger kwenning at redhat.com
Sat Nov 25 10:38:54 EST 2017


On 11/25/2017 07:52 AM, Andrei Borzenkov wrote:
> Wrapping my head around how pcmk_delay_max works, my understanding is
>
> - on startup pacemaker always starts one instance of stonith/sbd; it
> probably randomly selects a node for it. I suppose this initial start is
> delayed by a random number within pcmk_delay_max.
>
> - when the cluster is partitioned, pacemaker *also* starts one instance of
> stonith/sbd in each partition where it is not yet running. This startup
> is also delayed by a random number within pcmk_delay_max.
>
> - this makes the partition that already has stonith/sbd running win the
> race for the kill request
>
> Is my understanding correct?

No, pcmk_delay_max delays the stonith action (like 'reboot') itself
by a random time between 0 and the given value. (Btw. meanwhile
there is pcmk_delay_base as well, which lets you modify the minimum
wait here too.)
Whether a fencing device is actually started or not shouldn't make
much difference. IIRC being started just determines whether the
periodic monitoring (when configured) is done; pacemaker is going
to use fencing devices that aren't started as well.
What should make one node win the race, instead of both shooting
each other, is the randomness in the delay.
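
In crmsh notation that could e.g. look like this (values are just
placeholders, and pcmk_delay_base needs a new enough pacemaker):

    # reboot gets delayed by a random time up to 15s, with a
    # minimum wait of 5s
    primitive fencing_sbd stonith:external/sbd \
            params pcmk_delay_base=5 pcmk_delay_max=15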

Mind that it might look as if a fencing device would have to be
started to be used, because of a bug in pacemaker that was fixed
just recently. But that definitely isn't behavior you should
rely on.
Speaking of that fix, iirc:

commit 4ad87e494d56e18bcaafa058554573f890517eed
Author: Klaus Wenninger <klaus.wenninger at aon.at>
Date:   Fri Jul 21 17:57:48 2017 +0200

    Fix: stonith-ng: make fencing-device reappear properly after reenabling

>
> If yes, consider a two node cluster where one application is more
> important than the other. The obvious example is a replicated database -
> in case of split brain we want to preserve the node with the primary as
> it likely has active connections.
>
> Would using advisory colocation constraint between application and
> stonith/sbd work? Let's consider (using crmsh notation)
>
> primitive my_database
> ms my_replicated_database my_database
> primitive fencing_sbd stonith:external/sbd params pcmk_delay_max=15
> colocation prefer_primary 10: fencing_sbd my_replicated_database:Master
>
> Is it going to work?
>
> It should work on startup, as it simply affects where the sbd resource is
> placed initially, and pacemaker needs to make this decision anyway.
>
> I expect it to work if the my_replicated_database master moves to another
> node - pacemaker should move the sbd resource too, right? It does add a
> small window where no stonith agent is running, but as I understand it,
> pacemaker is going to start it anyway in case of split brain, so in the
> worst case the non-preferred node will be fenced, which is not worse than
> what we have already.
>
> What I am not sure about is what happens during split brain. Will the
> colocation affect pacemaker's decision to start another copy of the sbd
> resource on the other partitioned node? I hope not; as it is advisory,
> it should still use the only available node left in this case?
>
> Does it all make sense? Has anyone used it in real life?

As said, if a fencing resource is basically runnable on a node,
you can't generically prevent it from being used via rules.
That is basically the reason as well why you don't have to
clone your fencing primitive.
You can, however, disable it on a node using a -inf rule (see
the example below), or set target-role=Stopped to disable it
globally.
More complicated rules will get evaluated as well, but that
is not reliable, as the outcome might change over time and
stonithd isn't triggered to do a re-evaluation.
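
E.g. in crmsh notation (untested, names made up) a rule like

    location fencing-sbd-not-node1 fencing_sbd -inf: node1

keeps node1 from running or using the fencing_sbd resource, and
target-role=Stopped is what 'crm resource stop fencing_sbd' would
set for you.
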
What you can do is have dedicated active/standby nodes and two
explicit sbd fencing resources, each located to a fixed node,
with a higher (fixed) delay on the one located on the standby
node.
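
A sketch of that in crmsh (node names and the delay value are just
assumptions; adapt them to your environment):

    # node1 is the preferred/active node, node2 the standby
    primitive fence-of-node1 stonith:external/sbd \
            params pcmk_host_list=node1 pcmk_delay_base=20
    primitive fence-of-node2 stonith:external/sbd \
            params pcmk_host_list=node2
    # each device is banned from the node it targets, so it is
    # only usable by the respective peer
    location fence-of-node1-not-node1 fence-of-node1 -inf: node1
    location fence-of-node2-not-node2 fence-of-node2 -inf: node2

In a split brain node2's shot at node1 is then held back by the
fixed 20s delay, so the preferred node1 should win the race.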

If you want something dynamic you can pick up the idea behind
fence_heuristics_ping (recently added to the fence-agents
package).
The basic idea is outlined in its description:
"fence_heuristics_ping uses ping-heuristics to control
execution of another fence agent on the same fencing
level."
That relies on the devices of a fencing level being executed
in strict order, aborting with failure if one of them fails.
While this agent is designed to completely prevent a node with
a bad uplink from fencing its peer, you can of course have
something custom that merely introduces a delay based on
whatever criterion you like.
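
As an illustration, a single fencing level combining the two could
look like this in crmsh (node names, the ping target and the
parameter name are from memory - check the metadata of
fence_heuristics_ping before using it):

    # succeeds only if the ping target is reachable from the node
    # that wants to fence
    primitive fence-ping stonith:fence_heuristics_ping \
            params ping_targets=192.168.1.1
    primitive fence-sbd stonith:external/sbd \
            params pcmk_delay_max=15
    # both devices on the same level: if fence-ping fails, the
    # level is aborted and fence-sbd never gets executed
    fencing_topology \
            node1: fence-ping,fence-sbd \
            node2: fence-ping,fence-sbd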

Regards,
Klaus
>
>
> _______________________________________________
> Users mailing list: Users at clusterlabs.org
> http://lists.clusterlabs.org/mailman/listinfo/users
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org




