[ClusterLabs] Antw: changing on-fail action default

Thu Oct 19 10:21:10 EDT 2017

On Thu, 2017-10-19 at 17:08 +0900, Christian Balzer wrote:
> On Thu, 19 Oct 2017 09:57:31 +0200 Ulrich Windl wrote:
> 
> > > > > Nikola Ciprich <nikola.ciprich at linuxbox.cz> wrote on
> > > > > 19.10.2017 at 09:46 in
> > > > > message <20171019074630.GA23856 at pcnci.linuxbox.cz>:
> > > Hi fellow pacemaker users,  
> > 
> > Hi!
> > 
> > > 
> > > I'd like to ask whether it is possible to change the default
> > > on-fail action.
> > > I don't want it to be "fence" but "block", even for clusters with
> > > fencing.

Yes, there is an rsc_defaults section to set defaults for resource
attributes, and an op_defaults section to set defaults for resource
operation attributes (including on-fail). The command you use to set
them depends on what tool you're using (e.g., with pcs, see "pcs
resource op defaults").
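
As a rough sketch with pcs (the exact syntax depends on the pcs
version installed, so check the pcs man page before relying on either
form), setting the default could look like this:

```shell
# Sketch only: set on-fail=block as the cluster-wide operation default.
# Use whichever of the two forms your pcs version accepts.

# Older pcs releases (0.9.x) take the option directly:
pcs resource op defaults on-fail=block

# Newer pcs releases expect an explicit "update" subcommand:
pcs resource op defaults update on-fail=block

# With no arguments, pcs shows the operation defaults currently set:
pcs resource op defaults
```

Keep in mind that op_defaults applies to every operation that does not
set on-fail itself (monitor and start as well as stop), and that an
on-fail value configured on an individual operation still overrides
the default.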

> > 
> > This would mean any cluster with a problem would require manual
> > intervention!
> > 
> > > 
> > > But I don't want to have to change it for each resource.
> > > 
> > > Is it possible to set a global default?
> > 
> > See above. Still, I understand what you are asking for. What's
> > missing in pacemaker is a "time to fix the mess" interval (I
> > vaguely remember HP-UX ServiceGuard had such a thing): if the
> > cluster detects a problem that would cause a node to be fenced, it
> > waits some seconds or minutes to see whether things change, and
> > then (if things are still bad) fences the node. However, if the
> > reason for fencing is no longer there, no fencing is done...

You can set delays on fence actions. Some fence agents have delay
parameters themselves, or you can set one at the pacemaker level with
pcmk_delay_max (for a random delay) or (with the forthcoming 1.1.18)
pcmk_delay_base (for a fixed delay). So, you could even set
pcmk_delay_base=60m to wait an hour before executing fencing.
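
For illustration, assuming pcs and a fence device whose ID is
"my-fence-device" (a made-up name; substitute your own device ID), the
pacemaker-level delays could be set roughly like this:

```shell
# Sketch only: "my-fence-device" is a hypothetical fence device ID.

# Random delay, up to 30 seconds, before the device is used:
pcs stonith update my-fence-device pcmk_delay_max=30s

# Fixed delay of 60 minutes (needs Pacemaker 1.1.18 or later):
pcs stonith update my-fence-device pcmk_delay_base=60m
```

These are ordinary fence device parameters, so other configuration
tools can set them in the same way they set any other device option.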

> > As far as I understand pacemaker, a fencing request cannot be
> > revoked once issued  (it's in the queue of actions).

Correct, so even with the delay, fencing would eventually be done, but
it would give you time to investigate and prepare for the shutdown.

There has been some discussion recently about allowing fencing to be
cancelled under certain situations. The easiest to implement would be
to be able to cancel fencing if it's in the delay period (so, no
commands have been sent yet to any devices). The idea that was
discussed was to cancel any delayed operations when a fence device is
disabled in the configuration.

> Yeah, that strict sequential operation can be a major PITA,
> especially if the reason for whatever action is long gone.
> 
> Christian

Even with any of the above suggestions, fencing will always have to
strictly precede recovery. If the cluster can't communicate with the
node, fencing is the only way to be sure it's unable to cause
conflicts.

But it's fine for "fencing" to be manual, i.e. having an admin
manually investigate, reboot the machine, and use
stonith_admin --confirm to say that fencing has been done.
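
A minimal sketch of that manual acknowledgement, assuming the failed
node is called "node2" (a made-up name):

```shell
# Sketch only: "node2" is a hypothetical node name.
# Run this ONLY after verifying the node really is powered off or
# has been rebooted.
stonith_admin --confirm node2

# pcs offers an equivalent command:
pcs stonith confirm node2
```

Only one of the two commands is needed. Confirming a node that is in
fact still running tells the cluster it is safe to recover resources
elsewhere, which risks exactly the conflicts fencing is meant to
prevent.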
-- 
Ken Gaillot <kgaillot at redhat.com>