[ClusterLabs] Completely disabled resource failure triggered fencing
Ken Gaillot
kgaillot at redhat.com
Mon Jan 18 14:02:26 EST 2021
On Mon, 2021-01-18 at 13:01 +0000, Strahil Nikolov wrote:
> Have you tried the on-fail=ignore option?
>
> Best Regards,
> Strahil Nikolov
on-fail=ignore will act as if the operation succeeded, which probably
isn't desired here. It's usually used for flaky/buggy devices/agents
that sometimes (or always) report failure for successful starts or
monitors, or for noncritical resources where a monitor failure is
interesting (in status displays) but it's not worth doing anything
about.
on-fail=block does make more sense (essentially it means "wait for a
human to look into it"). Also I'm not sure whether on-fail=ignore is
allowed for stop.
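For illustration, a minimal sketch of what per-operation on-fail
settings look like in pcs (ocf:pacemaker:Dummy here is just a stand-in
resource; the name and interval are arbitrary):

    # Monitor failures are only reported; start failures wait for a human.
    pcs resource create test-rsc ocf:pacemaker:Dummy \
        op monitor interval=60 on-fail=ignore \
           start on-fail=block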
>
> On Sunday, January 17, 2021 at 20:45:27 GMT+2, Digimer <lists at alteeve.ca> wrote:
>
> Hi all,
>
> I'm trying to figure out how to define a resource such that, if it
> fails in any way, it will not cause pacemaker to self-fence. The
> reasoning is that there are relatively minor ways to fault a single
> resource (these are VMs, so for example, a bad edit to the XML
> definition renders it invalid, or the definition is accidentally
> removed).
>
> In a case like this, I fully expect that resource to enter a failed
> state. Of course, pacemaker won't be able to stop it, migrate it,
> etc.
> When this happens currently, it causes the host to self-fence, taking
> down all other hosted resources (servers). This is less than ideal.
>
> Is there a way to tell pacemaker that, if it's unable to manage a
> resource, it should flag it as failed and leave it at that? I've been
> trying to do this, and my config so far is:
>
> pcs resource create srv07-el6 ocf:alteeve:server name="srv07-el6" \
>     meta allow-migrate="true" target-role="stopped" \
>     op monitor interval="60" \
>        start timeout="INFINITY" on-fail="block" \
>        stop timeout="INFINITY" on-fail="block" \
>        migrate_to timeout="INFINITY"
>
> This is getting cumbersome and still, in testing, I'm finding cases
> where the node gets fenced when something breaks the resource in a
> creative way.
I'd expect the above to work. As discussed in the other thread, one
case where it can't work is when the configuration isn't there. :) If
you've found some other case where it doesn't work as expected, let me
know. (Of course, there's also the separate possibility of node
failure, manual or DLM-initiated fencing, etc., but I'm sure you're
familiar with all that.)
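If it helps with testing, a quick sketch of how I'd check what the
cluster actually recorded after a provoked failure (assuming a recent
pcs; older versions use "pcs resource show" in place of "pcs resource
config"):

    crm_mon --one-shot                     # overall status, incl. failed actions
    pcs resource failcount show srv07-el6  # per-node fail counts for the resource
    pcs resource config srv07-el6          # confirm the on-fail settings took effect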
>
> Thanks for any insight/guidance!
>
--
Ken Gaillot <kgaillot at redhat.com>