[ClusterLabs] Monitoring of Fence Agents?

Wed May 2 12:11:29 EDT 2018

On Tue, 2018-05-01 at 20:49 +0000, Hayden,Robert wrote:
> Let me first thank this community for being so active.  I have
> learned a lot by watching the discussions.
>  
> I have noticed that in our environment, I am seeing a high rate of my
> RHEL 7 fence agent (fence_ipmilan) timing out on monitoring
> operations.  We use HP iLO 3/4/5 power fencing.  I have attempted to
> figure out why we are seeing timeouts, but nothing appears to be
> misconfigured and there is not a pattern to the fence agent
> failures.
>  
> The timeout then shows up on pcs status output and the fence agent
> resource, if it does not relocate or restart, moves to a Stopped
> state.   I have tried to lengthen the monitoring to 15 minutes and
> start timeout to 125 seconds, but we are still getting complaints from

With those long timeouts, I'm guessing the iLO is occasionally
returning an error result (as opposed to not responding at all).

> the System Admins.  They want a nice clean pcs status output and tend
> to freak out with the pcs stonith cleanup --node <node> command as it
> shows a false cycling of all resources (vip, fs, app, etc) for the
> node.

Good news on that front:

* You can clean up just the stonith resource with pcs stonith cleanup
<resource> --node <node> 

* In Pacemaker 2.0, cleanup will clean only resource failures, rather
than redetect all resources. There will be a separate command for full
redetection.

* Rather than clean up the failures, you can use the failure-timeout
resource meta-attribute to automatically expire them after a certain
amount of time.
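
As a rough sketch of the first and third options, using the device and
node names from the config quoted further down: the 30min value is only
an example, and the exact subcommand for setting meta-attributes on a
stonith resource may differ between pcs versions, so check pcs help
before relying on it.

```shell
# Dry run by default: commands are printed, not executed.
# On a cluster node, run with PCS=pcs in the environment to apply them.
PCS="${PCS:-echo pcs}"

# Clear the failure history of one stonith resource on one node only.
$PCS stonith cleanup fence_tval13 --node tval13

# Or let recorded failures expire on their own.
# 30min is an arbitrary example value, not a recommendation.
$PCS resource meta fence_tval13 failure-timeout=30min
```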

> Wondering if anyone has real-world experience with some of the
> timeouts provided by fence_ipmilan with HP iLO devices.  In
> particular, I was looking at pcmk_status_timeout and
> pcmk_status_retries from pcs stonith describe fence_ipmilan --full.

Exactly, pcmk_*_retries was intended for devices with flaky responses.
Try raising it and see if the frequency of problems goes down. I think
it's pcmk_monitor_retries though; I forget why status is separate.
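
Something along these lines should raise the retry count on both
devices. The value 4 is just a starting point picked for illustration;
adjust it and watch whether the failure rate changes.

```shell
# Dry run by default: commands are printed, not executed.
# On a cluster node, run with PCS=pcs in the environment to apply them.
PCS="${PCS:-echo pcs}"

# Allow more monitor attempts before the operation is reported as failed.
# 4 is an arbitrary example value.
for dev in fence_tval13 fence_tval14; do
    $PCS stonith update "$dev" pcmk_monitor_retries=4
done
```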

> How critical is the monitoring for the fence resources inside of
> pacemaker?   Can I simply disable the monitoring operation?  We have
> an independent job that periodically verifies HP iLO setup for
> fencing (did this in RHEL 6).

Monitoring is not essential. Pacemaker can use a fence device even if
the monitor fails or the resource is stopped due to failure.

However, the point of monitoring is to let you know if your fence device
died, before you need it. :) In your case, if you're monitoring it
separately, it would be OK to disable monitoring (or set
pcmk_monitor_action=metadata so it always succeeds).
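
For the second option, a minimal sketch with the device names from your
config would be the following. Keep in mind that after this change the
cluster's monitor no longer talks to the iLO at all, so your independent
check becomes the only warning of a dead fence device.

```shell
# Dry run by default: commands are printed, not executed.
# On a cluster node, run with PCS=pcs in the environment to apply them.
PCS="${PCS:-echo pcs}"

# Map the monitor action to the agent's metadata action, which does not
# contact the iLO, so the recurring monitor always succeeds.
for dev in fence_tval13 fence_tval14; do
    $PCS stonith update "$dev" pcmk_monitor_action=metadata
done
```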

> From internal R&D testing, it appears that if the fence agent is
> “failed” or “stopped”, and the cluster actually needs to fence a
> node, then the cluster will re-attempt the fence agent start and
> fence the node.

I don't recall any particular recovery logic there. It will try to
recover a failed fence device according to whatever policy you have
configured (by default, try restarting it a million times).

>  
> Here is the config
>  
> Stonith Devices:
> Resource: fence_tval13 (class=stonith type=fence_ipmilan)
>   Attributes: ipaddr=X.X.X.X lanplus=1 login=XXXXX method=onoff passwd=XXXXX pcmk_host_list=tval13 power_wait=20 privlvl=OPERATOR
>   Operations: monitor interval=15m (fence_tval13-monitor-interval-15m)
>               start interval=0s timeout=125s (fence_tval13-start-interval-0s)
> Resource: fence_tval14 (class=stonith type=fence_ipmilan)
>   Attributes: ipaddr=Y.Y.Y.Y lanplus=1 login=YYYYY method=onoff passwd=YYYYY pcmk_host_list=tval14 power_wait=20 privlvl=OPERATOR
>   Operations: monitor interval=15m (fence_tval14-monitor-interval-15m)
>               start interval=0s timeout=125s (fence_tval14-start-interval-0s)

It occurs to me that it may be a good idea to set pcmk_monitor_timeout
the same as the operation timeout. The operation timeout applies to the
recurring monitoring initiated by the controller daemon, which is what
you're interested in, but pcmk_monitor_timeout would apply when the
fencing daemon needs to execute a monitor action on its own (which
probably doesn't happen often).
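
A sketch of that, with the caveat that the monitor operations in the
config above show no explicit timeout, so the 120s used here is only a
placeholder; substitute whatever timeout your monitor operation
actually has.

```shell
# Dry run by default: commands are printed, not executed.
# On a cluster node, run with PCS=pcs in the environment to apply them.
PCS="${PCS:-echo pcs}"

# Placeholder value: set this to match the monitor operation's timeout.
MONITOR_TIMEOUT=120s

for dev in fence_tval13 fence_tval14; do
    $PCS stonith update "$dev" pcmk_monitor_timeout="$MONITOR_TIMEOUT"
done
```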

> Thanks
> Robert
-- 
Ken Gaillot <kgaillot at redhat.com>