[ClusterLabs] Monitoring of Fence Agents?

Hayden,Robert RHAYDEN at CERNER.COM
Tue May 1 16:49:18 EDT 2018


Let me first thank this community for being so active.  I have learned a lot by watching the discussions.

I have noticed that in our environment we are seeing a high rate of timeouts on monitoring operations for our RHEL 7 fence agent (fence_ipmilan).  We use HP iLO 3/4/5 power fencing.  I have attempted to figure out why we are seeing timeouts, but nothing appears to be misconfigured and there is no pattern to the fence agent failures.

The timeout then shows up in the pcs status output, and the fence agent resource, if it does not relocate or restart, moves to a Stopped state.  I have tried lengthening the monitor interval to 15 minutes and the start timeout to 125 seconds, but we are still getting complaints from the System Admins.  They want a nice clean pcs status output and tend to freak out at the pcs stonith cleanup --node <node> command, as it shows a false cycling of all resources (vip, fs, app, etc.) for the node.
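
For illustration, a minimal sketch of a narrower cleanup that names a single fence resource instead of cleaning up everything on the node. It assumes the installed pcs accepts a stonith id before --node; the helper name cleanup_fence is hypothetical, and pairing fence_tval13 with node tval14 is only an example. Whether this avoids the apparent cycling of the other resources is not confirmed here. The helper prints the command by default and only runs it when RUN=1 is set.

```shell
#!/bin/sh
# Sketch only: clear the failure history of ONE fence resource on ONE node.
# cleanup_fence is a hypothetical helper name, not part of pcs.
# Prints the command by default; set RUN=1 to actually execute it.
cleanup_fence() {
    fence_id=$1   # stonith resource id, e.g. fence_tval13 from the config
    node=$2       # node where the monitor failure was recorded (example value)
    set -- pcs stonith cleanup "$fence_id" --node "$node"
    if [ "${RUN:-0}" = "1" ]; then
        "$@"
    else
        echo "$*"
    fi
}

cleanup_fence fence_tval13 tval14
```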

I am wondering if anyone has real-world experience with some of the timeouts provided by fence_ipmilan with HP iLO devices.  In particular, I was looking at pcmk_status_timeout and pcmk_status_retries from pcs stonith describe fence_ipmilan --full.
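
As a sketch of how those two options might be set on an existing device, assuming pcs stonith update accepts them as device options: the values 120s and 3 are placeholders, not recommendations, and the helper name set_status_tuning is hypothetical. It is also an open assumption that the status options, rather than a monitor-specific timeout, are what govern the recurring monitor operation; the describe output should settle that. The helper prints the command by default and only runs it when RUN=1 is set.

```shell
#!/bin/sh
# Sketch only: set the status timeout and retry count on a fence device.
# set_status_tuning is a hypothetical helper; the values are placeholders.
# Prints the command by default; set RUN=1 to actually execute it.
set_status_tuning() {
    fence_id=$1   # stonith resource id, e.g. fence_tval13
    timeout=$2    # value for pcmk_status_timeout, e.g. 120s
    retries=$3    # value for pcmk_status_retries, e.g. 3
    set -- pcs stonith update "$fence_id" \
        "pcmk_status_timeout=$timeout" \
        "pcmk_status_retries=$retries"
    if [ "${RUN:-0}" = "1" ]; then
        "$@"
    else
        echo "$*"
    fi
}

set_status_tuning fence_tval13 120s 3
```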

How critical is the monitoring for the fence resources inside of pacemaker?  Can I simply disable the monitoring operation?  We have an independent job that periodically verifies the HP iLO setup for fencing (we did the same on RHEL 6).
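
To make the question concrete, one way the monitor operation could be dropped is by removing it by its operation id, as listed in the config in this post. This is a sketch under the assumption that pcs resource op remove also works on stonith resource operations in the installed pcs version; the helper name remove_monitor_op is hypothetical. It prints the command by default and only runs it when RUN=1 is set.

```shell
#!/bin/sh
# Sketch only: remove a recurring monitor operation by its operation id.
# remove_monitor_op is a hypothetical helper, not part of pcs.
# Prints the command by default; set RUN=1 to actually execute it.
remove_monitor_op() {
    op_id=$1   # operation id, e.g. fence_tval13-monitor-interval-15m
    set -- pcs resource op remove "$op_id"
    if [ "${RUN:-0}" = "1" ]; then
        "$@"
    else
        echo "$*"
    fi
}

remove_monitor_op fence_tval13-monitor-interval-15m
```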

From internal R&D testing, it appears that if the fence agent is "failed" or "stopped", and the cluster actually needs to fence a node, then the cluster will re-attempt the fence agent start and fence the node.

Here is the config

Stonith Devices:
Resource: fence_tval13 (class=stonith type=fence_ipmilan)
  Attributes: ipaddr=X.X.X.X lanplus=1 login=XXXXX method=onoff passwd=XXXXX pcmk_host_list=tval13 power_wait=20 privlvl=OPERATOR
  Operations: monitor interval=15m (fence_tval13-monitor-interval-15m)
              start interval=0s timeout=125s (fence_tval13-start-interval-0s)
Resource: fence_tval14 (class=stonith type=fence_ipmilan)
  Attributes: ipaddr=Y.Y.Y.Y lanplus=1 login=YYYYY method=onoff passwd=YYYYY pcmk_host_list=tval14 power_wait=20 privlvl=OPERATOR
  Operations: monitor interval=15m (fence_tval14-monitor-interval-15m)
              start interval=0s timeout=125s (fence_tval14-start-interval-0s)


Thanks
Robert



