[ClusterLabs] Q: ordering for a monitoring op only?
jpokorny at redhat.com
Mon Aug 20 16:39:40 UTC 2018
On 20/08/18 10:51 +0200, Ulrich Windl wrote:
> I wonder whether it's possible to run a monitoring op only if some
> specific resource is up.
> Background: We have some resource that runs fine without NFS, but
> the start, stop and monitor operations will just hang if NFS is
> down. In effect the monitor operation will time out, the cluster
> will try to recover, calling the stop operation, which in turn will
> time out, making things worse (i.e.: causing a node fence).
> So my idea was to pause the monitoring operation while NFS is down
> (NFS itself is controlled by the cluster and should recover "rather
> soon" TM).
> Is that possible?
> And before you ask: No, I have not written the RA that has the
> problem; a multi-million-dollar company wrote it. (Years before, I had
> written a monitor for HP-UX's cluster that did not have this problem,
> even though the configuration files were read from NFS. It's not
> magic: just periodically copy them to shared memory, and read the
> config from shared memory.)
Sorry for stating the likely obvious; in a similar spirit, if the agent
at hand allows configuring the config location, you can synchronize
the shared copy into offline, node-local mirrors, e.g. using csync2.
The problem then boils down to whether the "cluster approved,
synchronized and fresh" version is what gets used.
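
To illustrate the "fresh" part only, here is a minimal Python sketch,
not csync2 itself and not anything the agent in question provides.
The names refresh_mirror and mirror_is_fresh, the paths and the age
threshold are all hypothetical. The refresh step would run
periodically outside the monitor path (it can still block if NFS is
down); the agent would then only ever read the node-local copy.

```python
import os
import shutil
import tempfile
import time


def refresh_mirror(shared_path, local_path):
    """Copy shared_path to local_path without exposing a partial file.

    Hypothetical helper: meant to run periodically, outside of the
    monitor action, because the copy itself may block on NFS.
    """
    local_dir = os.path.dirname(os.path.abspath(local_path))
    fd, tmp_path = tempfile.mkstemp(dir=local_dir)
    os.close(fd)
    try:
        # copyfile does not preserve timestamps, so the mirror's mtime
        # records when it was refreshed, not when the source changed.
        shutil.copyfile(shared_path, tmp_path)
        # Rename within one directory, so readers see old or new, never half.
        os.replace(tmp_path, local_path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise


def mirror_is_fresh(local_path, max_age_seconds, now=None):
    """Return True if the mirror exists and was refreshed recently enough."""
    if now is None:
        now = time.time()
    try:
        refreshed_at = os.stat(local_path).st_mtime
    except FileNotFoundError:
        return False
    return (now - refreshed_at) <= max_age_seconds
```

Whether a stale mirror should make the resource fail, or merely be
logged, is a policy decision this sketch deliberately leaves open.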
It doesn't look like there's any silver bullet; any attempt to bypass
"holistic integrity" (the native approach with pacemaker; anything
else is swimming against the stream) may bite you or affect HA
at some possibly unanticipated point.
If you don't want to or cannot meddle with the resource agents
(wrapping call-outs, etc.), your best bet is to ask the respective
author/vendor to honour OCF_CHECK_LEVEL in the "monitor" action
properly, meaning that no file-based traversal (possibly getting stuck
on NFS access) would be attempted by default (level "0"; it could be
at level "10" or more), and then not to set it artificially to higher
levels in your configuration (or to make it conditional, similar to
what Ken suggested). Obviously, this won't fix "stop" issues, for
instance.
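
For illustration, a monitor action honouring OCF_CHECK_LEVEL could be
structured roughly as below. This is a sketch under assumptions, not
the vendor's agent: the process_alive and deep_check callables are
hypothetical stand-ins for a cheap local check and for the expensive
(possibly NFS-touching) one. Only the OCF_CHECK_LEVEL environment
variable and the numeric exit codes come from the OCF resource agent
conventions.

```python
# Exit codes as defined by the OCF resource agent API.
OCF_SUCCESS = 0
OCF_ERR_GENERIC = 1
OCF_NOT_RUNNING = 7


def check_level(environ):
    """Parse OCF_CHECK_LEVEL, falling back to 0 when unset or malformed."""
    try:
        return int(environ.get("OCF_CHECK_LEVEL", "0"))
    except ValueError:
        return 0


def monitor(environ, process_alive, deep_check):
    """Return an OCF exit code for the monitor action.

    process_alive: hypothetical cheap check that touches nothing on NFS.
    deep_check:    hypothetical expensive check; only run at level >= 10.
    """
    if not process_alive():
        return OCF_NOT_RUNNING
    if check_level(environ) >= 10 and not deep_check():
        return OCF_ERR_GENERIC
    return OCF_SUCCESS
```

An agent would call something like monitor(os.environ, ...) and exit
with the returned code. At the default level the deep check is never
invoked, so a hung NFS mount cannot stall the monitor through it.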