[ClusterLabs] Q: ordering for a monitoring op only?

Ken Gaillot kgaillot at redhat.com
Mon Aug 20 10:49:26 EDT 2018


On Mon, 2018-08-20 at 10:51 +0200, Ulrich Windl wrote:
> Hi!
> 
> I wonder whether it's possible to run a monitoring op only if some
> specific resource is up.
> Background: We have a resource that runs fine without NFS, but its
> start, stop, and monitor operations will just hang if NFS is down. In
> effect, the monitor operation will time out, the cluster will try to
> recover by calling the stop operation, which in turn will time out,
> making things worse (i.e., causing a node fence).
> 
> So my idea was to pause the monitoring operation while NFS is down
> (NFS itself is controlled by the cluster and should recover "rather
> soon" TM).
> 
> Is that possible?

A possible mitigation would be to set on-fail=block on the dependent
resource's monitor operation: if NFS is down, the monitor will still
time out, but the cluster will not try to stop the resource. Of course,
you then lose the ability to recover automatically from a genuine
resource failure.
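For example, with pcs (just a sketch; "bigapp" and the 30-second
interval are placeholders for whatever the real resource uses):

    pcs resource update bigapp op monitor interval=30s on-fail=block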

The only other thing I can think of probably wouldn't be reliable: you
could put the NFS resource in a group with an ocf:pacemaker:attribute
resource. That way, whenever NFS is started, a node attribute will be
set to an "active" value, and whenever NFS is stopped, it will be set
back to an "inactive" value. Then you can write a rule using that
attribute. For example, you could make the dependent resource's
is-managed meta-attribute depend on the node attribute's value. The
reason I think it wouldn't be reliable is that if NFS failed, some time
would pass before the cluster stopped the NFS resource and updated the
node attribute, and the dependent resource's monitor could still run
during that window. But it would at least diminish the problem space.
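Roughly, that could look like the following (an untested sketch; all
names are invented, and the rule fragment would have to go in via
cibadmin or "crm configure edit", since pcs has no direct syntax for
rule-based meta-attributes):

    # Group NFS with an attribute resource, so that a node attribute
    # named "nfs-active" tracks whether NFS is running on the node:
    pcs resource create nfs-flag ocf:pacemaker:attribute \
        name=nfs-active active_value=1 inactive_value=0
    pcs resource group add nfs-group nfs-server nfs-flag

    # CIB fragment on the dependent resource: unmanage it whenever
    # the attribute is not 1 (NFS down, or attribute never yet set):
    <meta_attributes id="bigapp-meta-nfs-down">
      <rule id="bigapp-nfs-down" score="INFINITY" boolean-op="or">
        <expression id="bigapp-nfs-down-ne"
                    attribute="nfs-active" operation="ne" value="1"/>
        <expression id="bigapp-nfs-down-undef"
                    attribute="nfs-active" operation="not_defined"/>
      </rule>
      <nvpair id="bigapp-unmanaged" name="is-managed" value="false"/>
    </meta_attributes>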

Probably any dynamic solution would have a similar race condition --
NFS will in reality already be down for some amount of time before the
cluster detects the failure, so no mechanism could prevent the monitor
from running during that window.

> And before you ask: No, I did not write the RA that has this
> problem; a multi-million-dollar company wrote it. (Years before, I
> had written a monitor for HP-UX's cluster that did not have this
> problem, even though the configuration files were read from NFS.
> It's not magic: just periodically copy them to shared memory, and
> read the config from there.)
> 
> Regards,
> Ulrich
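(As an aside, the shared-memory trick Ulrich describes could be
approximated on Linux with a periodic copy into tmpfs -- a hedged
sketch, all paths invented:

    # From cron or a systemd timer: refresh a tmpfs copy of the
    # NFS-hosted config, so readers never touch NFS directly.
    # If NFS hangs, this copier hangs, but readers keep the last
    # good copy, because mv replaces the file atomically.
    cp /nfs/app/app.conf /dev/shm/app.conf.tmp \
        && mv /dev/shm/app.conf.tmp /dev/shm/app.conf

The monitor would then read /dev/shm/app.conf and never block on NFS.)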
-- 
Ken Gaillot <kgaillot at redhat.com>


