[ClusterLabs] Antw: Re: Q: ordering for a monitoring op only?

Tue Aug 21 05:49:37 UTC 2018

>>> Ken Gaillot <kgaillot at redhat.com> wrote on 20.08.2018 at 16:49 in
message
<1534776566.6465.5.camel at redhat.com>:
> On Mon, 2018-08-20 at 10:51 +0200, Ulrich Windl wrote:
>> Hi!
>> 
>> I wonder whether it's possible to run a monitoring op only if some
>> specific resource is up.
>> Background: We have some resource that runs fine without NFS, but the
>> start, stop and monitor operations will just hang if NFS is down. In
>> effect the monitor operation will time out, the cluster will try to
>> recover, calling the stop operation, which in turn will time out,
>> making things worse (i.e.: causing a node fence).
>> 
>> So my idea was to pause the monitoring operation while NFS is down
>> (NFS itself is controlled by the cluster and should recover "rather
>> soon" TM).
>> 
>> Is that possible?
> 
> A possible mitigation would be to set on-fail=block on the dependent
> resource monitor, so if NFS is down, the monitor will still time out,
> but the cluster will not try to stop it. Of course then you lose the
> ability to automatically recover from an actual resource failure.
> 
> The only other thing I can think of probably wouldn't be reliable: you
> could put the NFS resource in a group with an ocf:pacemaker:attribute
> resource. That way, whenever NFS is started, a node attribute will be
> set, and whenever NFS is stopped, the attribute will be unset. Then,
> you can set a rule using that attribute. For example you could make the
> dependent resource's is-managed property depend on the node attribute
> value. The reason I think it wouldn't be reliable is that if NFS
> failed, there would be some time before the cluster stopped the NFS
> resource and updated the node attribute, and the dependent resource
> monitor could run during that time. But it would at least diminish the
> problem space.

Hi!

That sounds interesting, even though it's still a work-around and not the
solution for the original problem. Could you show a sketch of the mechanism:
How to set the attribute with the resource, and how to make the monitor
operation depend on it?
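
For readers of the archive, a rough sketch of the mechanism described above
could look like the CIB fragment below. It is an illustration under
assumptions, not a tested configuration: all ids, the Filesystem parameters
(server:/export, /mnt/nfs), the attribute name "nfs-up" and the dependent
resource's agent (vendor:app) are placeholders that do not come from the
thread. Whether node attribute expressions are honored in meta-attribute
rules should be verified against the Pacemaker version in use.

```xml
<!-- Sketch only: all ids, the Filesystem parameters and the "vendor:app"
     agent are placeholders, not taken from the thread. -->
<resources>
  <!-- Group members start in order and stop in reverse order, so the
       attribute resource starts after NFS and stops before it. -->
  <group id="nfs-group">
    <primitive id="nfs-mount" class="ocf" provider="heartbeat" type="Filesystem">
      <instance_attributes id="nfs-mount-params">
        <nvpair id="nfs-mount-device" name="device" value="server:/export"/>
        <nvpair id="nfs-mount-directory" name="directory" value="/mnt/nfs"/>
        <nvpair id="nfs-mount-fstype" name="fstype" value="nfs"/>
      </instance_attributes>
    </primitive>
    <!-- Sets the node attribute "nfs-up" to 1 on start and to 0 on stop. -->
    <primitive id="nfs-flag" class="ocf" provider="pacemaker" type="attribute">
      <instance_attributes id="nfs-flag-params">
        <nvpair id="nfs-flag-name" name="name" value="nfs-up"/>
        <nvpair id="nfs-flag-active" name="active_value" value="1"/>
        <nvpair id="nfs-flag-inactive" name="inactive_value" value="0"/>
      </instance_attributes>
    </primitive>
  </group>

  <!-- Hypothetical dependent resource. -->
  <primitive id="app" class="ocf" provider="vendor" type="app">
    <!-- Intended to apply only on nodes where nfs-up is not 1. -->
    <meta_attributes id="app-unmanaged-without-nfs">
      <rule id="app-nfs-down-rule">
        <expression id="app-nfs-down-expr" attribute="nfs-up"
                    operation="ne" value="1"/>
      </rule>
      <nvpair id="app-is-managed" name="is-managed" value="false"/>
    </meta_attributes>
    <operations>
      <op id="app-monitor" name="monitor" interval="60s" timeout="30s"/>
    </operations>
  </primitive>
</resources>
```

Note that an unmanaged resource is normally still monitored; the rule does
not pause the monitor itself, it only keeps the cluster from reacting to a
failed or timed-out monitor with a stop while the attribute says NFS is down.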

> 
> Probably any dynamic solution would have a similar race condition --
> the NFS will be failed in reality for some amount of time before the
> cluster detects the failure, so the cluster could never prevent the
> monitor from running during that window.

I agree completely.

Regards,
Ulrich

> 
>> And before you ask: No, I have not written that RA that has the
>> problem; a multi-million-dollar company wrote it. (Years before, I had
>> written a monitor for HP-UX's cluster that did not have this problem,
>> even though the configuration files were read from NFS. It's not
>> magic: Just periodically copy them to shared memory, and read the
>> config from shared memory.)
>> 
>> Regards,
>> Ulrich
> -- 
> Ken Gaillot <kgaillot at redhat.com>