[ClusterLabs] Antw: Re: Antw: Re: CRM location specification and errors

Fri Jun 26 08:52:14 UTC 2015

>>> Dejan Muhamedagic <dejanmm at fastmail.fm> wrote on 26.06.2015 at 10:20 in
message <20150626082017.GA14759 at walrus.homenet>:

[...]
>> First, I think if a resource cannot be started on a node it's better to return
>> OCF_ERR_INSTALLED rather than OCF_NOT_RUNNING, because it does not make any
>> sense to try to start the resource on that particular node. Then, how would you
> 
> It is not always that simple. The part deemed not installed
> at the probes time may reappear later. For instance, some
> deployments have software on an NFS mount (I can recall that it
> was the case with SAP) and that NFS mount may not be available at
> the time.

Well, if you use NFS-client-mounted filesystems in resources, the NFS client should be started before the cluster. Once (hard) mounted, the filesystems should stay there, even if the server fails. If the server is unavailable at mount time and you are using background mounts, you may be right (the mounted filesystems might appear later).
If you are providing and using NFS in the same cluster (maybe especially when providing /home), things may become tricky...

[A very annoying thing with SAP is its extensive use of NFS: if there is an NFS problem, the RAs time out and the cluster thinks the system is not running. Worse, a restart of the service will just hang (while waiting for NFS access). On top of that, the cluster schedules node fencing (killing more resources) when the stop times out...]

Still, I think it's better to return "not installed" if it's clear the resource won't start (or stop), and to use a "reprobe" at any later time if things have changed, rather than reporting "not running" and causing multiple start attempts that will surely fail. Opinions may vary; this is mine...
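
To make the distinction concrete, here is a minimal sketch of a probe that reports "not installed" when the software is plainly absent and "not running" only when it is present but inactive. The numeric values are the standard OCF exit codes; the function name probe() and its binary_path and pid_file parameters are hypothetical, chosen only for illustration, and a real RA would do more checks than this.

```python
import os

# Exit codes as defined by the OCF resource agent API.
OCF_SUCCESS = 0
OCF_ERR_INSTALLED = 5
OCF_NOT_RUNNING = 7


def probe(binary_path, pid_file):
    """Hypothetical probe: decide between 'not installed' and 'not running'."""
    # No executable on this node: a start attempt here cannot succeed.
    if not os.access(binary_path, os.X_OK):
        return OCF_ERR_INSTALLED

    # Software is present; a missing or unreadable PID file means "not running".
    try:
        with open(pid_file) as f:
            pid = int(f.read().strip())
    except (OSError, ValueError):
        return OCF_NOT_RUNNING
    if pid <= 0:
        return OCF_NOT_RUNNING

    # Signal 0 only tests whether the process exists; nothing is delivered.
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return OCF_NOT_RUNNING
    except PermissionError:
        pass  # the process exists but belongs to another user
    return OCF_SUCCESS
```

If the software shows up later (say, after a background NFS mount completes), a reprobe would run the same check again and get a different answer.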

> 
> So, it's safer to return OCF_NOT_RUNNING and that is what quite a
> few RA do.

I don't know whether it is "safer", but it's simpler for sure.

[I once wrote a monitor for SAP that carefully avoided accessing NFS (actually, it used asynchronous sub-processes to read from NFS into shared memory). It reported three states (on a different cluster system): not running, unknown, running.
That was especially important for slow systems, because a transition from "not running" to "unknown" might mean "starting", while a transition from "running" to "unknown" might mean "stopping". In the case of "unknown", my agent simply waited, rechecking until some timeout... This was definitely not simple, but being the result of several years of evolution (and system failures), it was quite "safe".]
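
The three-state idea can be sketched roughly as follows. This is not the original agent, only an illustration of the logic described above: all names (State, interpret, monitor, and the check callable) are made up for the example, and the NFS-avoiding sub-process part is left out entirely.

```python
import time
from enum import Enum


class State(Enum):
    NOT_RUNNING = "not running"
    UNKNOWN = "unknown"
    RUNNING = "running"


def interpret(previous, current):
    """Guess what a move into 'unknown' means from the previous state."""
    if current is State.UNKNOWN:
        if previous is State.NOT_RUNNING:
            return "starting"
        if previous is State.RUNNING:
            return "stopping"
    return current.value


def monitor(check, timeout, interval=1.0, sleep=time.sleep):
    """Call check() repeatedly while it reports UNKNOWN, until the timeout.

    check is a hypothetical callable returning a State; it must not block.
    """
    deadline = time.monotonic() + timeout
    while True:
        state = check()
        if state is not State.UNKNOWN:
            return state  # a definite answer, report it at once
        if time.monotonic() >= deadline:
            return State.UNKNOWN  # still undecided after the timeout
        sleep(interval)
```

The point of the design is that "unknown" is never mapped to "not running" right away, so a slow start or stop does not trigger a recovery action on its own.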

Regards,
Ulrich