[ClusterLabs] Antw: Re: Q: monitor and probe result codes and consequences

Fri May 13 10:16:26 UTC 2016

Hi,

On Fri, May 13, 2016 at 09:05:54AM +0200, Ulrich Windl wrote:
> >>> Ken Gaillot <kgaillot at redhat.com> wrote on 12.05.2016 at 16:41 in message
> <57349629.40408 at redhat.com>:
> > On 05/12/2016 02:56 AM, Ulrich Windl wrote:
> >> Hi!
> >> 
> >> I have a question regarding an RA written by myself and pacemaker
> >> 1.1.12-f47ea56 (SLES11 SP4):
> >> 
> >> During "probe" all resources' "monitor" actions are executed (regardless of
> >> any ordering constraints). Therefore my RA considers a parameter as invalid
> >> ("file does not exist") (the file will be provided once some supplying
> >> resource is up) and returns rc=2.
> >> OK, this may not be optimal, but pacemaker makes it worse: It does not
> >> repeat the probe once the resource would start, but keeps the state,
> >> preventing a resource start:
> >> 
> >>  primitive_monitor_0 on h05 'invalid parameter' (2): call=73,
> >>  status=complete, exit-reason='none', last-rc-change='Wed May 11 17:03:39 2016',
> >>  queued=0ms, exec=82ms
> > 
> > Correct, OCF_ERR_CONFIGURED is a "fatal" error:
> > 
> > http://clusterlabs.org/doc/en-US/Pacemaker/1.1-pcs/html-single/Pacemaker_Explained/index.html#_how_are_ocf_return_codes_interpreted
> 
> I think the mistake here is assuming that OCF_ERR_CONFIGURED only depends on the RA parameters, when in fact the validity of RA params may depend on the environment found at the time of checking. And as we all know, the environment changes, especially when resources are started and stopped.
> >> So you would say that monitor may only return "success" or "not running",
> >> but I feel the RA should detect the condition that the resource could not run
> >> at all at the present state.
> > 
> > OCF_ERR_CONFIGURED is meant to indicate that the resource could not
> > possibly run *as configured*, regardless of the system's current state.
> 
> But how do you handle parameters that describe file or host names (which may exist or not independently of a change in the param's value)?

The RA should've exited with OCF_ERR_INSTALLED. That's the code
which means that there's something wrong with the environment on
this node, but that the resource could be started on another one.

> > So for example, a required parameter is missing or invalid.
> > 
> > You could possibly use OCF_ERR_ARGS in this case (a "hard" error that
> > bans the particular node, and means that the resource's configuration is
> > not valid on this particular node).
> 
> ("rc=2" _is_ OCF_ERR_ARGS)
> 
> > 
> > But, I suspect the right answer here is simply an order constraint
> > between the supplying resource and this resource. This resource's start
> 
> As I said before the problem is probes: They are all started immediately when a node comes up. And if a probe fails, no start is ever attempted later.
> AFAIK probes ignore any colocation or ordering constraints.

Try with ocf_is_probe. On probes, if the file's missing, exit
with OCF_NOT_RUNNING.
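
Something along these lines, as a rough sketch only. The parameter
name OCF_RESKEY_datafile and the function datafile_monitor are made
up for illustration. In a real RA the return codes and ocf_is_probe
come from ocf-shellfuncs; the fallbacks here are only so that the
snippet runs on its own, and the stand-in assumes that a probe is a
monitor with interval 0.

```shell
#!/bin/sh
# Fallbacks so the sketch runs without ocf-shellfuncs (which a real RA sources).
: "${OCF_SUCCESS:=0}"
: "${OCF_ERR_INSTALLED:=5}"
: "${OCF_NOT_RUNNING:=7}"

# Stand-in for ocf_is_probe, defined only if the real one is not available.
# Assumption: a probe is a monitor action with interval 0.
if ! command -v ocf_is_probe >/dev/null 2>&1; then
    ocf_is_probe() {
        [ "$__OCF_ACTION" = "monitor" ] &&
            [ "$OCF_RESKEY_CRM_meta_interval" = "0" ]
    }
fi

# Hypothetical parameter: OCF_RESKEY_datafile is the file that some
# other (supplying) resource provides once it is up.
datafile_monitor() {
    if [ ! -e "$OCF_RESKEY_datafile" ]; then
        if ocf_is_probe; then
            # The supplier may simply not be started yet on this node.
            return "$OCF_NOT_RUNNING"
        fi
        # Outside a probe the file is expected to be there.
        return "$OCF_ERR_INSTALLED"
    fi
    # A real RA would check the service itself at this point.
    return "$OCF_SUCCESS"
}
```

The monitor action would then end with something like
"datafile_monitor; exit $?".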

> > action, not monitor, should be the one that checks for the existence of
> > the supplied file.
> 
> So it all depends on what an invalid parameter (in the sense of validate-all) actually is. Maybe the documentation should be clearer about that.
> (As a matter of fact, the only RA docs I found yesterday are still hosted at linux-ha (from linbit))

The developer's guide does have a note about probes and
ocf_is_probe and even some sample code (see the monitor and
validate-all actions).

> >> Shouldn't pacemaker reprobe resources before it tries to start them?
> > 
> > Probes are meant to check whether the resource is already active
> > anywhere. The decision of whether and where to start the resource takes
> > into account the result of the probes, so it doesn't make sense to
> > re-probe -- that's what the initial probe was for.
> 
> In case a _probe_ returned an error indicating it was not able to decide, it would make sense to reprobe after colocation and ordering constraints are fulfilled.
> 
> One of the first lessons I've learned with HA was that the state of a resource is not Boolean (up or down), but multistate (I wrote on that after I had started to use pacemaker):
> A resource can be "down", "starting", "started/up", "stopping", "down/stopped" and "undecided" (the case where retries can make sense). During states "starting" and "stopping" it does not make sense to enforce a Boolean result like "up" or "down", because the resource may fail to perform the transition, and the resource may hang in the state of "starting" or "stopping" (which both are definitely not "up").
> If you need a specific directory to decide whether a specific resource is up or down, and that directory does not exist, you cannot really make a statement about the resource's state: Most likely it won't be "up", but you cannot say for sure that it is down (e.g. you have a state file on NFS and NFS server is down at the time of checking).
> 

The transitioning period (i.e. start, stop, promote, demote) for
the resource may take a while, depending on the resource and the
actual deployment, which is why setting the timeouts properly is
important. Note that probes are meant to run mostly on node
startup, that is to establish the initial state.

If the RA, on probe, cannot decide the state the resource is in,
then it should return OCF_ERR_GENERIC and let the resource manager
clean up (by invoking stop). Furthermore, the RA must take into
account that some of its dependencies may not be running.
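
For the state-file example mentioned above, that could look roughly
like the following. This is only a sketch under assumptions: the
parameter OCF_RESKEY_statedir, the file name "running" and the
function state_monitor are invented for illustration, and the return
code fallbacks stand in for what ocf-shellfuncs normally provides.

```shell
#!/bin/sh
# Fallbacks so the sketch runs without ocf-shellfuncs (which a real RA sources).
: "${OCF_SUCCESS:=0}"
: "${OCF_ERR_GENERIC:=1}"
: "${OCF_NOT_RUNNING:=7}"

# Hypothetical layout: while active, the resource keeps a file called
# "running" in the directory named by OCF_RESKEY_statedir.
state_monitor() {
    if [ ! -d "$OCF_RESKEY_statedir" ]; then
        # Without the directory the state cannot be determined, so
        # report a generic error and leave recovery to the cluster.
        return "$OCF_ERR_GENERIC"
    fi
    if [ -e "$OCF_RESKEY_statedir/running" ]; then
        return "$OCF_SUCCESS"
    fi
    return "$OCF_NOT_RUNNING"
}
```

Whether a missing directory really means "undecided" rather than
"not running" depends on the resource; the sketch just takes the
cautious branch.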

HTH,

Dejan

> >> Before this, my RA had passed all the ocf-tester checks, so this situation is hard
> >> to test (unless you have a test cluster you can restart any time).
> >> 
> >> (After manual resource cleanup the resource started as usual)
> >> 
> >> My monitor uses the following logic:
> >> ---
> >>     monitor|status)
> >>         if validate; then
> >>             set_variables
> >>             check_resource || exit $OCF_NOT_RUNNING
> >>             status=$OCF_SUCCESS
> >>         else # cannot check status with invalid parameters
> >>             status=$?
> >>         fi
> >>         exit $status
> >>         ;;
> >> ---
> >> 
> >> Should I mess with ocf_is_probe?
> >> 
> >> Regards,
> >> Ulrich
> > 
> > _______________________________________________
> > Users mailing list: Users at clusterlabs.org 
> > http://clusterlabs.org/mailman/listinfo/users 
> > 
> > Project Home: http://www.clusterlabs.org 
> > Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf 
> > Bugs: http://bugs.clusterlabs.org 