[ClusterLabs] Q: monitor and probe result codes and consequences

Thu May 12 10:41:45 EDT 2016

On 05/12/2016 02:56 AM, Ulrich Windl wrote:
> Hi!
> 
> I have a question regarding an RA I wrote myself, running under pacemaker 1.1.12-f47ea56 (SLES11 SP4):
> 
> During "probe", all resources' "monitor" actions are executed (regardless of any ordering constraints). As a result, my RA treats a parameter as invalid ("file does not exist"; the file will be provided once some supplying resource is up) and returns rc=2.
> OK, this may not be optimal, but pacemaker makes it worse: it does not repeat the probe when the resource is about to start, but keeps the recorded state, which prevents the resource from starting:
> 
>  primitive_monitor_0 on h05 'invalid parameter' (2): call=73, status=complete, exit-reason='none', last-rc-change='Wed May 11 17:03:39 2016', queued=0ms, exec=82ms

Correct, OCF_ERR_CONFIGURED is a "fatal" error:

http://clusterlabs.org/doc/en-US/Pacemaker/1.1-pcs/html-single/Pacemaker_Explained/index.html#_how_are_ocf_return_codes_interpreted

> So you would say that monitor may only return "success" or "not running", but I feel the RA should detect the condition that the resource could not run at all in the present state.

OCF_ERR_CONFIGURED is meant to indicate that the resource could not
possibly run *as configured*, regardless of the system's current state.
So for example, a required parameter is missing or invalid.

You could possibly use OCF_ERR_ARGS in this case (a "hard" error that
bans the particular node, and means that the resource's configuration is
not valid on this particular node).
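
For reference, here is a rough sketch that maps the return codes discussed
here to their names and recovery types. The numbers are the standard OCF
values; the soft/hard/fatal labels follow the table in the Pacemaker
Explained section linked above, as I read it. The function name
describe_ocf_rc is made up for illustration and is not part of
ocf-shellfuncs.

```shell
# Hypothetical helper: print the OCF name and Pacemaker recovery type for
# a numeric return code. Only the codes relevant to this thread are listed.
describe_ocf_rc() {
    case "$1" in
        1) echo "OCF_ERR_GENERIC soft" ;;       # retry, possibly elsewhere
        2) echo "OCF_ERR_ARGS hard" ;;          # ban this node only
        5) echo "OCF_ERR_INSTALLED hard" ;;     # ban this node only
        6) echo "OCF_ERR_CONFIGURED fatal" ;;   # do not start anywhere
        7) echo "OCF_NOT_RUNNING n/a" ;;        # cleanly stopped, no failure
        *) echo "unknown unknown" ;;
    esac
}
```

For example, "describe_ocf_rc 2" prints "OCF_ERR_ARGS hard", which lines up
with the 'invalid parameter' (2) text in the status line quoted above.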

But I suspect the right answer here is simply an order constraint
between the supplying resource and this resource. This resource's start
action, not monitor, should be the one that checks for the existence of
the supplied file.
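
A minimal sketch of that split, assuming the supplied file is named by a
parameter. OCF_RESKEY_datafile, RA_DEMO_ACTIVE, ra_monitor and ra_start
are invented for this example and are not taken from your RA; returning
OCF_ERR_GENERIC from start is just one plausible choice. The functions
return instead of exiting so they can be tried outside a cluster.

```shell
# Standard OCF return codes (a real RA gets these from ocf-shellfuncs).
: "${OCF_SUCCESS:=0}"
: "${OCF_ERR_GENERIC:=1}"
: "${OCF_NOT_RUNNING:=7}"

# Hypothetical parameter: the file provided by the supplying resource.
: "${OCF_RESKEY_datafile:=/nonexistent/supplied.file}"

# Hypothetical stand-in for the RA's real check_resource.
check_resource() {
    [ -n "${RA_DEMO_ACTIVE:-}" ]
}

ra_monitor() {
    # Monitor only reports whether the service is active. Without the
    # supplied file the service cannot be running, so say "not running"
    # rather than raising a configuration error during a probe.
    [ -e "$OCF_RESKEY_datafile" ] || return "$OCF_NOT_RUNNING"
    check_resource || return "$OCF_NOT_RUNNING"
    return "$OCF_SUCCESS"
}

ra_start() {
    # With an order constraint in place the supplier is already up when
    # start runs, so a missing file here is a genuine failure.
    [ -e "$OCF_RESKEY_datafile" ] || return "$OCF_ERR_GENERIC"
    RA_DEMO_ACTIVE=1    # placeholder for actually starting the service
    return "$OCF_SUCCESS"
}
```

With this layout, a probe on a node where the file is absent yields
OCF_NOT_RUNNING (7), and the file check only turns into an error at start.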

> Shouldn't pacemaker reprobe resources before it tries to start them?

Probes are meant to check whether the resource is already active
anywhere. The decision of whether and where to start the resource takes
into account the result of the probes, so it doesn't make sense to
re-probe -- that's what the initial probe was for.
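
For context, a probe reaches the RA as a monitor action with an interval
of 0, which is what the ocf_is_probe helper you ask about tests, as far as
I know. The sketch below uses a local stand-in (is_probe) so that it runs
without ocf-shellfuncs; validate_params, service_is_active,
OCF_RESKEY_datafile and RA_DEMO_ACTIVE are hypothetical placeholders for
your RA's own logic. Treat it as one possible pattern, not a
recommendation over the order constraint.

```shell
# Standard OCF return codes (a real RA gets these from ocf-shellfuncs).
: "${OCF_SUCCESS:=0}"
: "${OCF_ERR_ARGS:=2}"
: "${OCF_NOT_RUNNING:=7}"

# Local stand-in for ocf_is_probe: a probe is a monitor with interval 0.
is_probe() {
    [ "${__OCF_ACTION:-}" = "monitor" ] &&
        [ "${OCF_RESKEY_CRM_meta_interval:-}" = "0" ]
}

# Hypothetical stand-ins for the RA's validate and check_resource.
validate_params() {
    [ -e "${OCF_RESKEY_datafile:-}" ] || return "$OCF_ERR_ARGS"
}
service_is_active() {
    [ -n "${RA_DEMO_ACTIVE:-}" ]
}

probe_aware_monitor() {
    rc=0
    validate_params || rc=$?
    if [ "$rc" -ne 0 ]; then
        # During a probe, failed validation is reported as "not active
        # here"; a recurring monitor still reports the validation error.
        if is_probe; then
            return "$OCF_NOT_RUNNING"
        fi
        return "$rc"
    fi
    service_is_active || return "$OCF_NOT_RUNNING"
    return "$OCF_SUCCESS"
}
```

Called with __OCF_ACTION=monitor and OCF_RESKEY_CRM_meta_interval=0 while
the file is missing, this returns OCF_NOT_RUNNING (7); with a nonzero
interval it returns the validation error instead.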

> Previously, my RA had passed all the ocf-tester checks, so this situation is hard to test (unless you have a test cluster you can restart at any time).
> 
> (After manual resource cleanup the resource started as usual)
> 
> My monitor uses the following logic:
> ---
>     monitor|status)
>         if validate; then
>             set_variables
>             check_resource || exit $OCF_NOT_RUNNING
>             status=$OCF_SUCCESS
>         else # cannot check status with invalid parameters
>             status=$?
>         fi
>         exit $status
>         ;;
> ---
> 
> Should I mess with ocf_is_probe?
> 
> Regards,
> Ulrich