[ClusterLabs] Antw: Re: Antw: Re: Q: monitor and probe result codes and consequences

Fri May 13 11:00:49 UTC 2016

>>> Dejan Muhamedagic <dejanmm at fastmail.fm> wrote on 13.05.2016 at 12:16 in
message <20160513101626.GA12493 at walrus.homenet>:
> Hi,
> 
> On Fri, May 13, 2016 at 09:05:54AM +0200, Ulrich Windl wrote:
>> >>> Ken Gaillot <kgaillot at redhat.com> wrote on 12.05.2016 at 16:41 in message
>> <57349629.40408 at redhat.com>:
>> > On 05/12/2016 02:56 AM, Ulrich Windl wrote:
>> >> Hi!
>> >> 
>> >> I have a question regarding an RA written by myself and pacemaker
>> >> 1.1.12-f47ea56 (SLES11 SP4):
>> >> 
>> >> During "probe" all resources' "monitor" actions are executed (regardless
>> >> of any ordering constraints). Therefore my RA considers a parameter as
>> >> invalid ("file does not exist") (the file will be provided once some
>> >> supplying resource is up) and returns rc=2.
>> >> OK, this may not be optimal, but pacemaker makes it worse: It does not
>> >> repeat the probe once the resource would start, but keeps the state,
>> >> preventing a resource start:
>> >> 
>> >>  primitive_monitor_0 on h05 'invalid parameter' (2): call=73,
>> >>  status=complete, exit-reason='none',
>> >>  last-rc-change='Wed May 11 17:03:39 2016', queued=0ms, exec=82ms
>> > 
>> > Correct, OCF_ERR_CONFIGURED is a "fatal" error:
>> > 
>> > 
>> > http://clusterlabs.org/doc/en-US/Pacemaker/1.1-pcs/html-single/Pacemaker_Explained/index.html#_how_are_ocf_return_codes_interpreted
>> 
>> I think the mistake here is assuming that OCF_ERR_CONFIGURED only depends on
>> the RA parameters, when in fact the validity of RA params may depend on the
>> environment found at the time of checking. And as we all know, the
>> environment changes, especially when resources are started and stopped.
>> 
>> >> So you would say that monitor may only return "success" or "not running",
>> >> but I feel the RA should detect the condition that the resource could not
>> >> run at all in the present state.
>> > 
>> > OCF_ERR_CONFIGURED is meant to indicate that the resource could not
>> > possibly run *as configured*, regardless of the system's current state.
>> 
>> But how do you handle parameters that describe file or host names (which may
>> exist or not independently of a change in the param's value)?
> 
> The RA should've exited with OCF_ERR_INSTALLED. That's the code
> which means that there's something wrong with the environment on
> this node, but that the resource could be started on another one.

Really, apart from the implementation, I don't see why OCF_ERR_INSTALLED is less permanent than OCF_ERR_ARGS.
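
For reference, a small shell sketch of how the numeric exit codes map to the recovery classes listed in the Pacemaker Explained table linked earlier in this thread. The helper name `ocf_rc_class` is made up for illustration and is not part of ocf-shellfuncs; the classes are taken from that documentation, and the behaviour of a particular Pacemaker version should be checked against its own docs:

```shell
#!/bin/sh
# Illustrative helper (hypothetical, not part of ocf-shellfuncs): map an OCF
# exit code to the recovery class given in Pacemaker Explained.
ocf_rc_class() {
    case "$1" in
        0) echo "success" ;;      # OCF_SUCCESS
        1) echo "soft" ;;         # OCF_ERR_GENERIC: retry, possibly on the same node
        2|3|4|5) echo "hard" ;;   # OCF_ERR_ARGS, OCF_ERR_UNIMPLEMENTED,
                                  # OCF_ERR_PERM, OCF_ERR_INSTALLED: node is ruled out
        6) echo "fatal" ;;        # OCF_ERR_CONFIGURED: no node may run the resource
        7) echo "not running" ;;  # OCF_NOT_RUNNING: normal result of a clean probe
        *) echo "unknown" ;;
    esac
}

ocf_rc_class 2   # OCF_ERR_ARGS
ocf_rc_class 5   # OCF_ERR_INSTALLED
```

In that table both codes fall into the same "hard" class, so the difference between them is one of meaning rather than of how long the failure sticks.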

> 
>> > So for example, a required parameter is missing or invalid.
>> > 
>> > You could possibly use OCF_ERR_ARGS in this case (a "hard" error that
>> > bans the particular node, and means that the resource's configuration is
>> > not valid on this particular node).
>> 
>> ("rc=2" _is_ OCF_ERR_ARGS)
>> 
>> > 
>> > But, I suspect the right answer here is simply an order constraint
>> > between the supplying resource and this resource. This resource's start
>> 
>> As I said before, the problem is probes: They are all started immediately
>> when a node comes up. And if a probe fails, no start is ever attempted later.
>> AFAIK probes ignore any colocation or ordering constraints.
> 
> Try with ocf_is_probe. On probes, if the file's missing, exit
> with OCF_NOT_RUNNING.

I did that in the meantime, but I consider it a work-around. My gut feeling is that each RA should have a separate "probe" action with the following semantics:
IF parameters seem valid AND the resource is up RETURN success
ELSE RETURN not running.
Specifically this means any invalid parameter will result in "NOT RUNNING".
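
A minimal shell sketch of that work-around, assuming an agent that depends on a single file parameter. The parameter names `OCF_RESKEY_datafile` and `OCF_RESKEY_pidfile`, the functions `my_ra_monitor` and `my_ra_is_running`, and the pid-file check are hypothetical. `ocf_is_probe` and the OCF_* codes normally come from sourcing ocf-shellfuncs; they are stubbed here only so the fragment runs standalone:

```shell
#!/bin/sh
# Stand-ins for what ocf-shellfuncs normally provides (assumption: not sourced).
: "${OCF_SUCCESS:=0}" "${OCF_ERR_INSTALLED:=5}" "${OCF_NOT_RUNNING:=7}"
if ! command -v ocf_is_probe >/dev/null 2>&1; then
    # A probe is a monitor call with interval 0.
    ocf_is_probe() {
        [ "${__OCF_ACTION:-}" = "monitor" ] &&
            [ "${OCF_RESKEY_CRM_meta_interval:-0}" = "0" ]
    }
fi

# Hypothetical status check: "up" means the pid file names a live process.
my_ra_is_running() {
    [ -r "${OCF_RESKEY_pidfile:-}" ] &&
        kill -0 "$(cat "$OCF_RESKEY_pidfile")" 2>/dev/null
}

my_ra_monitor() {
    if [ ! -e "${OCF_RESKEY_datafile:-}" ]; then
        # During a probe the supplying resource may simply not be up yet,
        # so a missing file is reported as "not running", not as an error.
        if ocf_is_probe; then
            return "$OCF_NOT_RUNNING"
        fi
        # Outside a probe, report a problem with this node's environment.
        return "$OCF_ERR_INSTALLED"
    fi
    if my_ra_is_running; then
        return "$OCF_SUCCESS"
    fi
    return "$OCF_NOT_RUNNING"
}
```

With this shape the probe never reports a hard error for a dependency that is merely absent, while a recurring monitor still flags the missing file.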

I'm not saying that a separate "probe" would make things better; it's just clear that probes may be called from an "invalid context" (as I call it).

> 
>> > action, not monitor, should be the one that checks for the existence of
>> > the supplied file.
>> 
>> So it all depends on what an invalid parameter (in the sense of validate-all)
>> actually is. Maybe the documentation should be more clear about that.
>> (As a matter of fact, the only RA docs I found yesterday are still hosted at
>> linux-ha (from linbit))
> 
> The developer's guide does have a note about probes and
> ocf_is_probe and even some sample code (see the monitor and
> validate-all actions).
> 
>> >> Shouldn't pacemaker reprobe resources before it tries to start them?
>> > 
>> > Probes are meant to check whether the resource is already active
>> > anywhere. The decision of whether and where to start the resource takes
>> > into account the result of the probes, so it doesn't make sense to
>> > re-probe -- that's what the initial probe was for.
>> 
>> In case a _probe_ returned an error indicating it was not able to decide, it
>> would make sense to reprobe after colocation and ordering constraints are
>> fulfilled.
>> 
>> One of the first lessons I've learned with HA was that the state of a
>> resource is not Boolean (up or down), but multistate (I wrote on that after I
>> had started to use pacemaker):
>> A resource can be "down", "starting", "started/up", "stopping",
>> "down/stopped" and "undecided" (the case where retries can make sense).
>> During states "starting" and "stopping" it does not make sense to enforce a
>> Boolean result like "up" or "down", because the resource may fail to perform
>> the transition, and the resource may hang in the state of "starting" or
>> "stopping" (which both are definitely not "up").
>> If you need a specific directory to decide whether a specific resource is up
>> or down, and that directory does not exist, you cannot really make a
>> statement about the resource's state: Most likely it won't be "up", but you
>> cannot say for sure that it is down (e.g. you have a state file on NFS and
>> the NFS server is down at the time of checking).
>> 
> 
> The transitioning period (i.e. start, stop, promote, demote) for
> the resource may take a while, depending on the resource and the
> actual deployment, which is why setting the timeouts properly is
> important. Note that probes are meant to run mostly on node
> startup, that is to establish the initial state.
> 
> If the RA, on probe, cannot decide the state the resource is in,
> then it should return OCF_ERR_GENERIC and let the resource manager
> clean up (by invoking stop). Furthermore, the RA must take into
> account that some of its dependencies may not be running.

But if the problem is due to "invalid calling context", then action "stop" will have the same problem (and return an error as well). STONITH deathmatch then?

Regards,
Ulrich
[...]