[ClusterLabs] Antw: Re: Antw: Re: Resource won't start, crm_resource -Y does not help

Ulrich Windl Ulrich.Windl at rz.uni-regensburg.de
Tue Jul 23 01:57:09 EDT 2019


>>> Ken Gaillot <kgaillot at redhat.com> wrote on 22.07.2019 at 18:14 in message
<e78507363a1bdc8c718d96ba5a339ae31c3e59d1.camel at redhat.com>:
> On Mon, 2019-07-22 at 15:45 +0200, Ulrich Windl wrote:
>> Hi!
>> 
>> My RA actually sends OCF_ERR_ARGS if checking the args detects a
>> problem.
>> But as the error can be resolved sometimes without changing the args
>> (eg.
>> providing some resource by other means), I suspect CRM does not
>> handle that
>> properly. Even after a resource cleanup.
>> 
>> My RA logs any parameter check, and I can see that no parameter check
>> is being
>> performed...
>> 
>> I also noticed that the "invalid parameter" persists on a node even
>> after
>> restarting pacemaker on that node.
> 
> Pacemaker treats OCF_ERR_ARGS as a "hard" failure, meaning it won't be
> retried on the same node. But it should attempt to start on any other
> eligible nodes.

This makes _some_ sense: if the parameters are unacceptable (OCF_ERR_ARGS), there is no point in retrying (like having specified a host name that does not exist).
However, there are _two_ events that may change the state:

1) If the parameters (e.g. hostname) are changed

2) If the configuration outside the cluster is changed (e.g. making the hostname valid now)

In the light of 2), I don't see why a resource cleanup does not reset the error condition. That is really unexpected.

> 
> The failure should be cleared by either cleanup or pacemaker restart.

My impression was that a cleanup did not change the condition, but a cluster node restart did.

> That's the mystery here. I can't even imagine how it would be possible
> to survive a pacemaker restart -- are you sure it wasn't simply a new
> attempt getting the same result?

According to the logs of my RA there were fewer parameter checks than expected, and the only explanation I could find was that the result was cached somewhere.


> 
>> 
>> So:
>> # crm_resource -r prm_idredir_test -VV start
>>  warning: unpack_rsc_op_failure:        Processing failed start of
>> prm_idredir_test on h02: invalid parameter | rc=2
>> 
>> (Start was not even tried)
>> 
>> Eventually I was able to start the resource. Some other process had a
>> socket
>> address in use my resource needed...
> 
> Since you control the RA, you might want to set exit reasons, which
> will be shown in the status display (the exitreason='' in your output
> below). There's an ocf_exit_reason convenience function, e.g.
> 
>    ocf_exit_reason "Some other process has the socket address in use"
>    exit $OCF_ERR_ARGS

Oh, this must be rather new ;-)

Since when is that available?
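
For illustration, here is a minimal sketch of how a validation step in an RA might combine ocf_exit_reason with the return codes discussed above. It assumes one wants to reserve the "hard" OCF_ERR_ARGS for genuinely invalid parameters and report an environmental problem (such as a busy socket address) as OCF_ERR_GENERIC; whether Pacemaker then retries on the same node also depends on the cluster configuration. The function names validate_all and address_in_use, the parameter OCF_RESKEY_address, and the variable SKETCH_ADDRESS_BUSY are hypothetical. The fallback definition of ocf_exit_reason is only a stand-in for the case that ocf-shellfuncs has not been sourced, and its output format is an assumption.

```shell
#!/bin/sh
# Sketch only: everything except ocf_exit_reason and the OCF_* return
# codes is a hypothetical name chosen for this example.

# Standard OCF return codes (normally provided by ocf-shellfuncs).
: "${OCF_SUCCESS:=0}"
: "${OCF_ERR_GENERIC:=1}"
: "${OCF_ERR_ARGS:=2}"

# Stand-in for ocf_exit_reason when ocf-shellfuncs has not been sourced.
# Assumption: the real helper writes the reason to stderr with an
# "ocf-exit-reason:" prefix.
if ! command -v ocf_exit_reason >/dev/null 2>&1; then
    ocf_exit_reason() {
        echo "ocf-exit-reason:$*" >&2
    }
fi

# Hypothetical check: succeeds if something already uses the address.
# A real RA would query the system here; this stub reads a variable.
address_in_use() {
    [ "${SKETCH_ADDRESS_BUSY:-0}" = "1" ]
}

# Check parameters and environment separately, so that only a bad
# parameter yields the "hard" OCF_ERR_ARGS.
validate_all() {
    if [ -z "${OCF_RESKEY_address:-}" ]; then
        ocf_exit_reason "Parameter 'address' is not set"
        return "$OCF_ERR_ARGS"
    fi
    if address_in_use; then
        ocf_exit_reason "Some other process has the socket address in use"
        return "$OCF_ERR_GENERIC"
    fi
    return "$OCF_SUCCESS"
}
```

In a real agent the start action would call validate_all first and exit with its return code, so that the exit reason shows up in the status display.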

Regards,
Ulrich
 



