[ClusterLabs] Antw: Re: Antw: Re: Resource won't start, crm_resource -Y does not help

Tue Jul 23 09:47:55 EDT 2019

On Tue, 2019-07-23 at 07:57 +0200, Ulrich Windl wrote:
> > > > Ken Gaillot <kgaillot at redhat.com> wrote on 22.07.2019 at
> > > > 18:14 in message
> 
> <e78507363a1bdc8c718d96ba5a339ae31c3e59d1.camel at redhat.com>:
> > On Mon, 2019-07-22 at 15:45 +0200, Ulrich Windl wrote:
> > > Hi!
> > > 
> > > My RA actually sends OCF_ERR_ARGS if checking the args detects a
> > > problem.
> > > But as the error can sometimes be resolved without changing the
> > > args (e.g. providing some resource by other means), I suspect CRM
> > > does not handle that properly. Even after a resource cleanup.
> > > 
> > > My RA logs any parameter check, and I can see that no parameter
> > > check is being performed...
> > > 
> > > I also noticed that the "invalid parameter" persists on a node
> > > even after restarting pacemaker on that node.
> > 
> > Pacemaker treats OCF_ERR_ARGS as a "hard" failure, meaning it won't
> > be retried on the same node. But it should attempt to start on any
> > other eligible nodes.
> 
> This makes _some_ sense: If the parameters are unacceptable
> (OCF_ERR_ARGS) it really makes no sense to retry (like having
> specified a host name that does not exist).
> However there are _two_ events that may change the state:
> 
> 1) If the parameters (e.g. hostname) are changed
> 
> 2) If the configuration outside the cluster was changed (e.g. making
> the hostname valid now)
> 
> In the light of 2) I don't see why a resource cleanup does not
> reset the error condition. That is really unexpected.
> 
> > 
> > The failure should be cleared by either cleanup or pacemaker
> > restart.
> 
> According to my impression a cleanup did not change the condition but
> a cluster node restart did.

If a cleanup doesn't take care of it, something's going wrong.
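
For reference, clearing the recorded failure for one resource on one node can be sketched as below. This is only an illustration: the resource and node names are taken from the output quoted in this thread, the run and clear_failure helpers are made up for the example, and the sketch defaults to printing the commands instead of executing them.

```shell
#!/bin/sh
# Hypothetical helper: with DRY_RUN=1 (the default here) the command is only
# printed; set DRY_RUN=0 to actually execute it on a cluster node.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Hypothetical wrapper: clear the failure history of resource $1 on node $2,
# then show a one-shot status including fail counts.
clear_failure() {
    rsc=$1
    node=$2
    run crm_resource --cleanup --resource "$rsc" --node "$node"
    run crm_mon -1 -f
}

# Names as they appear in the thread.
clear_failure prm_idredir_test h02
```

If the "invalid parameter" entry is still shown after the cleanup and no new start attempt appears in the RA's own log, that would match the behaviour described above.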

> 
> > That's the mystery here. I can't even imagine how it would be
> > possible to survive a pacemaker restart -- are you sure it wasn't
> > simply a new attempt getting the same result?
> 
> According to the logs of my RA there were fewer parameter checks than
> expected, and the only explanation I could think of was that the
> result was cached somewhere.
> 
> 
> > 
> > > 
> > > So:
> > > # crm_resource -r prm_idredir_test -VV start
> > >  warning: unpack_rsc_op_failure:        Processing failed start of prm_idredir_test on h02: invalid parameter | rc=2
> > > 
> > > (Start was not even tried)
> > > 
> > > Eventually I was able to start the resource. Some other process
> > > had a socket address in use that my resource needed...
> > 
> > Since you control the RA, you might want to set exit reasons, which
> > will be shown in the status display (the exitreason='' in your
> > output below). There's an ocf_exit_reason convenience function, e.g.
> > 
> >    ocf_exit_reason "Some other process has the socket address in use"
> >    exit $OCF_ERR_ARGS
> 
> Oh, this must be rather new ;-)
> 
> Since when is that available?
> 
> Regards,
> Ulrich

If you consider 2014 new :)

Of course it always takes a little longer to find its way into
distributions.
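
As a slightly fuller sketch of the same idea, a parameter check in an RA could look like the following. The parameter name "address" and the function validate_params are invented for illustration. In a real agent, ocf_exit_reason and the OCF_* return codes come from sourcing ocf-shellfuncs (resource-agents); the stand-ins here only exist so the sketch can be run outside a cluster.

```shell
#!/bin/sh
# Stand-ins for what ocf-shellfuncs normally provides (assumption: this
# sketch is run outside a cluster, so nothing has been sourced).
: "${OCF_SUCCESS:=0}"
: "${OCF_ERR_ARGS:=2}"
command -v ocf_exit_reason >/dev/null 2>&1 || ocf_exit_reason() {
    # Simplified stand-in: just write the reason to stderr.
    echo "ocf-exit-reason:$*" >&2
}

# Hypothetical check: "address" is an assumed parameter name, not one taken
# from the RA discussed in this thread.
validate_params() {
    if [ -z "${OCF_RESKEY_address:-}" ]; then
        ocf_exit_reason "Parameter 'address' is not set"
        return "$OCF_ERR_ARGS"
    fi
    return "$OCF_SUCCESS"
}
```

The start action would then call something like `validate_params || exit $?`, so the reason text ends up in exitreason='' next to the rc=2 in the status display.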
-- 
Ken Gaillot <kgaillot at redhat.com>