[ClusterLabs] op stop timeout update causes monitor op to fail?

Tue Sep 17 20:22:15 EDT 2019

On Tue, 2019-09-17 at 21:41 +0200, Dennis Jacobfeuerborn wrote:
> On 11.09.19 16:51, Ken Gaillot wrote:
> > On Tue, 2019-09-10 at 09:54 +0200, Dennis Jacobfeuerborn wrote:
> > > Hi,
> > > I just updated the timeout for the stop operation on an nfs cluster,
> > > and while the timeout was being updated the status suddenly showed
> > > this:
> > > 
> > > Failed Actions:
> > > * nfsserver_monitor_10000 on nfs1aqs1 'unknown error' (1): call=41,
> > >     status=Timed Out, exitreason='none',
> > >     last-rc-change='Tue Aug 13 14:14:28 2019', queued=0ms, exec=0ms
> > 
> > Are you sure it wasn't already showing that? The timestamp of that
> > error is Aug 13, while the logs show the timeout update happening
> > Sep 10.
> 
> I'm fairly certain. I did a "pcs status" before that operation to check
> the state of the cluster.
> 
> > 
> > Old errors will keep showing up in status until you manually clean
> > them up (with crm_resource --cleanup or a higher-level tool
> > equivalent), or any configured failure-timeout is reached.
> > 
> > In any case, the log excerpt shows that nothing went wrong during the
> > time it covers. There were no actions scheduled in that transition in
> > response to the timeout change (which is as expected).
> 
> What about this line:
> pengine:  warning: unpack_rsc_op_failure:	Processing failed op monitor for nfsserver on nfs1aqs1: unknown error (1)

That's shown whenever there's an uncleaned failure in the cluster
history, not necessarily when the failure first occurred. Improving the
message to make that clearer has been on the to-do list for a long time.
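
To tell an old failure from a new one, compare the last-rc-change
timestamp in the "Failed Actions" entry with the time of the
configuration change. Below is a minimal Python sketch of that
comparison. The timestamp is copied from the status output quoted
above. The change date is given in this thread only as "Sep 10", so the
year is an assumption, and the function name is purely illustrative.

```python
from datetime import date, datetime

# Timestamp copied from the "Failed Actions" entry quoted in this thread.
LAST_RC_CHANGE = "Tue Aug 13 14:14:28 2019"

# Day of the timeout update. The thread only says "Sep 10"; the year
# (2019, matching the failure entry) is an assumption.
CONFIG_CHANGE_DAY = date(2019, 9, 10)

# Layout of the last-rc-change value as it appears in the status output.
STAMP_FORMAT = "%a %b %d %H:%M:%S %Y"


def failure_predates(last_rc_change: str, change_day: date) -> bool:
    """Return True if the recorded failure happened before change_day."""
    failed_at = datetime.strptime(last_rc_change, STAMP_FORMAT)
    return failed_at.date() < change_day


failed_day = datetime.strptime(LAST_RC_CHANGE, STAMP_FORMAT).date()
days_between = (CONFIG_CHANGE_DAY - failed_day).days

print(failure_predates(LAST_RC_CHANGE, CONFIG_CHANGE_DAY), days_between)
```

If the first value printed is True, the failure was already in the
cluster history before the change and cannot have been caused by it.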

> I cleaned up the error and tried this again and this time it worked.
> The corresponding line in the log now reads:
> pengine:     info: determine_op_status:	Operation monitor found resource nfsserver active on nfs1aqs1
> 
> What I'm wondering is if this could be a race condition between
> pacemaker updating the resource and the monitor operation.

I don't think so -- the cluster didn't take any action in response to
the timeout change. Just before any "Calculated transition" message,
the "LogActions" lines will show what the cluster needs to do in
response to the event. In this case they are all "Leave", which means
do nothing.
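
As a rough way to check this on a longer log, the Python sketch below
lists every "LogActions" entry whose action is something other than
"Leave". The sample lines are hypothetical: their layout is an
assumption modeled on the pengine lines quoted in this thread, so the
pattern may need adjusting for the real log. The function name is also
just illustrative.

```python
import re

# Hypothetical sample lines. Their shape is an assumption modeled on the
# pengine lines quoted in this thread; they are not copied from a real log.
SAMPLE_LOG = """\
pengine:     info: LogActions:\tLeave   nfsserver\t(Started nfs1aqs1)
pengine:   notice: LogActions:\tRestart nfsserver\t(Started nfs1aqs1)
pengine:     info: determine_op_status:\tOperation monitor found resource nfsserver active on nfs1aqs1
"""

# Captures the word after "LogActions:" (the action) and the next token
# (the resource name).
LOG_ACTION = re.compile(r"LogActions:\s+(\w+)\s+(\S+)")


def planned_changes(log_text):
    """Return (action, resource) pairs for LogActions lines that are not "Leave"."""
    changes = []
    for line in log_text.splitlines():
        match = LOG_ACTION.search(line)
        if match and match.group(1) != "Leave":
            changes.append((match.group(1), match.group(2)))
    return changes


print(planned_changes(SAMPLE_LOG))
```

An empty list means every entry was "Leave", i.e. the transition did
not schedule anything.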

> 
> Regards,
>   Dennis
-- 
Ken Gaillot <kgaillot at redhat.com>