[ClusterLabs] op stop timeout update causes monitor op to fail?
dennisml at conversis.de
Tue Sep 17 15:41:30 EDT 2019
On 11.09.19 16:51, Ken Gaillot wrote:
> On Tue, 2019-09-10 at 09:54 +0200, Dennis Jacobfeuerborn wrote:
>> I just updated the timeout for the stop operation on an NFS cluster,
>> and while the timeout was being updated the status suddenly showed this:
>> Failed Actions:
>> * nfsserver_monitor_10000 on nfs1aqs1 'unknown error' (1): call=41,
>> status=Timed Out, exitreason='none',
>> last-rc-change='Tue Aug 13 14:14:28 2019', queued=0ms, exec=0ms
> Are you sure it wasn't already showing that? The timestamp of that
> error is Aug 13, while the logs show the timeout update happening in September.
I'm fairly certain. I did a "pcs status" before that operation to check
the state of the cluster.
> Old errors will keep showing up in status until you manually clean them
> up (with crm_resource --cleanup or a higher-level tool equivalent), or
> any configured failure-timeout is reached.
> In any case, the log excerpt shows that nothing went wrong during the
> time it covers. There were no actions scheduled in that transition in
> response to the timeout change (which is as expected).
What about this line:
pengine: warning: unpack_rsc_op_failure: Processing failed op monitor
for nfsserver on nfs1aqs1: unknown error (1)
I cleaned up the error and tried the timeout update again, and this time
it worked. The corresponding line in the log now reads:
pengine: info: determine_op_status: Operation monitor found resource
nfsserver active on nfs1aqs1
What I'm wondering is whether this could be a race condition between
pacemaker updating the resource configuration and the monitor operation
running at the same time.
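For completeness, the change that triggered this was an operation-timeout update along the following lines (a sketch; the timeout and failure-timeout values are illustrative, not the ones actually used):

```shell
# Update the stop operation timeout on the nfsserver resource
pcs resource update nfsserver op stop timeout=120s

# Optionally let old failure records expire on their own instead of
# requiring a manual crm_resource --cleanup
pcs resource meta nfsserver failure-timeout=600
```

The failure-timeout meta attribute is what Ken refers to above: when set, pacemaker clears a recorded failure automatically once that interval has passed.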