[ClusterLabs] Resource start and stop run into timeout

Ken Gaillot kgaillot at redhat.com
Fri Jun 5 11:33:53 EDT 2020


On Fri, 2020-06-05 at 13:06 +0200, joerg at die-kilians.eu wrote:
> Dear all,
> 
> I have an issue on my cluster running Pacemaker 2.0.3.
>  From time to time my resources fail to start or fail to stop. The 
> cluster runs into an timeout for the command.

This is a known bug in 2.0.3 when it is built with
-UPCMK_TIME_EMERGENCY_CGT in the C flags, which is not the default.
Fedora 31 and an Ubuntu release are known to have used that option (bug
reports are open for both distributions, and hopefully updated packages
will be available before long).

The only fix is to use a different pacemaker version on those
platforms.

> Between the attempt to start the resource and the timeout is only 1 
> second (between Jun 04 13:48:37 and Jun 04 13:48:38).
> 
> The resource is defined like this:
> 	Resource: nmb (class=systemd type=nmb)
> 	  Operations: monitor interval=60s timeout=100s 
> (nmb-monitor-interval-60s)
>          	      start interval=0s timeout=600s (nmb-start-
> interval-0s)
> 	              stop interval=0s timeout=600s (nmb-stop-interval-
> 0s)
> 
> So the timeout should be 600 seconds. The cluster states that the 
> timeout has reached (timeout (elapsed=647561ms, remaining=-
> 47561ms)), 
> but obviously only 1 second has passed.
> 
> Do you have any hint for me ?
> 
> Below is a log from that time.
> 
> Thanks
> 
> Log:
> Jun 04 13:48:37 node1 pacemaker-controld  [1162]
> (process_graph_event)   
>       info: Transition 641 action 31 (smb_stop_0 on node1) confirmed:
> ok 
> > rc=0 call-id=896
> 
> Jun 04 13:48:37 node1 pacemaker-controld  [1162]
> (te_rsc_command)     
> notice: Initiating stop operation nmb_stop_0 locally on node1 |
> action 
> 29
> Jun 04 13:48:37 node1 pacemaker-execd     [1159] 
> (cancel_recurring_action)    info: Cancelling systemd operation 
> nmb_status_60000
> Jun 04 13:48:37 node1 pacemaker-based     [1157]
> (cib_process_request)   
>       info: Completed cib_modify operation for section status: OK
> (rc=0, 
> origin=node1/crmd/1383, version=1.141.45)
> Jun 04 13:48:37 node1 pacemaker-controld  [1162]
> (do_lrm_rsc_op)      
> info: Performing key=29:641:0:dbf25643-c4e8-4374-8aec-3c9e3ea3d397 
> op=nmb_stop_0
> Jun 04 13:48:37 node1 pacemaker-execd     [1159]
> (log_execute)        
> info: executing - rsc:nmb action:stop call_id:909
> Jun 04 13:48:37 node1 pacemaker-controld  [1162]
> (process_lrm_event)  
> info: Result of monitor operation for nmb on node1: Cancelled |
> call=329 
> key=nmb_monitor_60000 confirmed=true
> Jun 04 13:48:37 node1 pacemaker-controld  [1162]
> (exec_alert_list)    
> info: Sending resource alert via alert to (null)
> Jun 04 13:48:37 node1 pacemaker-based     [1157]
> (cib_process_request)   
>       info: Forwarding cib_modify operation for section status to
> all 
> (origin=local/crmd/1384)
> Jun 04 13:48:37 node1 pacemaker-execd     [1159] 
> (process_lrmd_alert_exec)    info: Executing alert alert for 
> 8f105ac8-de01-4d3b-9881-10963acd7002
> Jun 04 13:48:37 node1 pacemaker-based     [1157]
> (cib_perform_op)     
> info: Diff: --- 1.141.45 2
> Jun 04 13:48:37 node1 pacemaker-based     [1157]
> (cib_perform_op)     
> info: Diff: +++ 1.141.46 (null)
> Jun 04 13:48:37 node1 pacemaker-based     [1157]
> (cib_perform_op)     
> info: +  /cib:  @num_updates=46
> Jun 04 13:48:37 node1 pacemaker-based     [1157]
> (cib_perform_op)     
> info: +  
> /cib/status/node_state[@id='1']/lrm[@id='1']/lrm_resources/lrm_resour
> ce[@id='nmb']/lrm_rsc_op[@id='nmb_last_0']: 
>   @operation_key=nmb_stop_0, @operation=stop, 
> @crm-debug-origin=do_update_resource, 
> @transition-key=29:641:0:dbf25643-c4e8-4374-8aec-3c9e3ea3d397, 
> @transition-magic=-1:193;29:641:0:dbf25643-c4e8-4374-8aec-
> 3c9e3ea3d397, 
> @call-id=-1, @rc-code=193, @op-status=-1, @last-rc-
> change=1591271317, 
> @last-run=1591271317, @e
> Jun 04 13:48:37 node1 pacemaker-based     [1157]
> (cib_process_request)   
>       info: Completed cib_modify operation for section status: OK
> (rc=0, 
> origin=node1/crmd/1384, version=1.141.46)
> Jun 04 13:48:38 node1 pacemaker-execd     [1159]
> (systemd_exec_result)   
>       info: Call to stop passed: /org/freedesktop/systemd1/job/189965
> Jun 04 13:48:38 node1 pacemaker-execd     [1159]
> (action_complete)    
> notice: Giving up on nmb stop (rc=0): timeout (elapsed=647561ms, 
> remaining=-47561ms)
> Jun 04 13:48:38 node1 pacemaker-based     [1157]
> (cib_process_request)   
>       info: Forwarding cib_modify operation for section status to
> all 
> (origin=local/crmd/1385)
> _______________________________________________
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users
> 
> ClusterLabs home: https://www.clusterlabs.org/
> 
-- 
Ken Gaillot <kgaillot at redhat.com>



More information about the Users mailing list