[Pacemaker] Time out issue while stopping resource in pacemaker

Fri Oct 10 04:03:48 EDT 2014

Andrew Beekhof <andrew at ...> writes:

> 
> 
> On 10 Oct 2014, at 12:12 pm, Lax <lkota at ...> wrote:
> 
> > Hi All,
> > 
> > I ran into a time out issue while failing over from master to the peer
> > server and I have a 2 node setup with 2 resources. Though it was working all
> > along, this was the first time this issue is seen for me.
> > 
> > It fail with following error 'error: process_lrm_event: LRM operation
> > resourceB_stop_0 (40) Timed Out (timeout=20000ms)'.
> > 
> 
> Have you considered making the timeout longer?
> 
> > 
> > 
> > Here is the complete log snippet from pacemaker, appreciate your help on
this.
> > 
> > 
> > Oct  9 14:57:38 server1 cib[368]:   notice: cib:diff: Diff: +++ 0.3.1
> > 4e9bfa03cf2fef61843c18e127044d81
> > Oct  9 14:57:38 server1 cib[368]:   notice: cib:diff: -- <cib
> > admin_epoch="0" epoch="2" num_updates="8" />
> > Oct  9 14:57:38 server1 crmd[373]:   notice: do_state_transition: State
> > transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL
> > origin=abort_transition_graph ]
> > Oct  9 14:57:38 server1 cib[368]:   notice: cib:diff: ++        
> > <instance_attributes id="nodes-server1" >
> > Oct  9 14:57:38 server1 cib[368]:   notice: cib:diff: ++           <nvpair
> > id="nodes-server1-standby" name="standby" value="true" />
> > Oct  9 14:57:38 server1 cib[368]:   notice: cib:diff: ++        
> > </instance_attributes>
> > Oct  9 14:57:38 server1 pengine[372]:   notice: unpack_config: On loss of
> > CCM Quorum: Ignore
> > Oct  9 14:57:38 server1 pengine[372]:   notice: LogActions: Move   
> > ClusterIP#011(Started server1 -> 172.28.0.64)
> > Oct  9 14:57:38 server1 pengine[372]:   notice: LogActions: Move   
> > resourceB#011(Started server1 -> 172.28.0.64)
> > Oct  9 14:57:38 server1 pengine[372]:   notice: process_pe_message:
> > Calculated Transition 11: /var/lib/pacemaker/pengine/pe-input-1710.bz2
> > Oct  9 14:57:58 server1 lrmd[370]:  warning: child_timeout_callback:
> > resourceB_stop_0 process (PID 17327) timed out
> > Oct  9 14:57:58 server1 lrmd[370]:  warning: operation_finished:
> > resourceB_stop_0:17327 - timed out after 20000ms
> > Oct  9 14:57:58 server1 lrmd[370]:   notice: operation_finished:
> > resourceB_stop_0:17327 [   % Total    % Received % Xferd  Average Speed  
> > Time    Time     Time  Current ]
> > Oct  9 14:57:58 server1 lrmd[370]:   notice: operation_finished:
> > resourceB_stop_0:17327 [                                  Dload  Upload  
> > Total   Spent    Left  Speed ]
> > Oct  9 14:57:58 server1 lrmd[370]:   notice: operation_finished:
> > resourceB_stop_0:17327 [ #015  0     0    0     0    0     0      0      0
> > --:--:-- --:--:-- --:--:--     0#015  0     0    0     0    0     0      0 
> >    0 --:--:--  0:00:01 --:--:--     0#015  0     0    0     0    0     0  
> >   0      0 --:--:--  0:00:02 --:--:--     0#015  0     0    0     0    0  
> >  0      0      0 --:--:--  0:00:03 --:--:--     0#015  0     0    0     0 
> >  0     0      0      0 --:--:--  0:00:04 --:--:--     0#015  0     0    0 
> >   0    0     0      0      0 --:--:--  0:00:05 -
> > Oct  9 14:57:58 server1 crmd[373]:    error: process_lrm_event: LRM
> > operation resourceB_stop_0 (40) Timed Out (timeout=20000ms)
> > Oct  9 14:57:58 server1 crmd[373]:  warning: status_from_rc: Action 10
> > (resourceB_stop_0) on server1 failed (target: 0 vs. rc: 1): Error
> > Oct  9 14:57:58 server1 crmd[373]:  warning: update_failcount: Updating
> > failcount for resourceB on server1 after failed stop: rc=1 (update=INFINITY,
> > time=1412891878)
> > Oct  9 14:57:58 server1 attrd[371]:   notice: attrd_trigger_update: Sending
> > flush op to all hosts for: fail-count-resourceB (INFINITY)
> > Oct  9 14:57:58 server1 crmd[373]:  warning: update_failcount: Updating
> > failcount for resourceB on server1 after failed stop: rc=1 (update=INFINITY,
> > time=1412891878)
> > Oct  9 14:57:58 server1 crmd[373]:   notice: run_graph: Transition 11
> > (Complete=2, Pending=0, Fired=0, Skipped=9, Incomplete=0,
> > Source=/var/lib/pacemaker/pengine/pe-input-1710.bz2): Stopped
> > Oct  9 14:57:58 server1 attrd[371]:   notice: attrd_perform_update: Sent
> > update 11: fail-count-resourceB=INFINITY
> > 
> > 
> > Thanks
> > Lax
> > 
> > 
> > _______________________________________________
> > Pacemaker mailing list: Pacemaker at ...
> > http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> > 
> > Project Home: http://www.clusterlabs.org
> > Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> > Bugs: http://bugs.clusterlabs.org
> 
> 
> 
> _______________________________________________
> Pacemaker mailing list: Pacemaker at ...
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
> 

Thanks for getting back Andrew.

I have only given timeout for monitor, but not for stop in the resource
definition. Do you mean I increase timeout for stop?

Also when I try to force stop pacemaker service, it keeps saying 'Waiting
for shutdown of managed resources;' and does not stop.

Pacemaker log says failed to stop because of unknown error

Oct 10 00:36:25 server1 crmd[373]:   notice: do_state_transition: State
transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS
cause=C_FSA_INTERNAL origin=notify_crmd ]
Oct 10 00:51:25 server1 crmd[373]:   notice: do_state_transition: State
transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_TIMER_POPPED
origin=crm_timer_popped ]
Oct 10 00:51:25 server1 pengine[372]:   notice: unpack_config: On loss of
CCM Quorum: Ignore
Oct 10 00:51:25 server1 pengine[372]:  warning: unpack_rsc_op: Processing
failed op stop for resourceB on server1.cisco.com: unknown error (1)
Oct 10 00:51:25 server1 pengine[372]:  warning: common_apply_stickiness:
Forcing resourceB away from server1.cisco.com after 1000000 failures
(max=1000000)
Oct 10 00:51:25 server1 pengine[372]:   notice: LogActions: Stop   
ClusterIP#011(Started 172.28.0.64)
Oct 10 00:51:25 server1 pengine[372]:   notice: process_pe_message:
Calculated Transition 56: (null)
Oct 10 00:51:25 server1 crmd[373]:   notice: run_graph: Transition 56
(Complete=2, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=unknown):
Complete
Oct 10 00:51:25 server1 crmd[373]:   notice: do_state_transition: State
transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS
cause=C_FSA_INTERNAL origin=notify_crmd ]
Oct 10 00:53:03 server1 attrd[371]:   notice: attrd_trigger_update: Sending
flush op to all hosts for: standby (true)
Oct 10 00:53:03 server1 attrd[371]:   notice: attrd_perform_update: Sent
update 21: standby=true
Oct 10 00:53:03 server1 crmd[373]:   notice: do_state_transition: State
transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL
origin=abort_transition_graph ]
Oct 10 00:53:03 server1 pengine[372]:   notice: unpack_config: On loss of
CCM Quorum: Ignore
Oct 10 00:53:03 server1 pengine[372]:  warning: unpack_rsc_op: Processing
failed op stop for resourceB on server1.cisco.com: unknown error (1)
Oct 10 00:53:03 server1 pengine[372]:  warning: common_apply_stickiness:
Forcing resourceB away from server1.cisco.com after 1000000 failures
(max=1000000)
Oct 10 00:53:03 server1 pengine[372]:   notice: LogActions: Stop   
ClusterIP#011(Started 172.28.0.64)
Oct 10 00:53:03 server1 pengine[372]:   notice: process_pe_message:
Calculated Transition 57: /var/lib/pacemaker/pengine/pe-input-1717.bz2
Oct 10 00:53:03 server1 crmd[373]:   notice: run_graph: Transition 57
(Complete=2, Pending=0, Fired=0, Skipped=0, Incomplete=0,
Source=/var/lib/pacemaker/pengine/pe-input-1717.bz2): Complete
Oct 10 00:53:03 server1 crmd[373]:   notice: do_state_transition: State
transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS
cause=C_FSA_INTERNAL origin=notify_crmd ]

Thanks
Lax