[Pacemaker] Time out issue while stopping resource in pacemaker

Mon Oct 13 01:34:42 EDT 2014

On 10 Oct 2014, at 7:03 pm, Lax <lkota at cisco.com> wrote:

> Andrew Beekhof <andrew at ...> writes:
> 
>> 
>> 
>> On 10 Oct 2014, at 12:12 pm, Lax <lkota at ...> wrote:
>> 
>>> Hi All,
>>> 
>>> I ran into a time out issue while failing over from master to the peer
>>> server and I have a 2 node setup with 2 resources. Though it was working all
>>> along, this was the first time this issue is seen for me.
>>> 
>>> It fail with following error 'error: process_lrm_event: LRM operation
>>> resourceB_stop_0 (40) Timed Out (timeout=20000ms)'.
>>> 
>> 
>> Have you considered making the timeout longer?
>> 
>>> 
>>> 
>>> Here is the complete log snippet from pacemaker, appreciate your help on
> this.
>>> 
>>> 
>>> Oct  9 14:57:38 server1 cib[368]:   notice: cib:diff: Diff: +++ 0.3.1
>>> 4e9bfa03cf2fef61843c18e127044d81
>>> Oct  9 14:57:38 server1 cib[368]:   notice: cib:diff: -- <cib
>>> admin_epoch="0" epoch="2" num_updates="8" />
>>> Oct  9 14:57:38 server1 crmd[373]:   notice: do_state_transition: State
>>> transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL
>>> origin=abort_transition_graph ]
>>> Oct  9 14:57:38 server1 cib[368]:   notice: cib:diff: ++        
>>> <instance_attributes id="nodes-server1" >
>>> Oct  9 14:57:38 server1 cib[368]:   notice: cib:diff: ++           <nvpair
>>> id="nodes-server1-standby" name="standby" value="true" />
>>> Oct  9 14:57:38 server1 cib[368]:   notice: cib:diff: ++        
>>> </instance_attributes>
>>> Oct  9 14:57:38 server1 pengine[372]:   notice: unpack_config: On loss of
>>> CCM Quorum: Ignore
>>> Oct  9 14:57:38 server1 pengine[372]:   notice: LogActions: Move   
>>> ClusterIP#011(Started server1 -> 172.28.0.64)
>>> Oct  9 14:57:38 server1 pengine[372]:   notice: LogActions: Move   
>>> resourceB#011(Started server1 -> 172.28.0.64)
>>> Oct  9 14:57:38 server1 pengine[372]:   notice: process_pe_message:
>>> Calculated Transition 11: /var/lib/pacemaker/pengine/pe-input-1710.bz2
>>> Oct  9 14:57:58 server1 lrmd[370]:  warning: child_timeout_callback:
>>> resourceB_stop_0 process (PID 17327) timed out
>>> Oct  9 14:57:58 server1 lrmd[370]:  warning: operation_finished:
>>> resourceB_stop_0:17327 - timed out after 20000ms
>>> Oct  9 14:57:58 server1 lrmd[370]:   notice: operation_finished:
>>> resourceB_stop_0:17327 [   % Total    % Received % Xferd  Average Speed  
>>> Time    Time     Time  Current ]
>>> Oct  9 14:57:58 server1 lrmd[370]:   notice: operation_finished:
>>> resourceB_stop_0:17327 [                                  Dload  Upload  
>>> Total   Spent    Left  Speed ]
>>> Oct  9 14:57:58 server1 lrmd[370]:   notice: operation_finished:
>>> resourceB_stop_0:17327 [ #015  0     0    0     0    0     0      0      0
>>> --:--:-- --:--:-- --:--:--     0#015  0     0    0     0    0     0      0 
>>>   0 --:--:--  0:00:01 --:--:--     0#015  0     0    0     0    0     0  
>>>  0      0 --:--:--  0:00:02 --:--:--     0#015  0     0    0     0    0  
>>> 0      0      0 --:--:--  0:00:03 --:--:--     0#015  0     0    0     0 
>>> 0     0      0      0 --:--:--  0:00:04 --:--:--     0#015  0     0    0 
>>>  0    0     0      0      0 --:--:--  0:00:05 -
>>> Oct  9 14:57:58 server1 crmd[373]:    error: process_lrm_event: LRM
>>> operation resourceB_stop_0 (40) Timed Out (timeout=20000ms)
>>> Oct  9 14:57:58 server1 crmd[373]:  warning: status_from_rc: Action 10
>>> (resourceB_stop_0) on server1 failed (target: 0 vs. rc: 1): Error
>>> Oct  9 14:57:58 server1 crmd[373]:  warning: update_failcount: Updating
>>> failcount for resourceB on server1 after failed stop: rc=1 (update=INFINITY,
>>> time=1412891878)
>>> Oct  9 14:57:58 server1 attrd[371]:   notice: attrd_trigger_update: Sending
>>> flush op to all hosts for: fail-count-resourceB (INFINITY)
>>> Oct  9 14:57:58 server1 crmd[373]:  warning: update_failcount: Updating
>>> failcount for resourceB on server1 after failed stop: rc=1 (update=INFINITY,
>>> time=1412891878)
>>> Oct  9 14:57:58 server1 crmd[373]:   notice: run_graph: Transition 11
>>> (Complete=2, Pending=0, Fired=0, Skipped=9, Incomplete=0,
>>> Source=/var/lib/pacemaker/pengine/pe-input-1710.bz2): Stopped
>>> Oct  9 14:57:58 server1 attrd[371]:   notice: attrd_perform_update: Sent
>>> update 11: fail-count-resourceB=INFINITY
>>> 
>>> 
>>> Thanks
>>> Lax
>>> 
>>> 
>>> _______________________________________________
>>> Pacemaker mailing list: Pacemaker at ...
>>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>> 
>>> Project Home: http://www.clusterlabs.org
>>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>>> Bugs: http://bugs.clusterlabs.org
>> 
>> 
>> 
>> _______________________________________________
>> Pacemaker mailing list: Pacemaker at ...
>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>> 
>> Project Home: http://www.clusterlabs.org
>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>> Bugs: http://bugs.clusterlabs.org
>> 
> 
> 
> 
> Thanks for getting back Andrew.
> 
> I have only given timeout for monitor, but not for stop in the resource
> definition. Do you mean I increase timeout for stop?

Right.

> 
> Also when I try to force stop pacemaker service, it keeps saying 'Waiting
> for shutdown of managed resources;' and does not stop.
> 
> Pacemaker log says failed to stop because of unknown error

I'm guessing you don't have stonith?

The underlying philosophy is that the services pacemaker manages need to exit before pacemaker can.
If the service can't stop, it would be dishonest of pacemaker to do so.

If you had fencing, it would have been able to clean up after a failed stop and allow the rest of the cluster to continue.

> 
> Oct 10 00:36:25 server1 crmd[373]:   notice: do_state_transition: State
> transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS
> cause=C_FSA_INTERNAL origin=notify_crmd ]
> Oct 10 00:51:25 server1 crmd[373]:   notice: do_state_transition: State
> transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_TIMER_POPPED
> origin=crm_timer_popped ]
> Oct 10 00:51:25 server1 pengine[372]:   notice: unpack_config: On loss of
> CCM Quorum: Ignore
> Oct 10 00:51:25 server1 pengine[372]:  warning: unpack_rsc_op: Processing
> failed op stop for resourceB on server1.cisco.com: unknown error (1)
> Oct 10 00:51:25 server1 pengine[372]:  warning: common_apply_stickiness:
> Forcing resourceB away from server1.cisco.com after 1000000 failures
> (max=1000000)
> Oct 10 00:51:25 server1 pengine[372]:   notice: LogActions: Stop   
> ClusterIP#011(Started 172.28.0.64)
> Oct 10 00:51:25 server1 pengine[372]:   notice: process_pe_message:
> Calculated Transition 56: (null)
> Oct 10 00:51:25 server1 crmd[373]:   notice: run_graph: Transition 56
> (Complete=2, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=unknown):
> Complete
> Oct 10 00:51:25 server1 crmd[373]:   notice: do_state_transition: State
> transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS
> cause=C_FSA_INTERNAL origin=notify_crmd ]
> Oct 10 00:53:03 server1 attrd[371]:   notice: attrd_trigger_update: Sending
> flush op to all hosts for: standby (true)
> Oct 10 00:53:03 server1 attrd[371]:   notice: attrd_perform_update: Sent
> update 21: standby=true
> Oct 10 00:53:03 server1 crmd[373]:   notice: do_state_transition: State
> transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL
> origin=abort_transition_graph ]
> Oct 10 00:53:03 server1 pengine[372]:   notice: unpack_config: On loss of
> CCM Quorum: Ignore
> Oct 10 00:53:03 server1 pengine[372]:  warning: unpack_rsc_op: Processing
> failed op stop for resourceB on server1.cisco.com: unknown error (1)
> Oct 10 00:53:03 server1 pengine[372]:  warning: common_apply_stickiness:
> Forcing resourceB away from server1.cisco.com after 1000000 failures
> (max=1000000)
> Oct 10 00:53:03 server1 pengine[372]:   notice: LogActions: Stop   
> ClusterIP#011(Started 172.28.0.64)
> Oct 10 00:53:03 server1 pengine[372]:   notice: process_pe_message:
> Calculated Transition 57: /var/lib/pacemaker/pengine/pe-input-1717.bz2
> Oct 10 00:53:03 server1 crmd[373]:   notice: run_graph: Transition 57
> (Complete=2, Pending=0, Fired=0, Skipped=0, Incomplete=0,
> Source=/var/lib/pacemaker/pengine/pe-input-1717.bz2): Complete
> Oct 10 00:53:03 server1 crmd[373]:   notice: do_state_transition: State
> transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS
> cause=C_FSA_INTERNAL origin=notify_crmd ]
> 
> 
> 
> Thanks
> Lax
> 
> 
> 
> _______________________________________________
> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org

-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 841 bytes
Desc: Message signed with OpenPGP using GPGMail
URL: <https://lists.clusterlabs.org/pipermail/pacemaker/attachments/20141013/7ea8f436/attachment-0007.sig>