[ClusterLabs] What is timeout for initial fencing after startup?

Fri Feb 27 21:45:56 UTC 2015

On 2015-02-27 10:40, Andrei Borzenkov wrote:
> I'm testing what happens in a 2-node cluster when one node is not
> present at startup. It appears that pacemaker gives up its attempt to
> stonith the other node after 10 minutes. Where do these 10 minutes
> come from?

I'd say these 10 minutes come from the cluster property 'stonith-timeout',
with its default of 60s, combined with the feature that stops trying to
fence a node after the 10th failed fencing attempt: 10 attempts of up to
60s each add up to roughly 10 minutes.
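
As a rough sanity check of that arithmetic, here is a minimal Python
sketch. It assumes every attempt runs into the full 60s timeout and that
the limit is 10 failed attempts, as described above; the function names
are made up for illustration, and the two timestamps are simply the first
and last ones visible in the quoted log excerpt below, which may not be
the very first attempt.

```python
from datetime import datetime

# Assumptions taken from the explanation above, not read from a live cluster:
STONITH_TIMEOUT_S = 60      # default of the 'stonith-timeout' property
MAX_FAILED_ATTEMPTS = 10    # fencing is abandoned after the 10th failure


def estimated_give_up_window(timeout_s: int, attempts: int) -> int:
    """Estimate in seconds, assuming each attempt hits the full timeout."""
    return timeout_s * attempts


def elapsed_seconds(first: str, last: str, year: int = 2015) -> int:
    """Seconds between two syslog-style timestamps within the same year."""
    fmt = "%Y %b %d %H:%M:%S"
    start = datetime.strptime(f"{year} {first}", fmt)
    end = datetime.strptime(f"{year} {last}", fmt)
    return int((end - start).total_seconds())


estimate = estimated_give_up_window(STONITH_TIMEOUT_S, MAX_FAILED_ATTEMPTS)
# First and last timestamps visible in the quoted log excerpt.
observed = elapsed_seconds("Feb 27 11:56:33", "Feb 27 12:06:57")

print(f"estimated: {estimate}s, observed in excerpt: {observed}s")
```

The estimate (600s) and the span covered by the excerpt (624s) are close
enough to be consistent with this explanation, though the excerpt alone
does not prove it.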

Regards,
Andreas

> 
> Feb 27 11:56:33 n1 external/ipmi(stonith_IPMI_n2)[5375]: [6017]:
> ERROR: error executing ipmitool: Error: Unable to establish LAN
> session Unable to set Chassis Power Control to Reset
> ...
> Feb 27 12:06:35 n1 external/ipmi(stonith_IPMI_n2)[2016]: [2594]:
> ERROR: error executing ipmitool: Error: Unable to establish LAN
> session Unable to set Chassis Power Control to Reset
> Feb 27 12:06:36 n1 stonith: external_reset_req: 'ipmi reset' for host
> n2 failed with rc 1
> Feb 27 12:06:57 n1 stonith-ng[4518]:    error: log_operation:
> Operation 'reboot' [2667] (call 12 from crmd.4522) for host 'n2' with
> device 'stonith_IPMI_n2' returned: -62 (Timer expired)
> Feb 27 12:06:57 n1 stonith-ng[4518]:    error: remote_op_done:
> Operation reboot of n2 by n1 for crmd.4522 at n1.e476ee9c: Timer expired
> Feb 27 12:06:57 n1 crmd[4522]:   notice: tengine_stonith_callback:
> Stonith operation 12/42:10:0:37c3b9ca-1aa9-444d-96db-f074a4819e6f:
> Timer expired (-62)
> Feb 27 12:06:57 n1 crmd[4522]:   notice: tengine_stonith_callback:
> Stonith operation 12 for n2 failed (Timer expired): aborting
> transition.
> Feb 27 12:06:57 n1 crmd[4522]:   notice: tengine_stonith_notify: Peer
> n2 was not terminated (reboot) by n1 for n1: Timer expired
> (ref=e476ee9c-9865-400e-871b-f3a7c3c92b8b) by client crmd.4522
> Feb 27 12:06:57 n1 crmd[4522]:   notice: run_graph: Transition 10
> (Complete=3, Pending=0, Fired=0, Skipped=21, Incomplete=1,
> Source=/var/lib/pacemaker/pengine/pe-warn-7.bz2): Stopped
> Feb 27 12:06:57 n1 crmd[4522]:   notice: too_many_st_failures: Too
> many failures to fence n2 (11), giving up
> 
> primitive stonith_IPMI_n2 stonith:external/ipmi \
>         params userid="XXX" passwd="XXX" hostname="n2" ipaddr="192.168.33.10" \
>         op start timeout="20" interval="0" \
>         op monitor interval="3600" timeout="20" \
>         op stop timeout="15" interval="0" \
>         meta target-role="Started"
> 
> What surprises me is that there are also no failures; I'd expect the
> stonith resource to be in a failed state, and that is what I observe
> with pacemaker 1.1.9, but not after the update to 1.1.11:
> 
> n1:~ # crm_mon -1f
> Last updated: Fri Feb 27 12:22:19 2015
> Last change: Wed Feb 25 19:39:19 2015 by root via crm_attribute on n2
> Stack: classic openais (with plugin)
> Current DC: n1 - partition WITHOUT quorum
> Version: 1.1.11-3ca8c3b
> 2 Nodes configured, 2 expected votes
> 13 Resources configured
> 
> 
> Node n2: UNCLEAN (offline)
> Online: [ n1 ]
> 
> Migration summary:
> * Node n1:
> 
> Hmm ... it appears pacemaker tries stonith again after 15 minutes ...
> and retries it every 15 minutes thereafter. Where does *this* timer
> (15 minutes) come from? I definitely see neither 15m nor 900s in the
> configuration.
