[ClusterLabs] What is timeout for initial fencing after startup?

Andrei Borzenkov arvidjaar at gmail.com
Fri Feb 27 04:40:50 EST 2015


I'm testing what happens in a 2-node cluster when one node is not
present at startup. It appears that pacemaker gives up trying to
stonith the other node after 10 minutes. Where do these 10 minutes come from?

Feb 27 11:56:33 n1 external/ipmi(stonith_IPMI_n2)[5375]: [6017]:
ERROR: error executing ipmitool: Error: Unable to establish LAN
session Unable to set Chassis Power Control to Reset
...
Feb 27 12:06:35 n1 external/ipmi(stonith_IPMI_n2)[2016]: [2594]:
ERROR: error executing ipmitool: Error: Unable to establish LAN
session Unable to set Chassis Power Control to Reset
Feb 27 12:06:36 n1 stonith: external_reset_req: 'ipmi reset' for host
n2 failed with rc 1
Feb 27 12:06:57 n1 stonith-ng[4518]:    error: log_operation:
Operation 'reboot' [2667] (call 12 from crmd.4522) for host 'n2' with
device 'stonith_IPMI_n2' returned: -62 (Timer expired)
Feb 27 12:06:57 n1 stonith-ng[4518]:    error: remote_op_done:
Operation reboot of n2 by n1 for crmd.4522 at n1.e476ee9c: Timer expired
Feb 27 12:06:57 n1 crmd[4522]:   notice: tengine_stonith_callback:
Stonith operation 12/42:10:0:37c3b9ca-1aa9-444d-96db-f074a4819e6f:
Timer expired (-62)
Feb 27 12:06:57 n1 crmd[4522]:   notice: tengine_stonith_callback:
Stonith operation 12 for n2 failed (Timer expired): aborting
transition.
Feb 27 12:06:57 n1 crmd[4522]:   notice: tengine_stonith_notify: Peer
n2 was not terminated (reboot) by n1 for n1: Timer expired
(ref=e476ee9c-9865-400e-871b-f3a7c3c92b8b) by client crmd.4522
Feb 27 12:06:57 n1 crmd[4522]:   notice: run_graph: Transition 10
(Complete=3, Pending=0, Fired=0, Skipped=21, Incomplete=1,
Source=/var/lib/pacemaker/pengine/pe-warn-7.bz2): Stopped
Feb 27 12:06:57 n1 crmd[4522]:   notice: too_many_st_failures: Too
many failures to fence n2 (11), giving up

primitive stonith_IPMI_n2 stonith:external/ipmi \
        params userid="XXX" passwd="XXX" hostname="n2" ipaddr="192.168.33.10" \
        op start timeout="20" interval="0" \
        op monitor interval="3600" timeout="20" \
        op stop timeout="15" interval="0" \
        meta target-role="Started"
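
I have not set any of the per-device pcmk_* overrides either; if I were
to experiment with them, my understanding is it would look roughly like
this (pcmk_reboot_timeout / pcmk_reboot_retries names taken from the
stonithd metadata, so treat it as a sketch, not something in my config):

# not in my configuration today; only the kind of per-device override
# I would try, to control how long a single reboot attempt may take
crm resource param stonith_IPMI_n2 set pcmk_reboot_timeout 120s
crm resource param stonith_IPMI_n2 set pcmk_reboot_retries 2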

What surprises me is that there are also no failures; I'd expect the
stonith resource to be in a failed state, and that is what I observe
with pacemaker 1.1.9, but not after updating to 1.1.11:

n1:~ # crm_mon -1f
Last updated: Fri Feb 27 12:22:19 2015
Last change: Wed Feb 25 19:39:19 2015 by root via crm_attribute on n2
Stack: classic openais (with plugin)
Current DC: n1 - partition WITHOUT quorum
Version: 1.1.11-3ca8c3b
2 Nodes configured, 2 expected votes
13 Resources configured


Node n2: UNCLEAN (offline)
Online: [ n1 ]

Migration summary:
* Node n1:
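
Double-checking the failcount directly (crmsh syntax as I understand
it), to make sure crm_mon is not simply hiding it:

# per-node failcount for the fencing resource; with 1.1.9 I would expect
# this to be non-zero after the failed reboot attempts
crm resource failcount stonith_IPMI_n2 show n1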

Hmm ... it appears pacemaker tries stonith again after 15 minutes ...
and retries it every 15 minutes thereafter. Where does *this* timer
(15 minutes) come from? I definitely see neither 15m nor 900s in the
configuration.
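
The only built-in 15-minute value I can think of is
cluster-recheck-interval, which I believe defaults to 15min when unset;
that would explain the retry cadence. This is how I would check it and,
as an experiment, lower it (crm_attribute / crmsh syntax as documented):

# not set explicitly in my CIB, so the 15min default should apply
# (if I read the documentation right)
crm_attribute --type crm_config --name cluster-recheck-interval --query

# experiment: make the policy engine re-run (and re-attempt fencing) sooner
crm configure property cluster-recheck-interval="5min"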
