[Pacemaker] STONITH request failed

Maros Timko timkom at gmail.com
Thu Nov 12 05:07:00 EST 2009


Hi all,

has anybody experienced following.
We were simulating the failure of a resource stop with STONITH enabled. The
node that failed to stop the resource requested to stonith itself. However,
the request was not processed by peer node. It has not logged anything into
log file at that time. Interestingly, this scenario works most of the time.
This was justthe first one but we are not aware of anything that was
modified on the system.
Could anyone try to explain how such situation could occur? Excerpt of logs:
SP1:
pengine[1186]: 2009/11/10_10:09:07 notice: native_print: SP2-stonith
(stonith:external/vsp-ipmi):    Started sp2
pengine[1186]: 2009/11/10_10:09:07 notice: native_print: SP1-stonith
(stonith:external/vsp-ipmi):    Started sp1
pengine[1186]: 2009/11/10_10:09:07 notice: group_print: Resource Group: udom
pengine[1186]: 2009/11/10_10:09:07 notice: native_print:
udom-drbd-udom0    (heartbeat:drbddisk):    Started sp1
pengine[1186]: 2009/11/10_10:09:07 notice: native_print:
udom-drbd-udom1    (heartbeat:drbddisk):    Started sp1
pengine[1186]: 2009/11/10_10:09:07 notice: native_print:
udom-drbd-udom2    (heartbeat:drbddisk):    Started sp1
pengine[1186]: 2009/11/10_10:09:07 notice: native_print:     udom-delay
(ocf::heartbeat:Delay):    Started sp1
pengine[1186]: 2009/11/10_10:09:07 notice: native_print:     udom-vm
(ocf::heartbeat:Xen):    Started sp1
Initiated resource stop:
...
Xen[23346][24689]: 2009/11/10_10:09:38 INFO: Xen domain udom stopped.
crmd[966]: 2009/11/10_10:09:38 info: process_lrm_event: LRM operation
udom-vm_stop_0 (call=27, rc=0, cib-update=121, confirmed=true) complete ok
...
crmd[966]: 2009/11/10_10:09:48 info: process_lrm_event: LRM operation
udom-delay_stop_0 (call=30, rc=0, cib-update=124, confirmed=true) complete
ok
...
lrmd[963]: 2009/11/10_10:09:48 info: RA output:
(udom-drbd-udom2:stop:stderr) 3: State change failed: (-12) Device is held
open by someone
lrmd[963]: 2009/11/10_10:09:48 info: RA output:
(udom-drbd-udom2:stop:stderr) Command '
lrmd[963]: 2009/11/10_10:09:48 info: RA output:
(udom-drbd-udom2:stop:stderr) /sbin/drbdsetup
lrmd[963]: 2009/11/10_10:09:48 info: RA output:
(udom-drbd-udom2:stop:stderr)
lrmd[963]: 2009/11/10_10:09:48 info: RA output:
(udom-drbd-udom2:stop:stderr) 3
lrmd[963]: 2009/11/10_10:09:48 info: RA output:
(udom-drbd-udom2:stop:stderr)
lrmd[963]: 2009/11/10_10:09:48 info: RA output:
(udom-drbd-udom2:stop:stderr) secondary
lrmd[963]: 2009/11/10_10:09:48 info: RA output:
(udom-drbd-udom2:stop:stderr) ' terminated with exit code 11
lrmd[963]: 2009/11/10_10:09:48 info: RA output:
(udom-drbd-udom2:stop:stderr) /sbin/drbdadm secondary udom2: exit code 11,
mapping to 1
crmd[966]: 2009/11/10_10:09:48 info: process_lrm_event: LRM operation
udom-drbd-udom2_stop_0 (call=32, rc=1, cib-update=125, confirmed=true)
complete unknown error
cib[962]: 2009/11/10_10:09:48 info: cib_process_request: Operation complete:
op cib_modify for section 'all' (origin=local/crmd/125): ok (rc=0)
crmd[966]: 2009/11/10_10:09:48 WARN: status_from_rc: Action 22
(udom-drbd-udom2_stop_0) on sp1 failed (target: 0 vs. rc: 1): Error
crmd[966]: 2009/11/10_10:09:48 WARN: update_failcount: Updating failcount
for udom-drbd-udom2 on d3463b4e-6f4a-4ced-8b29-533c3f5c6744 after failed
stop: rc=1 (update=INFINITY, time=1257865788)

Decision to STONITH itself:
pengine[1186]: 2009/11/10_10:10:48 WARN: native_color: Resource SP1-stonith
cannot run anywhere
pengine[1186]: 2009/11/10_10:10:48 WARN: stage6: Scheduling Node sp1 for
STONITH
crmd[966]: 2009/11/10_10:10:48 info: te_fence_node: Executing reboot fencing
operation (36) on sp1 (timeout=60000)
stonithd[964]: 2009/11/10_10:10:48 info: client tengine [pid: 966] requests
a STONITH operation RESET on node sp1
stonithd[964]: 2009/11/10_10:10:48 info: we can't manage sp1, broadcast
request to other nodes
stonithd[964]: 2009/11/10_10:10:48 info: Broadcasting the message succeeded:
require others to stonith node sp1.
stonithd[964]: 2009/11/10_10:11:48 ERROR: Failed to STONITH the node sp1:
optype=RESET, op_result=TIMEOUT
...
crmd[966]: 2009/11/10_10:11:48 info: tengine_stonith_callback: call=-3,
optype=1, node_name=sp1, result=2, node_list=,
action=36:36:0:3f220573-b709-4243-bf70-5cc5b17c7a57
crmd[966]: 2009/11/10_10:11:48 ERROR: tengine_stonith_callback: Stonith of
sp1 failed (2)... aborting transition.
and again
pengine[1186]: 2009/11/10_10:11:48 WARN: stage6: Scheduling Node sp1 for
STONITH
stonithd[964]: 2009/11/10_10:11:48 info: client tengine [pid: 966] requests
a STONITH operation RESET on node sp1
crmd[966]: 2009/11/10_10:11:48 info: te_fence_node: Executing reboot fencing
operation (35) on sp1 (timeout=60000)
stonithd[964]: 2009/11/10_10:12:48 ERROR: Failed to STONITH the node sp1:
optype=RESET, op_result=TIMEOUT
crmd[966]: 2009/11/10_10:12:48 info: tengine_stonith_callback: call=-4,
optype=1, node_name=sp1, result=2, node_list=,
action=35:37:0:3f220573-b709-4243-bf70-5cc5b17c7a57
crmd[966]: 2009/11/10_10:12:48 ERROR: tengine_stonith_callback: Stonith of
sp1 failed (2)... aborting transition.
crmd[966]: 2009/11/10_10:12:48 info: abort_transition_graph:
tengine_stonith_callback:391 - Triggered transition abort (complete=0) :
Stonith failed


While logs of the peer node:
cib[4608]: 2009/11/10_10:08:59 info: retrieveCib: Reading cluster
configuration from: /var/lib/heartbeat/crm/cib.xml (digest:
/var/lib/heartbeat/crm/cib.xml.sig)
cib[4777]: 2009/11/10_10:11:21 info: write_cib_contents: Wrote version
0.67.0 of the CIB to disk (digest: 93bfe20b495b39ace698b42ce04b8e2c)
cib[4777]: 2009/11/10_10:11:21 info: retrieveCib: Reading cluster
configuration from: /var/lib/heartbeat/crm/cib.xml (digest:
/var/lib/heartbeat/crm/cib.xml.sig)
cib[4796]: 2009/11/10_10:13:42 info: write_cib_contents: Wrote version
0.68.0 of the CIB to disk (digest: 9162eeefc80fe9f83876e6324e58c2c7)
cib[4796]: 2009/11/10_10:13:42 info: retrieveCib: Reading cluster
configuration from: /var/lib/heartbeat/crm/cib.xml (digest:
/var/lib/heartbeat/crm/cib.xml.sig)
cib[3110]: 2009/11/10_10:15:35 info: cib_stats: Processed 11 operations
(0.00us average, 0% utilization) in the last 10min

Simply no message to stonithd. Is stonith dependent on something on the
system? Connection between heartbeats worked fine.

# rpm -qa|egrep 'heart|pacem'
libheartbeat2-2.99.2-6.1
heartbeat-common-2.99.2-6.1
heartbeat-resources-2.99.2-6.1
heartbeat-2.99.2-6.1
pacemaker-1.0.2-11.1
libpacemaker3-1.0.2-11.1

# uname -a
Linux sp1 2.6.18-128.7.1.el5xen #1 SMP Mon Aug 24 09:14:33 EDT 2009 x86_64
x86_64 x86_64 GNU/Linux
CentOS5.3

Thanks,

Tino
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.clusterlabs.org/pipermail/pacemaker/attachments/20091112/7238dce8/attachment.html>


More information about the Pacemaker mailing list