Hi all,<br><br>Has anybody experienced the following?<br>We were simulating the failure of a resource stop with STONITH enabled. The node that failed to stop the resource requested a STONITH of itself. However, the request was never processed by the peer node, which logged nothing at all at that time. Interestingly, this scenario works most of the time; this was the first failure, and we are not aware of anything having been modified on the system.<br>
Could anyone explain how such a situation could occur? An excerpt of the logs:<br>SP1:<br>pengine[1186]: 2009/11/10_10:09:07 notice: native_print: SP2-stonith (stonith:external/vsp-ipmi): Started sp2<br>pengine[1186]: 2009/11/10_10:09:07 notice: native_print: SP1-stonith (stonith:external/vsp-ipmi): Started sp1<br>
pengine[1186]: 2009/11/10_10:09:07 notice: group_print: Resource Group: udom<br>pengine[1186]: 2009/11/10_10:09:07 notice: native_print: udom-drbd-udom0 (heartbeat:drbddisk): Started sp1<br>pengine[1186]: 2009/11/10_10:09:07 notice: native_print: udom-drbd-udom1 (heartbeat:drbddisk): Started sp1<br>
pengine[1186]: 2009/11/10_10:09:07 notice: native_print: udom-drbd-udom2 (heartbeat:drbddisk): Started sp1<br>pengine[1186]: 2009/11/10_10:09:07 notice: native_print: udom-delay (ocf::heartbeat:Delay): Started sp1<br>
pengine[1186]: 2009/11/10_10:09:07 notice: native_print: udom-vm (ocf::heartbeat:Xen): Started sp1<br>Initiated resource stop:<br>...<br>Xen[23346][24689]: 2009/11/10_10:09:38 INFO: Xen domain udom stopped.<br>crmd[966]: 2009/11/10_10:09:38 info: process_lrm_event: LRM operation udom-vm_stop_0 (call=27, rc=0, cib-update=121, confirmed=true) complete ok<br>
...<br>crmd[966]: 2009/11/10_10:09:48 info: process_lrm_event: LRM operation udom-delay_stop_0 (call=30, rc=0, cib-update=124, confirmed=true) complete ok<br>...<br>lrmd[963]: 2009/11/10_10:09:48 info: RA output: (udom-drbd-udom2:stop:stderr) 3: State change failed: (-12) Device is held open by someone<br>
lrmd[963]: 2009/11/10_10:09:48 info: RA output: (udom-drbd-udom2:stop:stderr) Command '<br>lrmd[963]: 2009/11/10_10:09:48 info: RA output: (udom-drbd-udom2:stop:stderr) /sbin/drbdsetup<br>lrmd[963]: 2009/11/10_10:09:48 info: RA output: (udom-drbd-udom2:stop:stderr) <br>
lrmd[963]: 2009/11/10_10:09:48 info: RA output: (udom-drbd-udom2:stop:stderr) 3<br>lrmd[963]: 2009/11/10_10:09:48 info: RA output: (udom-drbd-udom2:stop:stderr) <br>lrmd[963]: 2009/11/10_10:09:48 info: RA output: (udom-drbd-udom2:stop:stderr) secondary<br>
lrmd[963]: 2009/11/10_10:09:48 info: RA output: (udom-drbd-udom2:stop:stderr) ' terminated with exit code 11<br>lrmd[963]: 2009/11/10_10:09:48 info: RA output: (udom-drbd-udom2:stop:stderr) /sbin/drbdadm secondary udom2: exit code 11, mapping to 1<br>
crmd[966]: 2009/11/10_10:09:48 info: process_lrm_event: LRM operation udom-drbd-udom2_stop_0 (call=32, rc=1, cib-update=125, confirmed=true) complete unknown error<br>cib[962]: 2009/11/10_10:09:48 info: cib_process_request: Operation complete: op cib_modify for section 'all' (origin=local/crmd/125): ok (rc=0)<br>
crmd[966]: 2009/11/10_10:09:48 WARN: status_from_rc: Action 22 (udom-drbd-udom2_stop_0) on sp1 failed (target: 0 vs. rc: 1): Error<br>crmd[966]: 2009/11/10_10:09:48 WARN: update_failcount: Updating failcount for udom-drbd-udom2 on d3463b4e-6f4a-4ced-8b29-533c3f5c6744 after failed stop: rc=1 (update=INFINITY, time=1257865788)<br>
<br>Decision to STONITH itself:<br>pengine[1186]: 2009/11/10_10:10:48 WARN: native_color: Resource SP1-stonith cannot run anywhere<br>pengine[1186]: 2009/11/10_10:10:48 WARN: stage6: Scheduling Node sp1 for STONITH<br>crmd[966]: 2009/11/10_10:10:48 info: te_fence_node: Executing reboot fencing operation (36) on sp1 (timeout=60000)<br>
stonithd[964]: 2009/11/10_10:10:48 info: client tengine [pid: 966] requests a STONITH operation RESET on node sp1<br>stonithd[964]: 2009/11/10_10:10:48 info: we can't manage sp1, broadcast request to other nodes<br>stonithd[964]: 2009/11/10_10:10:48 info: Broadcasting the message succeeded: require others to stonith node sp1.<br>
stonithd[964]: 2009/11/10_10:11:48 ERROR: Failed to STONITH the node sp1: optype=RESET, op_result=TIMEOUT<br>...<br>crmd[966]: 2009/11/10_10:11:48 info: tengine_stonith_callback: call=-3, optype=1, node_name=sp1, result=2, node_list=, action=36:36:0:3f220573-b709-4243-bf70-5cc5b17c7a57<br>
crmd[966]: 2009/11/10_10:11:48 ERROR: tengine_stonith_callback: Stonith of sp1 failed (2)... aborting transition.<br>and again<br>pengine[1186]: 2009/11/10_10:11:48 WARN: stage6: Scheduling Node sp1 for STONITH<br>stonithd[964]: 2009/11/10_10:11:48 info: client tengine [pid: 966] requests a STONITH operation RESET on node sp1<br>
crmd[966]: 2009/11/10_10:11:48 info: te_fence_node: Executing reboot fencing operation (35) on sp1 (timeout=60000)<br>stonithd[964]: 2009/11/10_10:12:48 ERROR: Failed to STONITH the node sp1: optype=RESET, op_result=TIMEOUT<br>
crmd[966]: 2009/11/10_10:12:48 info: tengine_stonith_callback: call=-4, optype=1, node_name=sp1, result=2, node_list=, action=35:37:0:3f220573-b709-4243-bf70-5cc5b17c7a57<br>crmd[966]: 2009/11/10_10:12:48 ERROR: tengine_stonith_callback: Stonith of sp1 failed (2)... aborting transition.<br>
crmd[966]: 2009/11/10_10:12:48 info: abort_transition_graph: tengine_stonith_callback:391 - Triggered transition abort (complete=0) : Stonith failed<br><br><br>Meanwhile, the logs of the peer node show only:<br>cib[4608]: 2009/11/10_10:08:59 info: retrieveCib: Reading cluster configuration from: /var/lib/heartbeat/crm/cib.xml (digest: /var/lib/heartbeat/crm/cib.xml.sig)<br>
cib[4777]: 2009/11/10_10:11:21 info: write_cib_contents: Wrote version 0.67.0 of the CIB to disk (digest: 93bfe20b495b39ace698b42ce04b8e2c)<br>cib[4777]: 2009/11/10_10:11:21 info: retrieveCib: Reading cluster configuration from: /var/lib/heartbeat/crm/cib.xml (digest: /var/lib/heartbeat/crm/cib.xml.sig)<br>
cib[4796]: 2009/11/10_10:13:42 info: write_cib_contents: Wrote version 0.68.0 of the CIB to disk (digest: 9162eeefc80fe9f83876e6324e58c2c7)<br>cib[4796]: 2009/11/10_10:13:42 info: retrieveCib: Reading cluster configuration from: /var/lib/heartbeat/crm/cib.xml (digest: /var/lib/heartbeat/crm/cib.xml.sig)<br>
cib[3110]: 2009/11/10_10:15:35 info: cib_stats: Processed 11 operations (0.00us average, 0% utilization) in the last 10min<br><br>There is simply no message to stonithd on the peer. Does stonithd depend on anything else on the system? The connection between the heartbeat daemons was working fine.<br>
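For what it's worth, here is a minimal helper sketch we could use to confirm whether a node's stonithd ever logged anything about a fencing target. The function name and the output wording are my own invention; the match patterns are taken from the stonithd messages that did appear on sp1 above, and the log path will depend on your syslog setup:<br>

```shell
# check_stonithd NODE LOGFILE -- hypothetical helper: report whether stonithd
# on this host logged any fencing activity mentioning the given target node.
# Matches lines like:
#   stonithd[964]: ... requests a STONITH operation RESET on node sp1
#   stonithd[964]: ... require others to stonith node sp1.
check_stonithd() {
    node="$1"
    log="$2"
    # First narrow to stonithd lines, then look for a fencing phrase
    # followed by the target node name.
    if grep "stonithd" "$log" | grep -qi "stonith.*${node}"; then
        echo "stonithd activity for ${node} found"
    else
        echo "no stonithd activity for ${node}"
    fi
}
```

Run against the peer's syslog for the 10:10-10:12 window (e.g. `check_stonithd sp1 /var/log/messages`), this reports nothing, which is the puzzle. Independently, the fencing plugin itself can be exercised by hand with the stonith(8) utility, e.g. `stonith -t external/vsp-ipmi -T reset sp1` plus whatever parameters the plugin requires, to rule out the device layer.<br>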
<br># rpm -qa|egrep 'heart|pacem'<br>libheartbeat2-2.99.2-6.1<br>heartbeat-common-2.99.2-6.1<br>heartbeat-resources-2.99.2-6.1<br>heartbeat-2.99.2-6.1<br>pacemaker-1.0.2-11.1<br>libpacemaker3-1.0.2-11.1<br><br># uname -a<br>
Linux sp1 2.6.18-128.7.1.el5xen #1 SMP Mon Aug 24 09:14:33 EDT 2009 x86_64 x86_64 x86_64 GNU/Linux<br>CentOS 5.3<br><br>Thanks,<br><br>Tino<br><br>