[Pacemaker] Pacemaker doesn't actually call STONITH, instead it stops itself

Kostiantyn Ponomarenko konstantin.ponomarenko at gmail.com
Wed Feb 11 11:33:31 EST 2015


Hi all,

I've run into a problem where a node is not actually rebooted when a
resource fails to stop on it.
The fence agent is self-written, and it works in the case of a network
outage and in all other cases.
I went through all the logs on both nodes and couldn't understand why
node-0 is not actually rebooted.
I would appreciate some help here.
Below are two brief snippets from the logs (the most interesting ones, from
my point of view).
I have also attached the log files and a screenshot of "crm_mon" on node-1.


The problem:
-------------------
"stop" action for "sm1dh" fails on "node-0", and "node-0" is not actually
rebooted by "node-1".


The setup:
---------------
There are two nodes: node-0 and node-1.


Fence agents are configured with:
-------------------------------------
crm configure primitive STONITH_node-1 stonith:fence_avid_sbb_hw
crm configure primitive STONITH_node-0 stonith:fence_avid_sbb_hw \
params delay="10"

crm configure location dont_run_STONITH_node-1_on_node-1 STONITH_node-1 -inf: node-1
crm configure location dont_run_STONITH_node-0_on_node-0 STONITH_node-0 -inf: node-0


A few lines from /var/log/cluster/corosync.log on "node-0":
-------------------------------------------------------------------------------------------
Feb 10 19:29:40 [3204] isis-seth943f    pengine:     info: native_print:      sm1dh   (ocf::avid:diskHelper): FAILED node-0
...
Feb 10 19:29:40 [3201] isis-seth943f   stonithd:   notice: handle_request: Client crmd.3205.09022f74 wants to fence (reboot) 'node-0' with device '(any)'
Feb 10 19:29:40 [3201] isis-seth943f   stonithd:   notice: initiate_remote_stonith_op: Initiating remote operation reboot for node-0: 51063a89-0df0-4dd7-8f22-667ca5db05f0 (0)
Feb 10 19:29:41 [3201] isis-seth943f   stonithd:     info: process_remote_stonith_query: Query result 2 of 2 from node-1 for node-0/reboot (1 devices) 51063a89-0df0-4dd7-8f22-667ca5db05f0
...
Feb 10 19:29:51 [3205] isis-seth943f       crmd:     crit: tengine_stonith_notify: We were alegedly just fenced by node-1 for node-0!
...
Feb 10 19:29:51 [3198] isis-seth943f pacemakerd:    error: pcmk_child_exit: Child process crmd (3205) exited: Network is down (100)
Feb 10 19:29:51 [3198] isis-seth943f pacemakerd:  warning: pcmk_child_exit: Pacemaker child process crmd no longer wishes to be respawned. Shutting ourselves down


A few lines from /var/log/cluster/corosync.log on "node-1":
-------------------------------------------------------------------------------------------
Feb 10 19:28:15 [3184] isis-seth944b   stonithd:   notice: log_operation: Operation 'reboot' [4596] (call 2 from crmd.3205) for host 'node-0' with device 'STONITH_node-0' returned: 0 (OK)
Feb 10 19:28:15 [3184] isis-seth944b   stonithd:  warning: get_xpath_object: No match for //@st_delegate in /st-reply
Feb 10 19:28:15 [3184] isis-seth944b   stonithd:   notice: remote_op_done: Operation reboot of node-0 by node-1 for crmd.3205@node-0.51063a89: OK
Feb 10 19:28:15 [3188] isis-seth944b       crmd:   notice: tengine_stonith_notify: Peer node-0 was terminated (reboot) by node-1 for node-0: OK (ref=51063a89-0df0-4dd7-8f22-667ca5db05f0) by client crmd.3205
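
When comparing entries like these across the two nodes, it can help to split
each one into its fields first. The following is only a rough Python sketch:
it assumes every entry sits on a single line (as in the log file itself, not
as wrapped by the mail client) and follows the layout
"time [pid] host daemon: level: function: message" seen in the excerpts. The
names LOG_LINE and parse_entry are made up for illustration and are not part
of Pacemaker.

```python
import re

# Assumed layout: "<Mon DD HH:MM:SS> [pid] host daemon: level: function: message"
LOG_LINE = re.compile(
    r"^(?P<time>\w{3}\s+\d+ \d{2}:\d{2}:\d{2}) \[(?P<pid>\d+)\] (?P<host>\S+)\s+"
    r"(?P<daemon>\w+):\s+(?P<level>\w+):\s+(?P<function>\w+):\s*(?P<message>.*)$"
)


def parse_entry(line):
    """Return the fields of one log entry as a dict, or None if it does not match."""
    match = LOG_LINE.match(line)
    return match.groupdict() if match else None


# Sample entry taken from the node-1 excerpt, joined onto one line.
sample = (
    "Feb 10 19:28:15 [3184] isis-seth944b   stonithd:   notice: "
    "log_operation: Operation 'reboot' [4596] (call 2 from crmd.3205) "
    "for host 'node-0' with device 'STONITH_node-0' returned: 0 (OK)"
)
entry = parse_entry(sample)
print(entry["daemon"], entry["level"], entry["function"])
```

Lines that do not fit the assumed layout (such as the "..." placeholders)
simply come back as None, so they can be skipped when filtering for stonithd
or crmd entries.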


Time difference between the nodes (sorry for that):
------------------------------------------------------------------------
node-0: t
node-1: t - 97 seconds
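
To line the two excerpts up on one timeline, the node-1 timestamps can be
shifted by that offset. Here is a minimal Python sketch, assuming the 97
seconds stay constant over the period in question and that the year is 2015
(the log timestamps do not carry a year). The names NODE1_OFFSET and
to_node0_time are hypothetical helpers, not anything from Pacemaker.

```python
from datetime import datetime, timedelta

# From the post: node-1's clock reads 97 seconds less than node-0's clock.
NODE1_OFFSET = timedelta(seconds=97)


def to_node0_time(stamp, year=2015):
    """Convert a node-1 log timestamp such as 'Feb 10 19:28:15' to node-0 time."""
    # The year is an assumption; it is only needed so the timestamp parses fully.
    parsed = datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")
    return (parsed + NODE1_OFFSET).strftime("%b %d %H:%M:%S")


print(to_node0_time("Feb 10 19:28:15"))  # prints "Feb 10 19:29:52"
```

With this shift, the "returned: 0 (OK)" entry that node-1 logged at 19:28:15
lands at about 19:29:52 on node-0's clock, which is right around the
19:29:51 entries in the node-0 excerpt (the offset is only approximate).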


Thank you,
Kostya
-------------- next part --------------
A non-text attachment was scrubbed...
Name: crm_mon.png
Type: image/png
Size: 34900 bytes
Desc: not available
URL: <http://lists.clusterlabs.org/pipermail/pacemaker/attachments/20150211/a6eed335/attachment-0002.png>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: node-0-brief-summary
Type: application/octet-stream
Size: 11118 bytes
Desc: not available
URL: <http://lists.clusterlabs.org/pipermail/pacemaker/attachments/20150211/a6eed335/attachment-0008.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: node-0-logs
Type: application/octet-stream
Size: 72387 bytes
Desc: not available
URL: <http://lists.clusterlabs.org/pipermail/pacemaker/attachments/20150211/a6eed335/attachment-0009.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: node-1-brief-summary
Type: application/octet-stream
Size: 3440 bytes
Desc: not available
URL: <http://lists.clusterlabs.org/pipermail/pacemaker/attachments/20150211/a6eed335/attachment-0010.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: node-1-logs
Type: application/octet-stream
Size: 112893 bytes
Desc: not available
URL: <http://lists.clusterlabs.org/pipermail/pacemaker/attachments/20150211/a6eed335/attachment-0011.obj>

