[ClusterLabs] stonithd: stonith_choose_peer: Couldn't find anyone to fence <node> with <any>

Thu Aug 13 07:39:36 EDT 2015

Hi,

Brief description of the STONITH problem:

I see two different behaviors with two different STONITH configurations.

If Pacemaker cannot find a device that can STONITH a problematic node, the
node remains up and running. That is bad, because the node must be
STONITHed.

In contrast, if Pacemaker finds a device that it thinks can STONITH the
problematic node, even though the device actually cannot, Pacemaker goes
down after STONITH returns a false positive: it shuts itself down right
after the STONITH operation.

Is this the expected behavior?
Do I need to configure two more STONITH agents just for rebooting the nodes
on which they are running (e.g. with # reboot -f)?

+-------------------------
+ Set-up:
+-------------------------
- two-node cluster (node-0 and node-1);
- two fencing (STONITH) agents are configured (STONITH_node-0 and
STONITH_node-1).
- "STONITH_node-0" runs only on "node-1" // this fencing agent can only
fence node-0
- "STONITH_node-1" runs only on "node-0" // this fencing agent can only
fence node-1

+-------------------------
+ Environment:
+-------------------------
- one node - "node-0" - is up and running;
- one STONITH agent - "STONITH_node-1" - is up and running

+-------------------------
+ Test case:
+-------------------------
Simulate a failure to stop a resource.
1. start the cluster
2. change an RA's script to return "$OCF_ERR_GENERIC" from its "stop" function.
3. stop the resource with "# crm resource stop <resource>"
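
Step 2 can be sketched roughly as follows. This is a minimal, hypothetical
resource agent fragment, not the actual RA used in the test: the names
my_resource_stop and ra_main are made up for illustration. Only the numeric
values of the OCF return codes (OCF_SUCCESS=0, OCF_ERR_GENERIC=1) are taken
from the OCF resource agent API; a real agent would normally get them by
sourcing ocf-shellfuncs instead of defining them itself.

```shell
#!/bin/sh
# OCF return codes, defined inline here so the sketch is self-contained.
# (Real agents usually source ocf-shellfuncs to get these.)
OCF_SUCCESS=0
OCF_ERR_GENERIC=1

# Hypothetical "stop" action, modified for the test so that it always
# reports a failure instead of stopping anything.
my_resource_stop() {
    return $OCF_ERR_GENERIC
}

# Hypothetical dispatcher: only "stop" is modelled, every other action
# simply reports success.
ra_main() {
    case "$1" in
        stop) my_resource_stop ;;
        *)    return $OCF_SUCCESS ;;
    esac
}
```

With this in place, "ra_main stop" returns 1, which is what the cluster sees
as a failed stop. As far as I understand, a failed stop is what makes
Pacemaker request fencing of the node when STONITH is enabled.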

+-------------------------
+ Actual behavior:
+-------------------------

    CASE 1:
STONITH is configured with:
# crm configure primitive STONITH_node-1 stonith:fence_sbb_hw \
        params pcmk_host_list="node-1" pcmk_host_check="static-list"

After issuing a "stop" command:
    - the resource changes its state to "FAILED"
    - Pacemaker remains working

See the LOG_snippet_1 section below.

    CASE 2:
STONITH is configured with:
# crm configure primitive STONITH_node-1 stonith:fence_sbb_hw

After issuing a "stop" command:
    - the resource changes its state to "FAILED"
    - Pacemaker stops working

See the LOG_snippet_2 section below.
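
The difference between the two cases appears to come down to the host check
reported at the end of the can_fence_host_with_device log lines
("static-list" in CASE 1, "none" in CASE 2). The following is a simplified
model of that decision, not stonithd's actual code; the function name
can_fence and its argument order are assumptions made for illustration.

```shell
#!/bin/sh
# Simplified model of the "can this device fence that node?" decision.
#   $1 = pcmk_host_check value
#   $2 = pcmk_host_list (space-separated node names, may be empty)
#   $3 = name of the node to be fenced
# Returns 0 if the device is considered able to fence the node,
# 1 if it is not, 2 if the check type is not modelled here.
can_fence() {
    check=$1
    hosts=$2
    target=$3
    case "$check" in
        none)
            # No check at all: the device is assumed to be able to
            # fence any node.
            return 0 ;;
        static-list)
            # Only nodes named in pcmk_host_list are accepted.
            for h in $hosts; do
                [ "$h" = "$target" ] && return 0
            done
            return 1 ;;
        *)
            # dynamic-list and status ask the device itself; that is
            # outside the scope of this sketch.
            return 2 ;;
    esac
}
```

With the CASE 1 settings, can_fence static-list "node-1" node-0 returns 1,
matching "can not fence (reboot) node-0: static-list". With the CASE 2 check,
can_fence none "" node-0 returns 0, matching "can fence (reboot) node-0:
none", even though the device is only able to fence node-1.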

+-------------------------
+ LOG_snippet_1:
+-------------------------
Aug 12 16:42:47 [39206] A6-4U24-402-T   stonithd:   notice: handle_request:     Client crmd.39210.fa40430f wants to fence (reboot) 'node-0' with device '(any)'
Aug 12 16:42:47 [39206] A6-4U24-402-T   stonithd:   notice: initiate_remote_stonith_op:     Initiating remote operation reboot for node-0: 18cc29db-b7e4-4994-85f1-df891f091a0d (0)
....
Aug 12 16:42:47 [39206] A6-4U24-402-T   stonithd:   notice: can_fence_host_with_device:     STONITH_node-1 can not fence (reboot) node-0: static-list
....
Aug 12 16:42:47 [39206] A6-4U24-402-T   stonithd:   notice: stonith_choose_peer:    Couldn't find anyone to fence node-0 with <any>
Aug 12 16:42:47 [39206] A6-4U24-402-T   stonithd:     info: call_remote_stonith:    Total remote op timeout set to 60 for fencing of node node-0 for crmd.39210.18cc29db
Aug 12 16:42:47 [39206] A6-4U24-402-T   stonithd:     info: call_remote_stonith:    None of the 1 peers have devices capable of terminating node-0 for crmd.39210 (0)
....
Aug 12 16:42:47 [39206] A6-4U24-402-T   stonithd:  warning: get_xpath_object:   No match for //@st_delegate in /st-reply
Aug 12 16:42:47 [39206] A6-4U24-402-T   stonithd:    error: remote_op_done:     Operation reboot of node-0 by node-0 for crmd.39210 at node-0.18cc29db: No such device
....
Aug 12 16:42:47 [39210] A6-4U24-402-T       crmd:   notice: tengine_stonith_callback:   Stonith operation 3/23:16:0:0856a484-6b69-4280-b93f-1af9a6a542ee: No such device (-19)
Aug 12 16:42:47 [39210] A6-4U24-402-T       crmd:   notice: tengine_stonith_callback:   Stonith operation 3 for node-0 failed (No such device): aborting transition.
Aug 12 16:42:47 [39210] A6-4U24-402-T       crmd:     info: abort_transition_graph:     Transition aborted: Stonith failed (source=tengine_stonith_callback:697, 0)
Aug 12 16:42:47 [39210] A6-4U24-402-T       crmd:   notice: tengine_stonith_notify:     Peer node-0 was not terminated (reboot) by node-0 for node-0: No such device

+-------------------------
+ LOG_snippet_2:
+-------------------------
Aug 11 16:09:42 [9005] A6-4U24-402-T   stonithd:   notice: handle_request:  Client crmd.9009.cabd2154 wants to fence (reboot) 'node-0' with device '(any)'
Aug 11 16:09:42 [9005] A6-4U24-402-T   stonithd:   notice: initiate_remote_stonith_op:  Initiating remote operation reboot for node-0: 3b06d3ce-b100-46d7-874e-96f10348d9e4 (0)
....
Aug 11 16:09:42 [9005] A6-4U24-402-T   stonithd:   notice: can_fence_host_with_device:  STONITH_node-1 can fence (reboot) node-0: none
....
Aug 11 16:09:42 [9005] A6-4U24-402-T   stonithd:     info: call_remote_stonith:     Total remote op timeout set to 60 for fencing of node node-0 for crmd.9009.3b06d3ce
Aug 11 16:09:42 [9005] A6-4U24-402-T   stonithd:     info: call_remote_stonith:     Requesting that node-0 perform op reboot node-0 for crmd.9009 (72s)
....
Aug 11 16:09:42 [9005] A6-4U24-402-T   stonithd:   notice: can_fence_host_with_device:  STONITH_node-1 can fence (reboot) node-0: none
Aug 11 16:09:42 [9005] A6-4U24-402-T   stonithd:     info: stonith_fence_get_devices_cb:    Found 1 matching devices for 'node-0'
....
Aug 11 16:09:53 [9005] A6-4U24-402-T   stonithd:   notice: log_operation: Operation 'reboot' [25511] (call 3 from crmd.9009) for host 'node-0' with device 'STONITH_node-1' returned: 0 (OK)
Aug 11 16:09:53 [9005] A6-4U24-402-T   stonithd:  warning: get_xpath_object:    No match for //@st_delegate in /st-reply
Aug 11 16:09:53 [9005] A6-4U24-402-T   stonithd:   notice: remote_op_done:  Operation reboot of node-0 by node-0 for crmd.9009 at node-0.3b06d3ce: OK
....
Aug 11 16:09:53 [9009] A6-4U24-402-T       crmd:   notice: tengine_stonith_callback:    Stonith operation 3/23:115:0:70ac834e-5b67-4ca6-9080-c98d2b59e2ee: OK (0)
Aug 11 16:09:53 [9009] A6-4U24-402-T       crmd:     info: crm_update_peer_join:    crmd_peer_down: Node node-0[1] - join-2 phase 4 -> 0
Aug 11 16:09:53 [9009] A6-4U24-402-T       crmd:     info: crm_update_peer_expected:    crmd_peer_down: Node node-0[1] - expected state is now down (was member)
....
Aug 11 16:09:53 [9009] A6-4U24-402-T       crmd:     crit: tengine_stonith_notify:  We were alegedly just fenced by node-0 for node-0!
....
Aug 11 16:09:53 [9002] A6-4U24-402-T pacemakerd:    error: pcmk_child_exit:     Child process crmd (9009) exited: Network is down (100)
....
Aug 11 16:09:53 [9002] A6-4U24-402-T pacemakerd:  warning: pcmk_child_exit:     Pacemaker child process crmd no longer wishes to be respawned. Shutting ourselves down.
....
Aug 11 16:09:53 [9002] A6-4U24-402-T pacemakerd:   notice: pcmk_shutdown_worker:    Shuting down Pacemaker

Thank you,
Kostya