[ClusterLabs] stonithd: stonith_choose_peer: Couldn't find anyone to fence <node> with <any>

Thu Aug 13 12:13:52 UTC 2015

On Thu, Aug 13, 2015 at 2:39 PM, Kostiantyn Ponomarenko
<konstantin.ponomarenko at gmail.com> wrote:
> Hi,
>
> Brief description of the STONITH problem:
>
> I see two different behaviors with two different STONITH configurations. If
> Pacemaker cannot find a device that can STONITH a problematic node, the node
> remains up and running. Which is bad, because it must be STONITHed.

Then make sure it can be STONITHed. Add an additional stonith agent that
uses an independent communication channel.
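
As a rough sketch of what that could look like, assuming node-0 has an IPMI
controller reachable over a separate management network: the resource name
STONITH_node-0_ipmi, the address and the credentials below are placeholders,
and the fence_ipmilan parameter names differ between fence-agents versions,
so check the agent's metadata before using them.

```shell
# Hypothetical second fencing device for node-0 on an independent channel.
# Address, login and password are placeholders, not values from this thread.
crm configure primitive STONITH_node-0_ipmi stonith:fence_ipmilan \
        params pcmk_host_list="node-0" pcmk_host_check="static-list" \
               ipaddr="192.0.2.10" login="admin" passwd="secret" \
        op monitor interval="60s"

# Try the existing device first, then fall back to the IPMI device.
# Devices separated by spaces form successive fencing levels.
crm configure fencing_topology node-0: STONITH_node-0 STONITH_node-0_ipmi
```

The topology line is optional; without it stonithd is free to pick any device
that claims it can fence node-0.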

> In contrast, if Pacemaker finds a device that it thinks can STONITH a
> problematic node, even if the device actually cannot, Pacemaker goes down
> after STONITH returns a false positive. Pacemaker shuts itself down right
> after STONITH.

I have no idea what fence_sbb_hw is or does, but it apparently started
rebooting your system. I expect that Pacemaker is stopped as part of the
reboot as well.

> Is it the expected behavior?

The former - sure. The latter - depends on what your stonith agent does.
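
To see which of the two paths stonithd took, it can help to pull the
can_fence_host_with_device decisions out of the log. The helper below is only
a sketch: the function name fence_decisions is made up for illustration, and
it assumes each log entry is on a single line (the excerpts quoted here were
wrapped by the mail client).

```shell
# Print "device|verdict|target|check-mode" for every eligibility decision
# that stonithd logged. $1 is the log file to scan.
fence_decisions() {
    sed -nE 's/.*can_fence_host_with_device:[[:space:]]+([^[:space:]]+) (can not|can) fence \([a-z]+\) ([^:]+): (.*)$/\1|\2|\3|\4/p' "$1"
}
```

Call it as fence_decisions <logfile>, passing whatever file your cluster logs
to. For the two cases in this thread the interesting difference is the last
field: "static-list" in CASE 1, where the device is rejected, and "none" in
CASE 2, where no host check is done and the device is assumed capable.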

> Do I need to configure two more STONITH agents just for rebooting the nodes
> on which they are running (e.g. with # reboot -f)?
>

It is useless in most cases. Fencing is for *other* surviving nodes to
ensure a known state of the suspected node. What this node does by itself
really does not matter. The primary use case is when communication with
the node is lost, at which point there is no way to know whether the node
shut down, rebooted or did anything else on its own.

Suicide may be useful only if there is still a communication path between
the nodes and you are absolutely sure that failure of this communication
path also means failure to start resources. An example is sbd or a similar
quorum disk implementation based on shared storage, where loss of
heartbeat /probably/ means loss of access to storage as well.
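
For orientation only, a minimal sbd set-up might look roughly like the
following. The device path and the resource name STONITH_sbd are placeholders.
It also assumes the external/sbd agent from cluster-glue is installed and that
the sbd daemon with a working watchdog runs on every node; newer stacks may
ship fence_sbd instead and read the device from the sbd sysconfig file rather
than from a resource parameter.

```shell
# Placeholder path: substitute the shared LUN that both nodes can see.
SBD_DEV=/dev/disk/by-id/EXAMPLE-shared-lun

# One-time initialisation. This overwrites the sbd header on the device.
sbd -d "$SBD_DEV" create

# List the node slots on the device to confirm the header is readable.
sbd -d "$SBD_DEV" list

# Fencing resource that delivers the "poison pill" through the shared disk.
crm configure primitive STONITH_sbd stonith:external/sbd \
        params sbd_device="$SBD_DEV"
```

Here the shared disk is both the fencing channel and the resource the nodes
depend on, which is what makes self-fencing on loss of the disk reasonable.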

>
>
> +-------------------------
> + Set-up:
> +-------------------------
> - two node cluster (node-0 and node-1);
> - two fencing (STONITH) agents are configured (STONITH_node-0 and
> STONITH_node-1).
> - "STONITH_node-0" runs only on "node-1" // this fencing agent can only
> fence node-0
> - "STONITH_node-1" runs only on "node-0" // this fencing agent can only
> fence node-1
>
> +-------------------------
> + Environment:
> +-------------------------
> - one node - "node-0" - is up and running;
> - one STONITH agent - "STONITH_node-1" - is up and running
>
> +-------------------------
> + Test case:
> +-------------------------
> Simulate error of stopping a resource.
> 1. start cluster
> 2. change a RA's script to return "$OCF_ERR_GENERIC" from "Stop" function.
> 3. stop the resource by "# crm resource stop <resource>"
>
> +-------------------------
> + Actual behavior:
> +-------------------------
>
>     CASE 1:
> STONITH is configured with:
> # crm configure primitive STONITH_node-1 stonith:fence_sbb_hw \
>         params pcmk_host_list="node-1" pcmk_host_check="static-list"
>
> After issuing a "stop" command:
>     - the resource changes its state to "FAILED"
>     - Pacemaker remains working
>
> See below LOG_snippet_1 section.
>
>
>     CASE 2:
> STONITH is configured with:
> # crm configure primitive STONITH_node-1 stonith:fence_sbb_hw
>
> After issuing a "stop" command:
>     - the resource changes its state to "FAILED"
>     - Pacemaker stops working
>
> See below LOG_snippet_2 section.
>
>
> +-------------------------
> + LOG_snippet_1:
> +-------------------------
> Aug 12 16:42:47 [39206] A6-4U24-402-T   stonithd:   notice: handle_request:
> Client crmd.39210.fa40430f wants to fence (reboot) 'node-0' with device
> '(any)'
> Aug 12 16:42:47 [39206] A6-4U24-402-T   stonithd:   notice:
> initiate_remote_stonith_op:     Initiating remote operation reboot for
> node-0: 18cc29db-b7e4-4994-85f1-df891f091a0d (0)
> ....
> Aug 12 16:42:47 [39206] A6-4U24-402-T   stonithd:   notice:
> can_fence_host_with_device:     STONITH_node-1 can not fence (reboot)
> node-0: static-list
> ....
> Aug 12 16:42:47 [39206] A6-4U24-402-T   stonithd:   notice:
> stonith_choose_peer:    Couldn't find anyone to fence node-0 with <any>
> Aug 12 16:42:47 [39206] A6-4U24-402-T   stonithd:     info:
> call_remote_stonith:    Total remote op timeout set to 60 for fencing of
> node node-0 for crmd.39210.18cc29db
> Aug 12 16:42:47 [39206] A6-4U24-402-T   stonithd:     info:
> call_remote_stonith:    None of the 1 peers have devices capable of
> terminating node-0 for crmd.39210 (0)
> ....
> Aug 12 16:42:47 [39206] A6-4U24-402-T   stonithd:  warning:
> get_xpath_object:   No match for //@st_delegate in /st-reply
> Aug 12 16:42:47 [39206] A6-4U24-402-T   stonithd:    error: remote_op_done:
> Operation reboot of node-0 by node-0 for crmd.39210 at node-0.18cc29db: No such
> device
> ....
> Aug 12 16:42:47 [39210] A6-4U24-402-T       crmd:   notice:
> tengine_stonith_callback:   Stonith operation
> 3/23:16:0:0856a484-6b69-4280-b93f-1af9a6a542ee: No such device (-19)
> Aug 12 16:42:47 [39210] A6-4U24-402-T       crmd:   notice:
> tengine_stonith_callback:   Stonith operation 3 for node-0 failed (No such
> device): aborting transition.
> Aug 12 16:42:47 [39210] A6-4U24-402-T       crmd:     info:
> abort_transition_graph:     Transition aborted: Stonith failed
> (source=tengine_stonith_callback:697, 0)
> Aug 12 16:42:47 [39210] A6-4U24-402-T       crmd:   notice:
> tengine_stonith_notify:     Peer node-0 was not terminated (reboot) by
> node-0 for node-0: No such device
>
>
> +-------------------------
> + LOG_snippet_2:
> +-------------------------
> Aug 11 16:09:42 [9005] A6-4U24-402-T   stonithd:   notice: handle_request:
> Client crmd.9009.cabd2154 wants to fence (reboot) 'node-0' with device
> '(any)'
> Aug 11 16:09:42 [9005] A6-4U24-402-T   stonithd:   notice:
> initiate_remote_stonith_op:  Initiating remote operation reboot for node-0:
> 3b06d3ce-b100-46d7-874e-96f10348d9e4 (0)
> ....
> Aug 11 16:09:42 [9005] A6-4U24-402-T   stonithd:   notice:
> can_fence_host_with_device:  STONITH_node-1 can fence (reboot) node-0: none
> ....
> Aug 11 16:09:42 [9005] A6-4U24-402-T   stonithd:     info:
> call_remote_stonith:     Total remote op timeout set to 60 for fencing of
> node node-0 for crmd.9009.3b06d3ce
> Aug 11 16:09:42 [9005] A6-4U24-402-T   stonithd:     info:
> call_remote_stonith:     Requesting that node-0 perform op reboot node-0 for
> crmd.9009 (72s)
> ....
> Aug 11 16:09:42 [9005] A6-4U24-402-T   stonithd:   notice:
> can_fence_host_with_device:  STONITH_node-1 can fence (reboot) node-0: none
> Aug 11 16:09:42 [9005] A6-4U24-402-T   stonithd:     info:
> stonith_fence_get_devices_cb:    Found 1 matching devices for 'node-0'
> ....
> Aug 11 16:09:53 [9005] A6-4U24-402-T   stonithd:   notice: log_operation:
> Operation 'reboot' [25511] (call 3 from crmd.9009) for host 'node-0' with
> device 'STONITH_node-1' returned: 0 (OK)
> Aug 11 16:09:53 [9005] A6-4U24-402-T   stonithd:  warning: get_xpath_object:
> No match for //@st_delegate in /st-reply
> Aug 11 16:09:53 [9005] A6-4U24-402-T   stonithd:   notice: remote_op_done:
> Operation reboot of node-0 by node-0 for crmd.9009 at node-0.3b06d3ce: OK
> ....
> Aug 11 16:09:53 [9009] A6-4U24-402-T       crmd:   notice:
> tengine_stonith_callback:    Stonith operation
> 3/23:115:0:70ac834e-5b67-4ca6-9080-c98d2b59e2ee: OK (0)
> Aug 11 16:09:53 [9009] A6-4U24-402-T       crmd:     info:
> crm_update_peer_join:    crmd_peer_down: Node node-0[1] - join-2 phase 4 ->
> 0
> Aug 11 16:09:53 [9009] A6-4U24-402-T       crmd:     info:
> crm_update_peer_expected:    crmd_peer_down: Node node-0[1] - expected state
> is now down (was member)
> ....
> Aug 11 16:09:53 [9009] A6-4U24-402-T       crmd:     crit:
> tengine_stonith_notify:  We were alegedly just fenced by node-0 for node-0!
> ....
> Aug 11 16:09:53 [9002] A6-4U24-402-T pacemakerd:    error: pcmk_child_exit:
> Child process crmd (9009) exited: Network is down (100)
> ....
> Aug 11 16:09:53 [9002] A6-4U24-402-T pacemakerd:  warning: pcmk_child_exit:
> Pacemaker child process crmd no longer wishes to be respawned. Shutting
> ourselves down.
> ....
> Aug 11 16:09:53 [9002] A6-4U24-402-T pacemakerd:   notice:
> pcmk_shutdown_worker:    Shuting down Pacemaker
>
>
> Thank you,
> Kostya