[ClusterLabs] stonithd: stonith_choose_peer: Couldn't find anyone to fence <node> with <any>

Kostiantyn Ponomarenko konstantin.ponomarenko at gmail.com
Fri Aug 21 09:01:43 EDT 2015


Hi Andrew,

>> Recent versions allow this depending on what the configured fencing
devices report.
So the device should be configured to report that it can STONITH the node
on which it is running?

>> You left out “but the device reports that it did”.  Your fencing agent
needs to report the truth.
Yes, that was a hole in the configuration - I didn't specify
pcmk_host_list="node-1" and pcmk_host_check="static-list".
But the "safety check" that you mentioned before worked perfectly, so I
didn't notice my mistake in the configuration.

Now I see that I shouldn't rely on the "safety check" and should have
a proper STONITH configuration.
The thing is, I am trying to understand how I should modify my config.
In a two-node cluster it is possible to end up with only one node
running, and in that case my current STONITH agents, restricted by
"pcmk_host_list" and "pcmk_host_check", won't work.
Worse, it leads to a situation where the surviving node cannot find a
device to reboot itself (say "stop" failed), so it keeps running even
though it must be STONITHed.
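
For context, the two STONITH resources are currently pinned away from the
nodes they fence, roughly like this (just a crm shell sketch - my real
constraints may look slightly different):

# crm configure location loc-STONITH_node-0 STONITH_node-0 -inf: node-0
# crm configure location loc-STONITH_node-1 STONITH_node-1 -inf: node-1

So in the scenario above, with only node-0 up, nothing is eligible to
fence node-0 itself.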

Could you help me find the best way to do "self-stonithing"?
Would it be sufficient to create another STONITH agent which will issue
"reboot -f"?


Thank you,
Kostya

On Mon, Aug 17, 2015 at 1:15 AM, Andrew Beekhof <andrew at beekhof.net> wrote:

>
> > On 13 Aug 2015, at 9:39 pm, Kostiantyn Ponomarenko <
> konstantin.ponomarenko at gmail.com> wrote:
> >
> > Hi,
> >
> > Brief description of the STONITH problem:
> >
> > I see two different behaviors with two different STONITH configurations.
> If Pacemaker cannot find a device that can STONITH a problematic node, the
> node remains up and running, which is bad, because it must be STONITHed.
> > In contrast, if Pacemaker finds a device that it thinks can
> STONITH a problematic node, even if the device actually cannot,
>
> You left out “but the device reports that it did”.  Your fencing agent
> needs to report the truth.
>
> > Pacemaker goes down after STONITH returns a false positive. Pacemaker
> shuts itself down right after STONITH.
> > Is this the expected behavior?
>
> Yes, it's a safety check:
>
>     Aug 11 16:09:53 [9009] A6-4U24-402-T       crmd:     crit:
> tengine_stonith_notify:  We were alegedly just fenced by node-0 for node-0!
>
>
> > Do I need to configure two more STONITH agents just for rebooting the
> nodes on which they are running (e.g. with # reboot -f)?
> >
> >
> >
> > +-------------------------
> > + Set-up:
> > +-------------------------
> > - two node cluster (node-0 and node-1);
> > - two fencing (STONITH) agents are configured (STONITH_node-0 and
> STONITH_node-1).
> > - "STONITH_node-0" runs only on "node-1" // this fencing agent can only
> fence node-0
> > - "STONITH_node-1" runs only on "node-0" // this fencing agent can only
> fence node-1
> >
> > +-------------------------
> > + Environment:
> > +-------------------------
> > - one node - "node-0" - is up and running;
> > - one STONITH agent - "STONITH_node-1" - is up and running
> >
> > +-------------------------
> > + Test case:
> > +-------------------------
> > Simulate a failure when stopping a resource.
> > 1. start the cluster
> > 2. change an RA's script to return "$OCF_ERR_GENERIC" from its "stop"
> function.
> > 3. stop the resource with "# crm resource stop <resource>"
> >
> > +-------------------------
> > + Actual behavior:
> > +-------------------------
> >
> >     CASE 1:
> > STONITH is configured with:
> > # crm configure primitive STONITH_node-1 stonith:fence_sbb_hw \
> >         params pcmk_host_list="node-1" pcmk_host_check="static-list"
> >
> > After issuing a "stop" command:
> >     - the resource changes its state to "FAILED"
> >     - Pacemaker remains working
> >
> > See below LOG_snippet_1 section.
> >
> >
> >     CASE 2:
> > STONITH is configured with:
> > # crm configure primitive STONITH_node-1 stonith:fence_sbb_hw
> >
> > After issuing a "stop" command:
> >     - the resource changes its state to "FAILED"
> >     - Pacemaker stops working
> >
> > See below LOG_snippet_2 section.
> >
> >
> > +-------------------------
> > + LOG_snippet_1:
> > +-------------------------
> > Aug 12 16:42:47 [39206] A6-4U24-402-T   stonithd:   notice:
> handle_request:     Client crmd.39210.fa40430f wants to fence (reboot)
> 'node-0' with device '(any)'
> > Aug 12 16:42:47 [39206] A6-4U24-402-T   stonithd:   notice:
> initiate_remote_stonith_op:     Initiating remote operation reboot for
> node-0: 18cc29db-b7e4-4994-85f1-df891f091a0d (0)
> > ....
> > Aug 12 16:42:47 [39206] A6-4U24-402-T   stonithd:   notice:
> can_fence_host_with_device:     STONITH_node-1 can not fence (reboot)
> node-0: static-list
> > ....
> > Aug 12 16:42:47 [39206] A6-4U24-402-T   stonithd:   notice:
> stonith_choose_peer:    Couldn't find anyone to fence node-0 with <any>
> > Aug 12 16:42:47 [39206] A6-4U24-402-T   stonithd:     info:
> call_remote_stonith:    Total remote op timeout set to 60 for fencing of
> node node-0 for crmd.39210.18cc29db
> > Aug 12 16:42:47 [39206] A6-4U24-402-T   stonithd:     info:
> call_remote_stonith:    None of the 1 peers have devices capable of
> terminating node-0 for crmd.39210 (0)
> > ....
> > Aug 12 16:42:47 [39206] A6-4U24-402-T   stonithd:  warning:
> get_xpath_object:   No match for //@st_delegate in /st-reply
> > Aug 12 16:42:47 [39206] A6-4U24-402-T   stonithd:    error:
> remote_op_done:     Operation reboot of node-0 by node-0 for
> crmd.39210 at node-0.18cc29db: No such device
> > ....
> > Aug 12 16:42:47 [39210] A6-4U24-402-T       crmd:   notice:
> tengine_stonith_callback:   Stonith operation
> 3/23:16:0:0856a484-6b69-4280-b93f-1af9a6a542ee: No such device (-19)
> > Aug 12 16:42:47 [39210] A6-4U24-402-T       crmd:   notice:
> tengine_stonith_callback:   Stonith operation 3 for node-0 failed (No such
> device): aborting transition.
> > Aug 12 16:42:47 [39210] A6-4U24-402-T       crmd:     info:
> abort_transition_graph:     Transition aborted: Stonith failed
> (source=tengine_stonith_callback:697, 0)
> > Aug 12 16:42:47 [39210] A6-4U24-402-T       crmd:   notice:
> tengine_stonith_notify:     Peer node-0 was not terminated (reboot) by
> node-0 for node-0: No such device
> >
> >
> > +-------------------------
> > + LOG_snippet_2:
> > +-------------------------
> > Aug 11 16:09:42 [9005] A6-4U24-402-T   stonithd:   notice:
> handle_request:  Client crmd.9009.cabd2154 wants to fence (reboot) 'node-0'
> with device '(any)'
> > Aug 11 16:09:42 [9005] A6-4U24-402-T   stonithd:   notice:
> initiate_remote_stonith_op:  Initiating remote operation reboot for node-0:
> 3b06d3ce-b100-46d7-874e-96f10348d9e4 (0)
> > ....
> > Aug 11 16:09:42 [9005] A6-4U24-402-T   stonithd:   notice:
> can_fence_host_with_device:  STONITH_node-1 can fence (reboot) node-0: none
> > ....
> > Aug 11 16:09:42 [9005] A6-4U24-402-T   stonithd:     info:
> call_remote_stonith:     Total remote op timeout set to 60 for fencing of
> node node-0 for crmd.9009.3b06d3ce
> > Aug 11 16:09:42 [9005] A6-4U24-402-T   stonithd:     info:
> call_remote_stonith:     Requesting that node-0 perform op reboot node-0
> for crmd.9009 (72s)
> > ....
> > Aug 11 16:09:42 [9005] A6-4U24-402-T   stonithd:   notice:
> can_fence_host_with_device:  STONITH_node-1 can fence (reboot) node-0: none
> > Aug 11 16:09:42 [9005] A6-4U24-402-T   stonithd:     info:
> stonith_fence_get_devices_cb:    Found 1 matching devices for 'node-0'
> > ....
> > Aug 11 16:09:53 [9005] A6-4U24-402-T   stonithd:   notice:
> log_operation:   Operation 'reboot' [25511] (call 3 from crmd.9009) for
> host 'node-0' with device 'STONITH_node-1' returned: 0 (OK)
> > Aug 11 16:09:53 [9005] A6-4U24-402-T   stonithd:  warning:
> get_xpath_object:    No match for //@st_delegate in /st-reply
> > Aug 11 16:09:53 [9005] A6-4U24-402-T   stonithd:   notice:
> remote_op_done:  Operation reboot of node-0 by node-0 for
> crmd.9009 at node-0.3b06d3ce: OK
> > ....
> > Aug 11 16:09:53 [9009] A6-4U24-402-T       crmd:   notice:
> tengine_stonith_callback:    Stonith operation
> 3/23:115:0:70ac834e-5b67-4ca6-9080-c98d2b59e2ee: OK (0)
> > Aug 11 16:09:53 [9009] A6-4U24-402-T       crmd:     info:
> crm_update_peer_join:    crmd_peer_down: Node node-0[1] - join-2 phase 4 ->
> 0
> > Aug 11 16:09:53 [9009] A6-4U24-402-T       crmd:     info:
> crm_update_peer_expected:    crmd_peer_down: Node node-0[1] - expected
> state is now down (was member)
> > ....
> > Aug 11 16:09:53 [9009] A6-4U24-402-T       crmd:     crit:
> tengine_stonith_notify:  We were alegedly just fenced by node-0 for node-0!
> > ....
> > Aug 11 16:09:53 [9002] A6-4U24-402-T pacemakerd:    error:
> pcmk_child_exit:     Child process crmd (9009) exited: Network is down (100)
> > ....
> > Aug 11 16:09:53 [9002] A6-4U24-402-T pacemakerd:  warning:
> pcmk_child_exit:     Pacemaker child process crmd no longer wishes to be
> respawned. Shutting ourselves down.
> > ....
> > Aug 11 16:09:53 [9002] A6-4U24-402-T pacemakerd:   notice:
> pcmk_shutdown_worker:    Shuting down Pacemaker
> >
> >
> > Thank you,
> > Kostya