<div dir="ltr">Hi,<div><br></div><div>Brief description of the STONITH problem:</div><div><br></div><div>I see two different behaviors with two different STONITH configurations. If Pacemaker cannot find a device that can STONITH a problematic node, the node remains up and running, which is bad because it must be STONITHed.</div><div>In contrast, if Pacemaker finds a device that it thinks can STONITH the problematic node, even though the device actually cannot, Pacemaker goes down after STONITH returns a false positive: Pacemaker shuts itself down right after the STONITH operation.</div><div>Is this the expected behavior?</div><div>Do I need to configure two more STONITH agents just for rebooting the nodes on which they are running (e.g. with # reboot -f)?</div><div><br></div><div><br></div><div><br></div><div>+-------------------------<br></div><div>+ Set-up:</div><div>+-------------------------<br></div><div>- two-node cluster (node-0 and node-1);</div><div>- two fencing (STONITH) agents are configured (STONITH_node-0 and STONITH_node-1);</div><div>- "STONITH_node-0" runs only on "node-1" // this fencing agent can only fence node-0</div><div>- "STONITH_node-1" runs only on "node-0" // this fencing agent can only fence node-1</div><div><br></div><div>+-------------------------<br></div><div>+ Environment:</div><div>+-------------------------<br></div><div>- one node - "node-0" - is up and running;</div><div>- one STONITH agent - "STONITH_node-1" - is up and running</div><div><br></div><div>+-------------------------<br></div><div>+ Test case:</div><div>+-------------------------<br></div><div>Simulate a failure to stop a resource.</div><div>1. start the cluster</div><div>2. change an RA's script to return "$OCF_ERR_GENERIC" from its "Stop" function</div><div>3. 
stop the resource with "# crm resource stop &lt;resource&gt;"</div><div><br></div><div>+-------------------------<br></div><div>+ Actual behavior:</div><div>+-------------------------<br></div><div><br></div><div> CASE 1:</div><div>STONITH is configured with:</div><div><div># crm configure primitive STONITH_node-1 stonith:fence_sbb_hw \</div><div> params pcmk_host_list="node-1" pcmk_host_check="static-list"</div></div><div><br></div><div>After issuing the "stop" command:</div><div> - the resource changes its state to "FAILED"</div><div> - Pacemaker keeps running</div><div><br></div><div>See the LOG_snippet_1 section below.</div><div><br></div><div><br></div><div><div> CASE 2:</div><div>STONITH is configured with:</div><div><div># crm configure primitive STONITH_node-1 stonith:fence_sbb_hw</div></div></div><div><div><br></div><div>After issuing the "stop" command:</div><div> - the resource changes its state to "FAILED"</div><div> - Pacemaker stops running</div></div><div><br></div><div>See the LOG_snippet_2 section below.<br></div><div><br></div><div><br></div><div>+-------------------------<br></div><div>+ LOG_snippet_1:<br></div><div>+-------------------------</div><div><div>Aug 12 16:42:47 [39206] A6-4U24-402-T stonithd: notice: handle_request: Client crmd.39210.fa40430f wants to fence (reboot) 'node-0' with device '(any)'</div><div>Aug 12 16:42:47 [39206] A6-4U24-402-T stonithd: notice: initiate_remote_stonith_op: Initiating remote operation reboot for node-0: 18cc29db-b7e4-4994-85f1-df891f091a0d (0)</div><div>....</div><div>Aug 12 16:42:47 [39206] A6-4U24-402-T stonithd: notice: can_fence_host_with_device: STONITH_node-1 can not fence (reboot) node-0: static-list</div><div>....</div><div>Aug 12 16:42:47 [39206] A6-4U24-402-T stonithd: notice: stonith_choose_peer: Couldn't find anyone to fence node-0 with &lt;any&gt;</div><div>Aug 12 16:42:47 [39206] A6-4U24-402-T stonithd: info: call_remote_stonith: Total remote op timeout set to 60 for fencing of node node-0 for 
crmd.39210.18cc29db</div><div>Aug 12 16:42:47 [39206] A6-4U24-402-T stonithd: info: call_remote_stonith: None of the 1 peers have devices capable of terminating node-0 for crmd.39210 (0)</div><div>....</div><div>Aug 12 16:42:47 [39206] A6-4U24-402-T stonithd: warning: get_xpath_object: No match for //@st_delegate in /st-reply</div><div>Aug 12 16:42:47 [39206] A6-4U24-402-T stonithd: error: remote_op_done: Operation reboot of node-0 by node-0 for crmd.39210@node-0.18cc29db: No such device</div><div>....</div><div>Aug 12 16:42:47 [39210] A6-4U24-402-T crmd: notice: tengine_stonith_callback: Stonith operation 3/23:16:0:0856a484-6b69-4280-b93f-1af9a6a542ee: No such device (-19)</div><div>Aug 12 16:42:47 [39210] A6-4U24-402-T crmd: notice: tengine_stonith_callback: Stonith operation 3 for node-0 failed (No such device): aborting transition.</div><div>Aug 12 16:42:47 [39210] A6-4U24-402-T crmd: info: abort_transition_graph: Transition aborted: Stonith failed (source=tengine_stonith_callback:697, 0)</div><div>Aug 12 16:42:47 [39210] A6-4U24-402-T crmd: notice: tengine_stonith_notify: Peer node-0 was not terminated (reboot) by node-0 for node-0: No such device</div></div><div><br></div><div><br></div><div>+-------------------------<br></div><div>+ LOG_snippet_2:<br></div><div>+-------------------------<br></div><div><div>Aug 11 16:09:42 [9005] A6-4U24-402-T stonithd: notice: handle_request: Client crmd.9009.cabd2154 wants to fence (reboot) 'node-0' with device '(any)'</div><div>Aug 11 16:09:42 [9005] A6-4U24-402-T stonithd: notice: initiate_remote_stonith_op: Initiating remote operation reboot for node-0: 3b06d3ce-b100-46d7-874e-96f10348d9e4 (0)</div><div>....</div><div>Aug 11 16:09:42 [9005] A6-4U24-402-T stonithd: notice: can_fence_host_with_device: STONITH_node-1 can fence (reboot) node-0: none</div><div>....</div><div>Aug 11 16:09:42 [9005] A6-4U24-402-T stonithd: info: call_remote_stonith: Total remote op timeout set to 60 for fencing of node node-0 for 
crmd.9009.3b06d3ce</div><div>Aug 11 16:09:42 [9005] A6-4U24-402-T stonithd: info: call_remote_stonith: Requesting that node-0 perform op reboot node-0 for crmd.9009 (72s)</div><div>....</div><div>Aug 11 16:09:42 [9005] A6-4U24-402-T stonithd: notice: can_fence_host_with_device: STONITH_node-1 can fence (reboot) node-0: none</div><div>Aug 11 16:09:42 [9005] A6-4U24-402-T stonithd: info: stonith_fence_get_devices_cb: Found 1 matching devices for 'node-0'</div><div>....</div><div>Aug 11 16:09:53 [9005] A6-4U24-402-T stonithd: notice: log_operation: Operation 'reboot' [25511] (call 3 from crmd.9009) for host 'node-0' with device 'STONITH_node-1' returned: 0 (OK)</div><div>Aug 11 16:09:53 [9005] A6-4U24-402-T stonithd: warning: get_xpath_object: No match for //@st_delegate in /st-reply</div><div>Aug 11 16:09:53 [9005] A6-4U24-402-T stonithd: notice: remote_op_done: Operation reboot of node-0 by node-0 for crmd.9009@node-0.3b06d3ce: OK</div><div>....</div><div>Aug 11 16:09:53 [9009] A6-4U24-402-T crmd: notice: tengine_stonith_callback: Stonith operation 3/23:115:0:70ac834e-5b67-4ca6-9080-c98d2b59e2ee: OK (0)</div><div>Aug 11 16:09:53 [9009] A6-4U24-402-T crmd: info: crm_update_peer_join: crmd_peer_down: Node node-0[1] - join-2 phase 4 -> 0</div><div>Aug 11 16:09:53 [9009] A6-4U24-402-T crmd: info: crm_update_peer_expected: crmd_peer_down: Node node-0[1] - expected state is now down (was member)</div><div>....</div><div>Aug 11 16:09:53 [9009] A6-4U24-402-T crmd: crit: tengine_stonith_notify: We were alegedly just fenced by node-0 for node-0!</div><div>....</div><div>Aug 11 16:09:53 [9002] A6-4U24-402-T pacemakerd: error: pcmk_child_exit: Child process crmd (9009) exited: Network is down (100)</div><div>....</div><div>Aug 11 16:09:53 [9002] A6-4U24-402-T pacemakerd: warning: pcmk_child_exit: Pacemaker child process crmd no longer wishes to be respawned. 
Shutting ourselves down.</div><div>....</div><div>Aug 11 16:09:53 [9002] A6-4U24-402-T pacemakerd: notice: pcmk_shutdown_worker: Shuting down Pacemaker</div></div><div><br></div><div><br clear="all"><div><div><div dir="ltr">Thank you,<div>Kostya</div></div></div></div>
</div></div>