[ClusterLabs] [ClusterLabs Developers] When I pull out all heartbeat cables, Active-node and Passive-node are both fenced (reboot) by each other at the same time
Ken Gaillot
kgaillot at redhat.com
Tue Sep 25 10:12:18 EDT 2018
(Moving this to the users at clusterlabs.org list, which is better suited
for it)
This is expected behavior with this configuration. You have several
options to change it:
* The simplest would be to add pcmk_delay_max to the st-lxha
parameters (see the sketch after this list). This inserts a random
delay, up to whatever value you choose, before fencing is executed.
In a split, each side then waits a random amount of time before
fencing, making it unlikely that both will fence at the same time.
* Another common approach is to use two devices (one for each host)
instead of one. You can then put a fixed delay on one of them with
pcmk_delay_base to ensure that they don't fence at the same time,
effectively choosing one node to win any race (also shown in the
sketch below).
* Another option would be to add a third node for quorum only. It
could be a full cluster node that is not allowed to run any resources,
or it could be a lightweight qdevice node (but I think that requires a
newer corosync than you have). This option ensures that a node will
not attempt to fence the other node unless it has connectivity to the
quorum node. (Note that your current no-quorum-policy=ignore would
have to change for quorum to have any effect.)
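
For the first two options, here is a rough sketch of what the
configuration could look like in crm shell syntax. It is only an
illustration: the delay values and the st-lxha-147 / st-lxha-149
resource names are placeholders, so adjust them to your environment.

  # Option 1: random delay on the existing device
  primitive st-lxha stonith:external/ssh \
          params hostlist="linx60147 linx60149" pcmk_delay_max=15s \
          meta target-role=Started is-managed=true

  # Option 2: one device per host, with a fixed delay on the device
  # that fences linx60147, so linx60147 wins any fencing race; the
  # location constraints keep each device off the node it targets
  primitive st-lxha-147 stonith:external/ssh \
          params hostlist="linx60147" pcmk_delay_base=10s
  primitive st-lxha-149 stonith:external/ssh \
          params hostlist="linx60149"
  location st-lxha-147-placement st-lxha-147 -inf: linx60147
  location st-lxha-149-placement st-lxha-149 -inf: linx60149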
FYI, external/ssh is not a reliable fence mechanism, because it will
fail if the target node is unresponsive or unreachable. If these are
physical machines, they likely have IPMI, which would be a better
choice than ssh (a rough sketch follows below), though it still cannot
handle the case where the target node has lost power. Physical
machines also likely have hardware watchdogs, which would be a much
better choice (via sbd); however, that would require either a third
node for quorum or a shared storage device. An intelligent power
switch is another excellent choice.
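
If the machines do have IPMI, a per-host setup could look roughly like
the following. This is only a sketch: it assumes the fence_ipmilan
agent is installed, the BMC addresses and credentials are placeholders,
and parameter names can vary between fence-agents versions.

  primitive st-ipmi-147 stonith:fence_ipmilan \
          params pcmk_host_list=linx60147 ipaddr=<BMC-of-linx60147> \
                 login=<user> passwd=<password> lanplus=1 \
                 pcmk_delay_base=10s
  primitive st-ipmi-149 stonith:fence_ipmilan \
          params pcmk_host_list=linx60149 ipaddr=<BMC-of-linx60149> \
                 login=<user> passwd=<password> lanplus=1
  location st-ipmi-147-placement st-ipmi-147 -inf: linx60147
  location st-ipmi-149-placement st-ipmi-149 -inf: linx60149

The same delay logic as above applies: putting pcmk_delay_base on the
device that fences linx60147 makes linx60147 the likely survivor of a
split, matching your location preferences.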
On Tue, 2018-09-25 at 20:38 +0800, zhongbin wrote:
> Hi,
> I created an Active/Passive cluster on Debian 6.0.
> nodes: linx60147 linx60149
> corosync 2.3.4 + pacemaker 1.1.17
>
> crm configure show:
>
> node 3232244115: linx60147 \
> attributes standby=off
> node 3232244117: linx60149 \
> attributes standby=off
> primitive rsc-cpu ocf:pacemaker:HealthCPU \
> params yellow_limit=60 red_limit=20 \
> op monitor interval=30s timeout=3m \
> op start interval=0 timeout=3m \
> op stop interval=0 timeout=3m \
> meta target-role=Started
> primitive rsc-vip-public IPaddr \
> op monitor interval=30s timeout=3m start-delay=15 \
> op start interval=0 timeout=3m \
> op stop interval=0 timeout=3m \
> params ip=192.168.22.224 cidr_netmask=255.255.255.0 \
> meta migration-threshold=10
> primitive st-lxha stonith:external/ssh \
> params hostlist="linx60147 linx60149" \
> meta target-role=Started is-managed=true
> group rsc-group rsc-vip-public rsc-cpu \
> meta target-role=Started
> location rsc-loc1 rsc-group 200: linx60147
> location rsc-loc2 rsc-group 100: linx60149
> location rsc-loc3 st-lxha 100: linx60147
> location rsc-loc4 st-lxha 200: linx60149
> property cib-bootstrap-options: \
> have-watchdog=false \
> dc-version=1.1.17-b36b869ca8 \
> cluster-infrastructure=corosync \
> expected-quorum-votes=2 \
> start-failure-is-fatal=false \
> stonith-enabled=true \
> stonith-action=reboot \
> no-quorum-policy=ignore \
> last-lrm-refresh=1536225282
>
> When I pull out all heartbeat cables, the Active node and Passive node
> are both fenced (rebooted) by each other at the same time.
>
> linx60147 corosync.log:
>
> Sep 25 19:34:08 [2198] linx60147 pengine: notice:
> unpack_config: On loss of CCM Quorum: Ignore
> Sep 25 19:34:08 [2198] linx60147 pengine: warning:
> pe_fence_node: Cluster node linx60149 will be fenced: peer is no
> longer part of the cluster
> Sep 25 19:34:08 [2198] linx60147 pengine: warning:
> determine_online_status: Node linx60149 is unclean
> Sep 25 19:34:08 [2198] linx60147 pengine: info:
> determine_online_status_fencing: Node linx60147 is active
> Sep 25 19:34:08 [2198] linx60147 pengine: info:
> determine_online_status: Node linx60147 is online
> Sep 25 19:34:08 [2198] linx60147 pengine: info:
> unpack_node_loop: Node 3232244117 is already processed
> Sep 25 19:34:08 [2198] linx60147 pengine: info:
> unpack_node_loop: Node 3232244115 is already processed
> Sep 25 19:34:08 [2198] linx60147 pengine: info:
> unpack_node_loop: Node 3232244117 is already processed
> Sep 25 19:34:08 [2198] linx60147 pengine: info:
> unpack_node_loop: Node 3232244115 is already processed
> Sep 25 19:34:08 [2198] linx60147 pengine: info: group_print:
> Resource Group: rsc-group
> Sep 25 19:34:08 [2198] linx60147 pengine: info: common_print:
> rsc-vip-public (ocf::heartbeat:IPaddr): Started
> linx60147
> Sep 25 19:34:08 [2198] linx60147 pengine: info: common_print:
> rsc-cpu (ocf::pacemaker:HealthCPU): Started linx60147
> Sep 25 19:34:08 [2198] linx60147 pengine: info: common_print:
> st-lxha (stonith:external/ssh): Started linx60149 (UNCLEAN)
> Sep 25 19:34:08 [2198] linx60147 pengine: warning:
> custom_action: Action st-lxha_stop_0 on linx60149 is unrunnable
> (offline)
> Sep 25 19:34:08 [2198] linx60147 pengine: warning: stage6:
> Scheduling Node linx60149 for STONITH
> Sep 25 19:34:08 [2198] linx60147 pengine: info:
> native_stop_constraints: st-lxha_stop_0 is implicit after linx60149
> is fenced
> Sep 25 19:34:08 [2198] linx60147 pengine: notice:
> LogNodeActions: * Fence linx60149
> Sep 25 19:34:08 [2198] linx60147 pengine: info: LogActions:
> Leave rsc-vip-public (Started linx60147)
> Sep 25 19:34:08 [2198] linx60147 pengine: info: LogActions:
> Leave rsc-cpu (Started linx60147)
> Sep 25 19:34:08 [2198] linx60147 pengine: notice: LogActions:
> Move st-lxha (Started linx60149 -> linx60147)
> Sep 25 19:34:08 [2198] linx60147 pengine: warning:
> process_pe_message: Calculated transition 2 (with warnings),
> saving inputs in /var/lib/pacemaker/pengine/pe-warn-64.bz2
> Sep 25 19:34:08 [2199] linx60147 crmd: info:
> do_state_transition: State transition S_POLICY_ENGINE ->
> S_TRANSITION_ENGINE | input=I_PE_SUCCESS cause=C_IPC_MESSAGE
> origin=handle_response
> Sep 25 19:34:08 [2199] linx60147 crmd: info: do_te_invoke:
> Processing graph 2 (ref=pe_calc-dc-1537875248-29) derived from
> /var/lib/pacemaker/pengine/pe-warn-64.bz2
> Sep 25 19:34:08 [2199] linx60147 crmd: notice:
> te_fence_node: Requesting fencing (reboot) of node linx60149 |
> action=15 timeout=60000
> Sep 25 19:34:08 [2194] linx60147 stonith-ng: notice:
> handle_request: Client crmd.2199.76b55dfe wants to fence (reboot)
> 'linx60149' with device '(any)'
> Sep 25 19:34:08 [2194] linx60147 stonith-ng: notice:
> initiate_remote_stonith_op: Requesting peer fencing (reboot) of
> linx60149 | id=07b318da-0c28-476a-a9f3-d73d7a5142dc state=0
> Sep 25 19:34:08 [2199] linx60147 crmd: notice:
> te_rsc_command: Initiating start operation st-lxha_start_0 locally
> on linx60147 | action 13
> Sep 25 19:34:08 [2199] linx60147 crmd: info:
> do_lrm_rsc_op: Performing key=13:2:0:05c1e621-d48e-4854-a666-
> 4c664da9e32d op=st-lxha_start_0
> Sep 25 19:34:08 [2195] linx60147 lrmd: info: log_execute:
> executing - rsc:st-lxha action:start call_id:18
> Sep 25 19:34:08 [2194] linx60147 stonith-ng: info:
> dynamic_list_search_cb: Refreshing port list for st-lxha
> Sep 25 19:34:08 [2194] linx60147 stonith-ng: info:
> process_remote_stonith_query: Query result 1 of 1 from linx60147
> for linx60149/reboot (1 devices) 07b318da-0c28-476a-a9f3-d73d7a5142dc
> Sep 25 19:34:08 [2194] linx60147 stonith-ng: info:
> process_remote_stonith_query: All query replies have arrived,
> continuing (1 expected/1 received)
> Sep 25 19:34:08 [2194] linx60147 stonith-ng: info:
> call_remote_stonith: Total timeout set to 60 for peer's fencing
> of linx60149 for crmd.2199|id=07b318da-0c28-476a-a9f3-d73d7a5142dc
> Sep 25 19:34:08 [2194] linx60147 stonith-ng: info:
> call_remote_stonith: Requesting that 'linx60147' perform op
> 'linx60149 reboot' for crmd.2199 (72s, 0s)
> Sep 25 19:34:08 [2194] linx60147 stonith-ng: notice:
> can_fence_host_with_device: st-lxha can fence (reboot)
> linx60149: dynamic-list
> Sep 25 19:34:08 [2194] linx60147 stonith-ng: info:
> stonith_fence_get_devices_cb: Found 1 matching devices for
> 'linx60149'
> Sep 25 19:34:09 [2195] linx60147 lrmd: info: log_finished:
> finished - rsc:st-lxha action:start call_id:18 exit-code:0 exec-
> time:1024ms queue-time:0ms
> Sep 25 19:34:09 [2199] linx60147 crmd: notice:
> process_lrm_event: Result of start operation for st-lxha on
> linx60147: 0 (ok) | call=18 key=st-lxha_start_0 confirmed=true cib-
> update=51
> Sep 25 19:34:09 [2193] linx60147 cib: info:
> cib_process_request: Forwarding cib_modify operation for section
> status to all (origin=local/crmd/51)
> Sep 25 19:34:09 [2193] linx60147 cib: info:
> cib_perform_op: Diff: --- 0.102.21 2
> Sep 25 19:34:09 [2193] linx60147 cib: info:
> cib_perform_op: Diff: +++ 0.102.22 (null)
> Sep 25 19:34:09 [2193] linx60147 cib: info:
> cib_perform_op: + /cib: @num_updates=22
> Sep 25 19:34:09 [2193] linx60147 cib: info:
> cib_perform_op: + /cib/status/node_state[@id='3232244115']: @crm-
> debug-origin=do_update_resource
> Sep 25 19:34:09 [2193] linx60147 cib: info:
> cib_perform_op: +
> /cib/status/node_state[@id='3232244115']/lrm[@id='3232244115']/lrm_re
> sources/lrm_resource[@id='st-lxha']/lrm_rsc_op[@id='st-
> lxha_last_0']: @operation_key=st-lxha_start_0, @operation=start,
> @transition-key=13:2:0:05c1e621-d48e-4854-a666-4c664da9e32d,
> @transition-magic=0:0;13:2:0:05c1e621-d48e-4854-a666-4c664da9e32d,
> @call-id=18, @rc-code=0, @last-run=1537875248, @last-rc-
> change=1537875248, @exec-time=1024
> Sep 25 19:34:09 [2193] linx60147 cib: info:
> cib_process_request: Completed cib_modify operation for section
> status: OK (rc=0, origin=linx60147/crmd/51, version=0.102.22)
> Sep 25 19:34:09 [2199] linx60147 crmd: info:
> match_graph_event: Action st-lxha_start_0 (13) confirmed on
> linx60147 (rc=0)
>
> linx60149 corosync.log:
>
> Sep 25 19:34:07 [2144] linx60149 pengine: notice:
> unpack_config: On loss of CCM Quorum: Ignore
> Sep 25 19:34:07 [2144] linx60149 pengine: info:
> determine_online_status_fencing: Node linx60149 is active
> Sep 25 19:34:07 [2144] linx60149 pengine: info:
> determine_online_status: Node linx60149 is online
> Sep 25 19:34:07 [2144] linx60149 pengine: warning:
> pe_fence_node: Cluster node linx60147 will be fenced: peer is no
> longer part of the cluster
> Sep 25 19:34:07 [2144] linx60149 pengine: warning:
> determine_online_status: Node linx60147 is unclean
> Sep 25 19:34:07 [2144] linx60149 pengine: info:
> unpack_node_loop: Node 3232244117 is already processed
> Sep 25 19:34:07 [2144] linx60149 pengine: info:
> unpack_node_loop: Node 3232244115 is already processed
> Sep 25 19:34:07 [2144] linx60149 pengine: info:
> unpack_node_loop: Node 3232244117 is already processed
> Sep 25 19:34:07 [2144] linx60149 pengine: info:
> unpack_node_loop: Node 3232244115 is already processed
> Sep 25 19:34:07 [2144] linx60149 pengine: info: group_print:
> Resource Group: rsc-group
> Sep 25 19:34:07 [2144] linx60149 pengine: info: common_print:
> rsc-vip-public (ocf::heartbeat:IPaddr): Started
> linx60147 (UNCLEAN)
> Sep 25 19:34:07 [2144] linx60149 pengine: info: common_print:
> rsc-cpu (ocf::pacemaker:HealthCPU): Started linx60147
> (UNCLEAN)
> Sep 25 19:34:07 [2144] linx60149 pengine: info: common_print:
> st-lxha (stonith:external/ssh): Started linx60149
> Sep 25 19:34:07 [2144] linx60149 pengine: warning:
> custom_action: Action rsc-vip-public_stop_0 on linx60147 is
> unrunnable (offline)
> Sep 25 19:34:07 [2144] linx60149 pengine: info: RecurringOp:
> Start recurring monitor (30s) for rsc-vip-public on linx60149
> Sep 25 19:34:07 [2144] linx60149 pengine: warning:
> custom_action: Action rsc-cpu_stop_0 on linx60147 is unrunnable
> (offline)
> Sep 25 19:34:07 [2144] linx60149 pengine: info: RecurringOp:
> Start recurring monitor (30s) for rsc-cpu on linx60149
> Sep 25 19:34:07 [2144] linx60149 pengine: warning: stage6:
> Scheduling Node linx60147 for STONITH
> Sep 25 19:34:07 [2144] linx60149 pengine: info:
> native_stop_constraints: rsc-vip-public_stop_0 is implicit after
> linx60147 is fenced
> Sep 25 19:34:07 [2144] linx60149 pengine: info:
> native_stop_constraints: rsc-cpu_stop_0 is implicit after linx60147
> is fenced
> Sep 25 19:34:07 [2144] linx60149 pengine: notice:
> LogNodeActions: * Fence linx60147
> Sep 25 19:34:07 [2144] linx60149 pengine: notice: LogActions:
> Move rsc-vip-public (Started linx60147 -> linx60149)
> Sep 25 19:34:07 [2144] linx60149 pengine: notice: LogActions:
> Move rsc-cpu (Started linx60147 -> linx60149)
> Sep 25 19:34:07 [2144] linx60149 pengine: info: LogActions:
> Leave st-lxha (Started linx60149)
> Sep 25 19:34:07 [2144] linx60149 pengine: warning:
> process_pe_message: Calculated transition 0 (with warnings),
> saving inputs in /var/lib/pacemaker/pengine/pe-warn-52.bz2
> Sep 25 19:34:07 [2145] linx60149 crmd: info:
> do_state_transition: State transition S_POLICY_ENGINE ->
> S_TRANSITION_ENGINE | input=I_PE_SUCCESS cause=C_IPC_MESSAGE
> origin=handle_response
> Sep 25 19:34:07 [2145] linx60149 crmd: info: do_te_invoke:
> Processing graph 0 (ref=pe_calc-dc-1537875247-15) derived from
> /var/lib/pacemaker/pengine/pe-warn-52.bz2
> Sep 25 19:34:07 [2145] linx60149 crmd: notice:
> te_fence_node: Requesting fencing (reboot) of node linx60147 |
> action=15 timeout=60000
> Sep 25 19:34:07 [2141] linx60149 stonith-ng: notice:
> handle_request: Client crmd.2145.321125df wants to fence (reboot)
> 'linx60147' with device '(any)'
> Sep 25 19:34:07 [2141] linx60149 stonith-ng: notice:
> initiate_remote_stonith_op: Requesting peer fencing (reboot) of
> linx60147 | id=05d67c3b-8ff2-4e8d-b56f-abb305d3133c state=0
> Sep 25 19:34:07 [2141] linx60149 stonith-ng: info:
> dynamic_list_search_cb: Refreshing port list for st-lxha
> Sep 25 19:34:07 [2141] linx60149 stonith-ng: info:
> process_remote_stonith_query: Query result 1 of 1 from linx60149
> for linx60147/reboot (1 devices) 05d67c3b-8ff2-4e8d-b56f-abb305d3133c
> Sep 25 19:34:07 [2141] linx60149 stonith-ng: info:
> call_remote_stonith: Total timeout set to 60 for peer's fencing
> of linx60147 for crmd.2145|id=05d67c3b-8ff2-4e8d-b56f-abb305d3133c
> Sep 25 19:34:07 [2141] linx60149 stonith-ng: info:
> call_remote_stonith: Requesting that 'linx60149' perform op
> 'linx60147 reboot' for crmd.2145 (72s, 0s)
> Sep 25 19:34:07 [2141] linx60149 stonith-ng: notice:
> can_fence_host_with_device: st-lxha can fence (reboot)
> linx60147: dynamic-list
> Sep 25 19:34:07 [2141] linx60149 stonith-ng: info:
> stonith_fence_get_devices_cb: Found 1 matching devices for
> 'linx60147'
>
> Is this cluster behavior normal, or is there an error in my configuration?
> How can I avoid it?
>
> Thanks,
>
> zhongbin
>
>
>
--
Ken Gaillot <kgaillot at redhat.com>