[ClusterLabs] [ClusterLabs Developers] When I pull out all heartbeat cables, Active-node and Passive-node are both fenced (reboot) by each other at the same time
Ken Gaillot
kgaillot at redhat.com
Tue Sep 25 10:12:18 EDT 2018
(Moving this to the users at clusterlabs.org list, which is better suited
for it)
This is expected behavior with this configuration. You have several
options to change it:
* The simplest would be to add pcmk_delay_max to the st-lxha
parameters (see the sketch after this list). This inserts a random
delay, up to whatever value you choose, before fencing is executed.
In a split, each side then waits a random amount of time before
fencing, making it unlikely that both will fence at the same time.
* Another common approach is to use two devices (one for each host)
instead of one. You can then put a fixed delay on one of them with
pcmk_delay_base to ensure that they don't fence at the same time,
effectively choosing one node to win any race (also shown in the
sketch below).
* Another option would be to add a third node for quorum only. It
could be a full cluster node that is not allowed to run any resources,
or it could be a lightweight qdevice node (but I think that requires a
newer corosync than you have). This option ensures that a node will
not attempt to fence the other node unless it has connectivity to the
quorum node. (Note that your current no-quorum-policy=ignore would
have to change for quorum to have any effect.)
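
For the first two options, here is a rough sketch of what the
configuration could look like in crm shell syntax. It is only an
illustration: the delay values and the st-lxha-147 / st-lxha-149
resource names are placeholders, so adjust them to your environment.

  # Option 1: random delay on the existing device
  primitive st-lxha stonith:external/ssh \
          params hostlist="linx60147 linx60149" pcmk_delay_max=15s \
          meta target-role=Started is-managed=true

  # Option 2: one device per host, with a fixed delay on the device
  # that fences linx60147, so linx60147 wins any fencing race; the
  # location constraints keep each device off the node it targets
  primitive st-lxha-147 stonith:external/ssh \
          params hostlist="linx60147" pcmk_delay_base=10s
  primitive st-lxha-149 stonith:external/ssh \
          params hostlist="linx60149"
  location st-lxha-147-placement st-lxha-147 -inf: linx60147
  location st-lxha-149-placement st-lxha-149 -inf: linx60149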
FYI, external/ssh is not a reliable fence mechanism, because it will
fail if the target node is unresponsive or unreachable. If these are
physical machines, they likely have IPMI, which would be a better
choice than ssh (a rough sketch follows below), though it still cannot
handle the case where the target node has lost power. Physical
machines also likely have hardware watchdogs, which would be a much
better choice (via sbd); however, that would require either a third
node for quorum or a shared storage device. An intelligent power
switch is another excellent choice.
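
If the machines do have IPMI, a per-host setup could look roughly like
the following. This is only a sketch: it assumes the fence_ipmilan
agent is installed, the BMC addresses and credentials are placeholders,
and parameter names can vary between fence-agents versions.

  primitive st-ipmi-147 stonith:fence_ipmilan \
          params pcmk_host_list=linx60147 ipaddr=<BMC-of-linx60147> \
                 login=<user> passwd=<password> lanplus=1 \
                 pcmk_delay_base=10s
  primitive st-ipmi-149 stonith:fence_ipmilan \
          params pcmk_host_list=linx60149 ipaddr=<BMC-of-linx60149> \
                 login=<user> passwd=<password> lanplus=1
  location st-ipmi-147-placement st-ipmi-147 -inf: linx60147
  location st-ipmi-149-placement st-ipmi-149 -inf: linx60149

The same delay logic as above applies: putting pcmk_delay_base on the
device that fences linx60147 makes linx60147 the likely survivor of a
split, matching your location preferences.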
On Tue, 2018-09-25 at 20:38 +0800, zhongbin wrote:
> Hi,
> I created an Active/Passive cluster on Debian 6.0.
> nodes: linx60147 linx60149
> corosync 2.3.4 + pacemaker 1.1.17
>
> crm configure show:
>
> node 3232244115: linx60147 \
> attributes standby=off
> node 3232244117: linx60149 \
> attributes standby=off
> primitive rsc-cpu ocf:pacemaker:HealthCPU \
> params yellow_limit=60 red_limit=20 \
> op monitor interval=30s timeout=3m \
> op start interval=0 timeout=3m \
> op stop interval=0 timeout=3m \
> meta target-role=Started
> primitive rsc-vip-public IPaddr \
> op monitor interval=30s timeout=3m start-delay=15 \
> op start interval=0 timeout=3m \
> op stop interval=0 timeout=3m \
> params ip=192.168.22.224 cidr_netmask=255.255.255.0 \
> meta migration-threshold=10
> primitive st-lxha stonith:external/ssh \
> params hostlist="linx60147 linx60149" \
> meta target-role=Started is-managed=true
> group rsc-group rsc-vip-public rsc-cpu \
> meta target-role=Started
> location rsc-loc1 rsc-group 200: linx60147
> location rsc-loc2 rsc-group 100: linx60149
> location rsc-loc3 st-lxha 100: linx60147
> location rsc-loc4 st-lxha 200: linx60149
> property cib-bootstrap-options: \
> have-watchdog=false \
> dc-version=1.1.17-b36b869ca8 \
> cluster-infrastructure=corosync \
> expected-quorum-votes=2 \
> start-failure-is-fatal=false \
> stonith-enabled=true \
> stonith-action=reboot \
> no-quorum-policy=ignore \
> last-lrm-refresh=1536225282
>
> When I pull out all heartbeat cables, the Active node and Passive node
> are both fenced (rebooted) by each other at the same time.
>
> linx60147 corosync.log:
>
> Sep 25 19:34:08 [2198] linx60147 pengine: notice:
> unpack_config: On loss of CCM Quorum: Ignore
> Sep 25 19:34:08 [2198] linx60147 pengine: warning:
> pe_fence_node: Cluster node linx60149 will be fenced: peer is no
> longer part of the cluster
> Sep 25 19:34:08 [2198] linx60147 pengine: warning:
> determine_online_status: Node linx60149 is unclean
> Sep 25 19:34:08 [2198] linx60147 pengine: info:
> determine_online_status_fencing: Node linx60147 is active
> Sep 25 19:34:08 [2198] linx60147 pengine: info:
> determine_online_status: Node linx60147 is online
> Sep 25 19:34:08 [2198] linx60147 pengine: info:
> unpack_node_loop: Node 3232244117 is already processed
> Sep 25 19:34:08 [2198] linx60147 pengine: info:
> unpack_node_loop: Node 3232244115 is already processed
> Sep 25 19:34:08 [2198] linx60147 pengine: info:
> unpack_node_loop: Node 3232244117 is already processed
> Sep 25 19:34:08 [2198] linx60147 pengine: info:
> unpack_node_loop: Node 3232244115 is already processed
> Sep 25 19:34:08 [2198] linx60147 pengine: info: group_print:
> Resource Group: rsc-group
> Sep 25 19:34:08 [2198] linx60147 pengine: info: common_print:
> rsc-vip-public (ocf::heartbeat:IPaddr): Started
> linx60147
> Sep 25 19:34:08 [2198] linx60147 pengine: info: common_print:
> rsc-cpu (ocf::pacemaker:HealthCPU): Started linx60147
> Sep 25 19:34:08 [2198] linx60147 pengine: info: common_print:
> st-lxha (stonith:external/ssh): Started linx60149 (UNCLEAN)
> Sep 25 19:34:08 [2198] linx60147 pengine: warning:
> custom_action: Action st-lxha_stop_0 on linx60149 is unrunnable
> (offline)
> Sep 25 19:34:08 [2198] linx60147 pengine: warning: stage6:
> Scheduling Node linx60149 for STONITH
> Sep 25 19:34:08 [2198] linx60147 pengine: info:
> native_stop_constraints: st-lxha_stop_0 is implicit after linx60149
> is fenced
> Sep 25 19:34:08 [2198] linx60147 pengine: notice:
> LogNodeActions: * Fence linx60149
> Sep 25 19:34:08 [2198] linx60147 pengine: info: LogActions:
> Leave rsc-vip-public (Started linx60147)
> Sep 25 19:34:08 [2198] linx60147 pengine: info: LogActions:
> Leave rsc-cpu (Started linx60147)
> Sep 25 19:34:08 [2198] linx60147 pengine: notice: LogActions:
> Move st-lxha (Started linx60149 -> linx60147)
> Sep 25 19:34:08 [2198] linx60147 pengine: warning:
> process_pe_message: Calculated transition 2 (with warnings),
> saving inputs in /var/lib/pacemaker/pengine/pe-warn-64.bz2
> Sep 25 19:34:08 [2199] linx60147 crmd: info:
> do_state_transition: State transition S_POLICY_ENGINE ->
> S_TRANSITION_ENGINE | input=I_PE_SUCCESS cause=C_IPC_MESSAGE
> origin=handle_response
> Sep 25 19:34:08 [2199] linx60147 crmd: info: do_te_invoke:
> Processing graph 2 (ref=pe_calc-dc-1537875248-29) derived from
> /var/lib/pacemaker/pengine/pe-warn-64.bz2
> Sep 25 19:34:08 [2199] linx60147 crmd: notice:
> te_fence_node: Requesting fencing (reboot) of node linx60149 |
> action=15 timeout=60000
> Sep 25 19:34:08 [2194] linx60147 stonith-ng: notice:
> handle_request: Client crmd.2199.76b55dfe wants to fence (reboot)
> 'linx60149' with device '(any)'
> Sep 25 19:34:08 [2194] linx60147 stonith-ng: notice:
> initiate_remote_stonith_op: Requesting peer fencing (reboot) of
> linx60149 | id=07b318da-0c28-476a-a9f3-d73d7a5142dc state=0
> Sep 25 19:34:08 [2199] linx60147 crmd: notice:
> te_rsc_command: Initiating start operation st-lxha_start_0 locally
> on linx60147 | action 13
> Sep 25 19:34:08 [2199] linx60147 crmd: info:
> do_lrm_rsc_op: Performing key=13:2:0:05c1e621-d48e-4854-a666-
> 4c664da9e32d op=st-lxha_start_0
> Sep 25 19:34:08 [2195] linx60147 lrmd: info: log_execute:
> executing - rsc:st-lxha action:start call_id:18
> Sep 25 19:34:08 [2194] linx60147 stonith-ng: info:
> dynamic_list_search_cb: Refreshing port list for st-lxha
> Sep 25 19:34:08 [2194] linx60147 stonith-ng: info:
> process_remote_stonith_query: Query result 1 of 1 from linx60147
> for linx60149/reboot (1 devices) 07b318da-0c28-476a-a9f3-d73d7a5142dc
> Sep 25 19:34:08 [2194] linx60147 stonith-ng: info:
> process_remote_stonith_query: All query replies have arrived,
> continuing (1 expected/1 received)
> Sep 25 19:34:08 [2194] linx60147 stonith-ng: info:
> call_remote_stonith: Total timeout set to 60 for peer's fencing
> of linx60149 for crmd.2199|id=07b318da-0c28-476a-a9f3-d73d7a5142dc
> Sep 25 19:34:08 [2194] linx60147 stonith-ng: info:
> call_remote_stonith: Requesting that 'linx60147' perform op
> 'linx60149 reboot' for crmd.2199 (72s, 0s)
> Sep 25 19:34:08 [2194] linx60147 stonith-ng: notice:
> can_fence_host_with_device: st-lxha can fence (reboot)
> linx60149: dynamic-list
> Sep 25 19:34:08 [2194] linx60147 stonith-ng: info:
> stonith_fence_get_devices_cb: Found 1 matching devices for
> 'linx60149'
> Sep 25 19:34:09 [2195] linx60147 lrmd: info: log_finished:
> finished - rsc:st-lxha action:start call_id:18 exit-code:0 exec-
> time:1024ms queue-time:0ms
> Sep 25 19:34:09 [2199] linx60147 crmd: notice:
> process_lrm_event: Result of start operation for st-lxha on
> linx60147: 0 (ok) | call=18 key=st-lxha_start_0 confirmed=true cib-
> update=51
> Sep 25 19:34:09 [2193] linx60147 cib: info:
> cib_process_request: Forwarding cib_modify operation for section
> status to all (origin=local/crmd/51)
> Sep 25 19:34:09 [2193] linx60147 cib: info:
> cib_perform_op: Diff: --- 0.102.21 2
> Sep 25 19:34:09 [2193] linx60147 cib: info:
> cib_perform_op: Diff: +++ 0.102.22 (null)
> Sep 25 19:34:09 [2193] linx60147 cib: info:
> cib_perform_op: + /cib: @num_updates=22
> Sep 25 19:34:09 [2193] linx60147 cib: info:
> cib_perform_op: + /cib/status/node_state[@id='3232244115']: @crm-
> debug-origin=do_update_resource
> Sep 25 19:34:09 [2193] linx60147 cib: info:
> cib_perform_op: +
> /cib/status/node_state[@id='3232244115']/lrm[@id='3232244115']/lrm_re
> sources/lrm_resource[@id='st-lxha']/lrm_rsc_op[@id='st-
> lxha_last_0']: @operation_key=st-lxha_start_0, @operation=start,
> @transition-key=13:2:0:05c1e621-d48e-4854-a666-4c664da9e32d,
> @transition-magic=0:0;13:2:0:05c1e621-d48e-4854-a666-4c664da9e32d,
> @call-id=18, @rc-code=0, @last-run=1537875248, @last-rc-
> change=1537875248, @exec-time=1024
> Sep 25 19:34:09 [2193] linx60147 cib: info:
> cib_process_request: Completed cib_modify operation for section
> status: OK (rc=0, origin=linx60147/crmd/51, version=0.102.22)
> Sep 25 19:34:09 [2199] linx60147 crmd: info:
> match_graph_event: Action st-lxha_start_0 (13) confirmed on
> linx60147 (rc=0)
>
> linx60149 corosync.log:
>
> Sep 25 19:34:07 [2144] linx60149 pengine: notice:
> unpack_config: On loss of CCM Quorum: Ignore
> Sep 25 19:34:07 [2144] linx60149 pengine: info:
> determine_online_status_fencing: Node linx60149 is active
> Sep 25 19:34:07 [2144] linx60149 pengine: info:
> determine_online_status: Node linx60149 is online
> Sep 25 19:34:07 [2144] linx60149 pengine: warning:
> pe_fence_node: Cluster node linx60147 will be fenced: peer is no
> longer part of the cluster
> Sep 25 19:34:07 [2144] linx60149 pengine: warning:
> determine_online_status: Node linx60147 is unclean
> Sep 25 19:34:07 [2144] linx60149 pengine: info:
> unpack_node_loop: Node 3232244117 is already processed
> Sep 25 19:34:07 [2144] linx60149 pengine: info:
> unpack_node_loop: Node 3232244115 is already processed
> Sep 25 19:34:07 [2144] linx60149 pengine: info:
> unpack_node_loop: Node 3232244117 is already processed
> Sep 25 19:34:07 [2144] linx60149 pengine: info:
> unpack_node_loop: Node 3232244115 is already processed
> Sep 25 19:34:07 [2144] linx60149 pengine: info: group_print:
> Resource Group: rsc-group
> Sep 25 19:34:07 [2144] linx60149 pengine: info: common_print:
> rsc-vip-public (ocf::heartbeat:IPaddr): Started
> linx60147 (UNCLEAN)
> Sep 25 19:34:07 [2144] linx60149 pengine: info: common_print:
> rsc-cpu (ocf::pacemaker:HealthCPU): Started linx60147
> (UNCLEAN)
> Sep 25 19:34:07 [2144] linx60149 pengine: info: common_print:
> st-lxha (stonith:external/ssh): Started linx60149
> Sep 25 19:34:07 [2144] linx60149 pengine: warning:
> custom_action: Action rsc-vip-public_stop_0 on linx60147 is
> unrunnable (offline)
> Sep 25 19:34:07 [2144] linx60149 pengine: info: RecurringOp:
> Start recurring monitor (30s) for rsc-vip-public on linx60149
> Sep 25 19:34:07 [2144] linx60149 pengine: warning:
> custom_action: Action rsc-cpu_stop_0 on linx60147 is unrunnable
> (offline)
> Sep 25 19:34:07 [2144] linx60149 pengine: info: RecurringOp:
> Start recurring monitor (30s) for rsc-cpu on linx60149
> Sep 25 19:34:07 [2144] linx60149 pengine: warning: stage6:
> Scheduling Node linx60147 for STONITH
> Sep 25 19:34:07 [2144] linx60149 pengine: info:
> native_stop_constraints: rsc-vip-public_stop_0 is implicit after
> linx60147 is fenced
> Sep 25 19:34:07 [2144] linx60149 pengine: info:
> native_stop_constraints: rsc-cpu_stop_0 is implicit after linx60147
> is fenced
> Sep 25 19:34:07 [2144] linx60149 pengine: notice:
> LogNodeActions: * Fence linx60147
> Sep 25 19:34:07 [2144] linx60149 pengine: notice: LogActions:
> Move rsc-vip-public (Started linx60147 -> linx60149)
> Sep 25 19:34:07 [2144] linx60149 pengine: notice: LogActions:
> Move rsc-cpu (Started linx60147 -> linx60149)
> Sep 25 19:34:07 [2144] linx60149 pengine: info: LogActions:
> Leave st-lxha (Started linx60149)
> Sep 25 19:34:07 [2144] linx60149 pengine: warning:
> process_pe_message: Calculated transition 0 (with warnings),
> saving inputs in /var/lib/pacemaker/pengine/pe-warn-52.bz2
> Sep 25 19:34:07 [2145] linx60149 crmd: info:
> do_state_transition: State transition S_POLICY_ENGINE ->
> S_TRANSITION_ENGINE | input=I_PE_SUCCESS cause=C_IPC_MESSAGE
> origin=handle_response
> Sep 25 19:34:07 [2145] linx60149 crmd: info: do_te_invoke:
> Processing graph 0 (ref=pe_calc-dc-1537875247-15) derived from
> /var/lib/pacemaker/pengine/pe-warn-52.bz2
> Sep 25 19:34:07 [2145] linx60149 crmd: notice:
> te_fence_node: Requesting fencing (reboot) of node linx60147 |
> action=15 timeout=60000
> Sep 25 19:34:07 [2141] linx60149 stonith-ng: notice:
> handle_request: Client crmd.2145.321125df wants to fence (reboot)
> 'linx60147' with device '(any)'
> Sep 25 19:34:07 [2141] linx60149 stonith-ng: notice:
> initiate_remote_stonith_op: Requesting peer fencing (reboot) of
> linx60147 | id=05d67c3b-8ff2-4e8d-b56f-abb305d3133c state=0
> Sep 25 19:34:07 [2141] linx60149 stonith-ng: info:
> dynamic_list_search_cb: Refreshing port list for st-lxha
> Sep 25 19:34:07 [2141] linx60149 stonith-ng: info:
> process_remote_stonith_query: Query result 1 of 1 from linx60149
> for linx60147/reboot (1 devices) 05d67c3b-8ff2-4e8d-b56f-abb305d3133c
> Sep 25 19:34:07 [2141] linx60149 stonith-ng: info:
> call_remote_stonith: Total timeout set to 60 for peer's fencing
> of linx60147 for crmd.2145|id=05d67c3b-8ff2-4e8d-b56f-abb305d3133c
> Sep 25 19:34:07 [2141] linx60149 stonith-ng: info:
> call_remote_stonith: Requesting that 'linx60149' perform op
> 'linx60147 reboot' for crmd.2145 (72s, 0s)
> Sep 25 19:34:07 [2141] linx60149 stonith-ng: notice:
> can_fence_host_with_device: st-lxha can fence (reboot)
> linx60147: dynamic-list
> Sep 25 19:34:07 [2141] linx60149 stonith-ng: info:
> stonith_fence_get_devices_cb: Found 1 matching devices for
> 'linx60147'
>
> Is this cluster behavior normal, or is there an error in my configuration?
> How can I avoid it?
>
> Thanks,
>
> zhongbin
>
>
>
--
Ken Gaillot <kgaillot at redhat.com>