[ClusterLabs Developers] When I pull out all heartbeat cables, the active node and the passive node are both fenced (rebooted) by each other at the same time
zhongbin
zhongbin314 at 163.com
Tue Sep 25 12:38:26 UTC 2018
Hi,
I created an active/passive cluster on Debian 6.0.
nodes: linx60147 linx60149
corosync 2.3.4 + pacemaker 1.1.17
crm configure show:
node 3232244115: linx60147 \
        attributes standby=off
node 3232244117: linx60149 \
        attributes standby=off
primitive rsc-cpu ocf:pacemaker:HealthCPU \
        params yellow_limit=60 red_limit=20 \
        op monitor interval=30s timeout=3m \
        op start interval=0 timeout=3m \
        op stop interval=0 timeout=3m \
        meta target-role=Started
primitive rsc-vip-public IPaddr \
        op monitor interval=30s timeout=3m start-delay=15 \
        op start interval=0 timeout=3m \
        op stop interval=0 timeout=3m \
        params ip=192.168.22.224 cidr_netmask=255.255.255.0 \
        meta migration-threshold=10
primitive st-lxha stonith:external/ssh \
        params hostlist="linx60147 linx60149" \
        meta target-role=Started is-managed=true
group rsc-group rsc-vip-public rsc-cpu \
        meta target-role=Started
location rsc-loc1 rsc-group 200: linx60147
location rsc-loc2 rsc-group 100: linx60149
location rsc-loc3 st-lxha 100: linx60147
location rsc-loc4 st-lxha 200: linx60149
property cib-bootstrap-options: \
        have-watchdog=false \
        dc-version=1.1.17-b36b869ca8 \
        cluster-infrastructure=corosync \
        expected-quorum-votes=2 \
        start-failure-is-fatal=false \
        stonith-enabled=true \
        stonith-action=reboot \
        no-quorum-policy=ignore \
        last-lrm-refresh=1536225282
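
For anyone trying to reproduce this, the fencing device registration can be double-checked from either node with stonith_admin, for example (output not included here):

        # list all devices registered with the fencer
        stonith_admin --list-registered
        # list the device(s) that can fence a given target, e.g. linx60149
        stonith_admin --list linx60149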
When I pull out all heartbeat cables, the active node and the passive node are both fenced (rebooted) by each other at the same time.
linx60147 corosync.log:
Sep 25 19:34:08 [2198] linx60147 pengine: notice: unpack_config: On loss of CCM Quorum: Ignore
Sep 25 19:34:08 [2198] linx60147 pengine: warning: pe_fence_node: Cluster node linx60149 will be fenced: peer is no longer part of the cluster
Sep 25 19:34:08 [2198] linx60147 pengine: warning: determine_online_status: Node linx60149 is unclean
Sep 25 19:34:08 [2198] linx60147 pengine: info: determine_online_status_fencing: Node linx60147 is active
Sep 25 19:34:08 [2198] linx60147 pengine: info: determine_online_status: Node linx60147 is online
Sep 25 19:34:08 [2198] linx60147 pengine: info: unpack_node_loop: Node 3232244117 is already processed
Sep 25 19:34:08 [2198] linx60147 pengine: info: unpack_node_loop: Node 3232244115 is already processed
Sep 25 19:34:08 [2198] linx60147 pengine: info: unpack_node_loop: Node 3232244117 is already processed
Sep 25 19:34:08 [2198] linx60147 pengine: info: unpack_node_loop: Node 3232244115 is already processed
Sep 25 19:34:08 [2198] linx60147 pengine: info: group_print: Resource Group: rsc-group
Sep 25 19:34:08 [2198] linx60147 pengine: info: common_print: rsc-vip-public (ocf::heartbeat:IPaddr): Started linx60147
Sep 25 19:34:08 [2198] linx60147 pengine: info: common_print: rsc-cpu (ocf::pacemaker:HealthCPU): Started linx60147
Sep 25 19:34:08 [2198] linx60147 pengine: info: common_print: st-lxha (stonith:external/ssh): Started linx60149 (UNCLEAN)
Sep 25 19:34:08 [2198] linx60147 pengine: warning: custom_action: Action st-lxha_stop_0 on linx60149 is unrunnable (offline)
Sep 25 19:34:08 [2198] linx60147 pengine: warning: stage6: Scheduling Node linx60149 for STONITH
Sep 25 19:34:08 [2198] linx60147 pengine: info: native_stop_constraints: st-lxha_stop_0 is implicit after linx60149 is fenced
Sep 25 19:34:08 [2198] linx60147 pengine: notice: LogNodeActions: * Fence linx60149
Sep 25 19:34:08 [2198] linx60147 pengine: info: LogActions: Leave rsc-vip-public (Started linx60147)
Sep 25 19:34:08 [2198] linx60147 pengine: info: LogActions: Leave rsc-cpu (Started linx60147)
Sep 25 19:34:08 [2198] linx60147 pengine: notice: LogActions: Move st-lxha (Started linx60149 -> linx60147)
Sep 25 19:34:08 [2198] linx60147 pengine: warning: process_pe_message: Calculated transition 2 (with warnings), saving inputs in /var/lib/pacemaker/pengine/pe-warn-64.bz2
Sep 25 19:34:08 [2199] linx60147 crmd: info: do_state_transition: State transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE | input=I_PE_SUCCESS cause=C_IPC_MESSAGE origin=handle_response
Sep 25 19:34:08 [2199] linx60147 crmd: info: do_te_invoke: Processing graph 2 (ref=pe_calc-dc-1537875248-29) derived from /var/lib/pacemaker/pengine/pe-warn-64.bz2
Sep 25 19:34:08 [2199] linx60147 crmd: notice: te_fence_node: Requesting fencing (reboot) of node linx60149 | action=15 timeout=60000
Sep 25 19:34:08 [2194] linx60147 stonith-ng: notice: handle_request: Client crmd.2199.76b55dfe wants to fence (reboot) 'linx60149' with device '(any)'
Sep 25 19:34:08 [2194] linx60147 stonith-ng: notice: initiate_remote_stonith_op: Requesting peer fencing (reboot) of linx60149 | id=07b318da-0c28-476a-a9f3-d73d7a5142dc state=0
Sep 25 19:34:08 [2199] linx60147 crmd: notice: te_rsc_command: Initiating start operation st-lxha_start_0 locally on linx60147 | action 13
Sep 25 19:34:08 [2199] linx60147 crmd: info: do_lrm_rsc_op: Performing key=13:2:0:05c1e621-d48e-4854-a666-4c664da9e32d op=st-lxha_start_0
Sep 25 19:34:08 [2195] linx60147 lrmd: info: log_execute: executing - rsc:st-lxha action:start call_id:18
Sep 25 19:34:08 [2194] linx60147 stonith-ng: info: dynamic_list_search_cb: Refreshing port list for st-lxha
Sep 25 19:34:08 [2194] linx60147 stonith-ng: info: process_remote_stonith_query: Query result 1 of 1 from linx60147 for linx60149/reboot (1 devices) 07b318da-0c28-476a-a9f3-d73d7a5142dc
Sep 25 19:34:08 [2194] linx60147 stonith-ng: info: process_remote_stonith_query: All query replies have arrived, continuing (1 expected/1 received)
Sep 25 19:34:08 [2194] linx60147 stonith-ng: info: call_remote_stonith: Total timeout set to 60 for peer's fencing of linx60149 for crmd.2199|id=07b318da-0c28-476a-a9f3-d73d7a5142dc
Sep 25 19:34:08 [2194] linx60147 stonith-ng: info: call_remote_stonith: Requesting that 'linx60147' perform op 'linx60149 reboot' for crmd.2199 (72s, 0s)
Sep 25 19:34:08 [2194] linx60147 stonith-ng: notice: can_fence_host_with_device: st-lxha can fence (reboot) linx60149: dynamic-list
Sep 25 19:34:08 [2194] linx60147 stonith-ng: info: stonith_fence_get_devices_cb: Found 1 matching devices for 'linx60149'
Sep 25 19:34:09 [2195] linx60147 lrmd: info: log_finished: finished - rsc:st-lxha action:start call_id:18 exit-code:0 exec-time:1024ms queue-time:0ms
Sep 25 19:34:09 [2199] linx60147 crmd: notice: process_lrm_event: Result of start operation for st-lxha on linx60147: 0 (ok) | call=18 key=st-lxha_start_0 confirmed=true cib-update=51
Sep 25 19:34:09 [2193] linx60147 cib: info: cib_process_request: Forwarding cib_modify operation for section status to all (origin=local/crmd/51)
Sep 25 19:34:09 [2193] linx60147 cib: info: cib_perform_op: Diff: --- 0.102.21 2
Sep 25 19:34:09 [2193] linx60147 cib: info: cib_perform_op: Diff: +++ 0.102.22 (null)
Sep 25 19:34:09 [2193] linx60147 cib: info: cib_perform_op: + /cib: @num_updates=22
Sep 25 19:34:09 [2193] linx60147 cib: info: cib_perform_op: + /cib/status/node_state[@id='3232244115']: @crm-debug-origin=do_update_resource
Sep 25 19:34:09 [2193] linx60147 cib: info: cib_perform_op: + /cib/status/node_state[@id='3232244115']/lrm[@id='3232244115']/lrm_resources/lrm_resource[@id='st-lxha']/lrm_rsc_op[@id='st-lxha_last_0']: @operation_key=st-lxha_start_0, @operation=start, @transition-key=13:2:0:05c1e621-d48e-4854-a666-4c664da9e32d, @transition-magic=0:0;13:2:0:05c1e621-d48e-4854-a666-4c664da9e32d, @call-id=18, @rc-code=0, @last-run=1537875248, @last-rc-change=1537875248, @exec-time=1024
Sep 25 19:34:09 [2193] linx60147 cib: info: cib_process_request: Completed cib_modify operation for section status: OK (rc=0, origin=linx60147/crmd/51, version=0.102.22)
Sep 25 19:34:09 [2199] linx60147 crmd: info: match_graph_event: Action st-lxha_start_0 (13) confirmed on linx60147 (rc=0)
linx60149 corosync.log:
Sep 25 19:34:07 [2144] linx60149 pengine: notice: unpack_config: On loss of CCM Quorum: Ignore
Sep 25 19:34:07 [2144] linx60149 pengine: info: determine_online_status_fencing: Node linx60149 is active
Sep 25 19:34:07 [2144] linx60149 pengine: info: determine_online_status: Node linx60149 is online
Sep 25 19:34:07 [2144] linx60149 pengine: warning: pe_fence_node: Cluster node linx60147 will be fenced: peer is no longer part of the cluster
Sep 25 19:34:07 [2144] linx60149 pengine: warning: determine_online_status: Node linx60147 is unclean
Sep 25 19:34:07 [2144] linx60149 pengine: info: unpack_node_loop: Node 3232244117 is already processed
Sep 25 19:34:07 [2144] linx60149 pengine: info: unpack_node_loop: Node 3232244115 is already processed
Sep 25 19:34:07 [2144] linx60149 pengine: info: unpack_node_loop: Node 3232244117 is already processed
Sep 25 19:34:07 [2144] linx60149 pengine: info: unpack_node_loop: Node 3232244115 is already processed
Sep 25 19:34:07 [2144] linx60149 pengine: info: group_print: Resource Group: rsc-group
Sep 25 19:34:07 [2144] linx60149 pengine: info: common_print: rsc-vip-public (ocf::heartbeat:IPaddr): Started linx60147 (UNCLEAN)
Sep 25 19:34:07 [2144] linx60149 pengine: info: common_print: rsc-cpu (ocf::pacemaker:HealthCPU): Started linx60147 (UNCLEAN)
Sep 25 19:34:07 [2144] linx60149 pengine: info: common_print: st-lxha (stonith:external/ssh): Started linx60149
Sep 25 19:34:07 [2144] linx60149 pengine: warning: custom_action: Action rsc-vip-public_stop_0 on linx60147 is unrunnable (offline)
Sep 25 19:34:07 [2144] linx60149 pengine: info: RecurringOp: Start recurring monitor (30s) for rsc-vip-public on linx60149
Sep 25 19:34:07 [2144] linx60149 pengine: warning: custom_action: Action rsc-cpu_stop_0 on linx60147 is unrunnable (offline)
Sep 25 19:34:07 [2144] linx60149 pengine: info: RecurringOp: Start recurring monitor (30s) for rsc-cpu on linx60149
Sep 25 19:34:07 [2144] linx60149 pengine: warning: stage6: Scheduling Node linx60147 for STONITH
Sep 25 19:34:07 [2144] linx60149 pengine: info: native_stop_constraints: rsc-vip-public_stop_0 is implicit after linx60147 is fenced
Sep 25 19:34:07 [2144] linx60149 pengine: info: native_stop_constraints: rsc-cpu_stop_0 is implicit after linx60147 is fenced
Sep 25 19:34:07 [2144] linx60149 pengine: notice: LogNodeActions: * Fence linx60147
Sep 25 19:34:07 [2144] linx60149 pengine: notice: LogActions: Move rsc-vip-public (Started linx60147 -> linx60149)
Sep 25 19:34:07 [2144] linx60149 pengine: notice: LogActions: Move rsc-cpu (Started linx60147 -> linx60149)
Sep 25 19:34:07 [2144] linx60149 pengine: info: LogActions: Leave st-lxha (Started linx60149)
Sep 25 19:34:07 [2144] linx60149 pengine: warning: process_pe_message: Calculated transition 0 (with warnings), saving inputs in /var/lib/pacemaker/pengine/pe-warn-52.bz2
Sep 25 19:34:07 [2145] linx60149 crmd: info: do_state_transition: State transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE | input=I_PE_SUCCESS cause=C_IPC_MESSAGE origin=handle_response
Sep 25 19:34:07 [2145] linx60149 crmd: info: do_te_invoke: Processing graph 0 (ref=pe_calc-dc-1537875247-15) derived from /var/lib/pacemaker/pengine/pe-warn-52.bz2
Sep 25 19:34:07 [2145] linx60149 crmd: notice: te_fence_node: Requesting fencing (reboot) of node linx60147 | action=15 timeout=60000
Sep 25 19:34:07 [2141] linx60149 stonith-ng: notice: handle_request: Client crmd.2145.321125df wants to fence (reboot) 'linx60147' with device '(any)'
Sep 25 19:34:07 [2141] linx60149 stonith-ng: notice: initiate_remote_stonith_op: Requesting peer fencing (reboot) of linx60147 | id=05d67c3b-8ff2-4e8d-b56f-abb305d3133c state=0
Sep 25 19:34:07 [2141] linx60149 stonith-ng: info: dynamic_list_search_cb: Refreshing port list for st-lxha
Sep 25 19:34:07 [2141] linx60149 stonith-ng: info: process_remote_stonith_query: Query result 1 of 1 from linx60149 for linx60147/reboot (1 devices) 05d67c3b-8ff2-4e8d-b56f-abb305d3133c
Sep 25 19:34:07 [2141] linx60149 stonith-ng: info: call_remote_stonith: Total timeout set to 60 for peer's fencing of linx60147 for crmd.2145|id=05d67c3b-8ff2-4e8d-b56f-abb305d3133c
Sep 25 19:34:07 [2141] linx60149 stonith-ng: info: call_remote_stonith: Requesting that 'linx60149' perform op 'linx60147 reboot' for crmd.2145 (72s, 0s)
Sep 25 19:34:07 [2141] linx60149 stonith-ng: notice: can_fence_host_with_device: st-lxha can fence (reboot) linx60147: dynamic-list
Sep 25 19:34:07 [2141] linx60149 stonith-ng: info: stonith_fence_get_devices_cb: Found 1 matching devices for 'linx60147'
Is this cluster behavior normal, or is there an error in my configuration? How can I avoid it?
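
Would adding a random fencing delay to the stonith device help avoid the race? Something like the following untested sketch, using the pcmk_delay_max instance attribute (the 15s value is arbitrary):

        # hypothetical change: delay each fencing action by a random 0-15s
        # so that only one node wins the shoot-out
        primitive st-lxha stonith:external/ssh \
                params hostlist="linx60147 linx60149" pcmk_delay_max=15 \
                meta target-role=Started is-managed=true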
Thanks,
zhongbin