<p dir="ltr">I have seen this behavior on several virtualsed environments. when vm backup starts, the VM actually freezes for a (short?) Period of time.I guess it then no more responding to the other cluster nodes thus triggering unexpected fail over and/or fencing.I have this kind of behavior on VMware env using veam backup, as well promox (+ u don't what backup tool)<br>
That's actually an interesting topic I never though about rising here.<br>
How can we avoid that? Increasing timeout? I am afraid we would have to reach unacceptable high timeout values and am not even sure that would fix the pb.<br>
I think not all VM snapshots strategy would trigger that PV, do you guys have any feedback to provide on the backup/snapshot method best suits corosync clusters?</p>
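<p dir="ltr">For reference, this is roughly what raising the timeout would look like; it is only a sketch, and the 20000 ms value is purely illustrative, not a recommendation. On a CMAN-based stack like the one in the logs below, the token timeout goes into /etc/cluster/cluster.conf:<br>
<br>
&lt;!-- inside the existing &lt;cluster&gt; element; token is in milliseconds (illustrative value) --&gt;<br>
&lt;totem token="20000"/&gt;<br>
<br>
On a plain corosync stack, the equivalent setting lives in the totem section of /etc/corosync/corosync.conf:<br>
<br>
totem {<br>
&nbsp;&nbsp;&nbsp;&nbsp;# illustrative; would need to exceed the longest snapshot stun time (in ms)<br>
&nbsp;&nbsp;&nbsp;&nbsp;token: 20000<br>
}<br>
<br>
But that illustrates the basic problem: any backup freeze longer than the token timeout still looks like a dead node, so the timeout would have to outlast the worst-case snapshot stun to help at all.</p>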
<p dir="ltr">Regards</p>
<div class="gmail_quote">Le 9 oct. 2014 01:24, "Alex Samad - Yieldbroker" <<a href="mailto:Alex.Samad@yieldbroker.com">Alex.Samad@yieldbroker.com</a>> a écrit :<br type="attribution"><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">One of my nodes died in a 2 node cluster<br>
<br>
I gather something went wrong, and it fenced/killed itself. But I am not sure what happened.<br>
<br>
I think the VM backups ran around that time, and a snapshot of the VM could have happened.<br>
<br>
But there is nothing I can put my finger on.<br>
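One thing I am thinking of trying (just a sketch; assuming the cluster logs go to /var/log/messages): pull out the membership, quorum and fencing events and line their timestamps up against the backup job's snapshot create/remove times:<br>
<br>
# cluster membership, quorum and fencing events, to compare with the backup window<br>
grep -Ei "totem|quorum|fenced" /var/log/messages<br>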
<br>
Output from messages around that time<br>
<br>
This is on devrp1:<br>
Oct 8 23:31:38 devrp1 corosync[1670]: [TOTEM ] A processor failed, forming new configuration.<br>
Oct 8 23:31:40 devrp1 corosync[1670]: [CMAN ] quorum lost, blocking activity<br>
Oct 8 23:31:40 devrp1 corosync[1670]: [QUORUM] This node is within the non-primary component and will NOT provide any services.<br>
Oct 8 23:31:40 devrp1 corosync[1670]: [QUORUM] Members[1]: 1<br>
Oct 8 23:31:40 devrp1 corosync[1670]: [TOTEM ] A processor joined or left the membership and a new membership was formed.<br>
Oct 8 23:31:40 devrp1 corosync[1670]: [CPG ] chosen downlist: sender r(0) ip(10.172.214.51) ; members(old:2 left:1)<br>
Oct 8 23:31:40 devrp1 corosync[1670]: [MAIN ] Completed service synchronization, ready to provide service.<br>
Oct 8 23:31:41 devrp1 kernel: dlm: closing connection to node 2<br>
Oct 8 23:31:42 devrp1 crmd[2350]: notice: cman_event_callback: Membership 424: quorum lost<br>
Oct 8 23:31:42 devrp1 corosync[1670]: [TOTEM ] A processor joined or left the membership and a new membership was formed.<br>
Oct 8 23:31:42 devrp1 corosync[1670]: [CMAN ] quorum regained, resuming activity<br>
Oct 8 23:31:42 devrp1 corosync[1670]: [QUORUM] This node is within the primary component and will provide service.<br>
Oct 8 23:31:42 devrp1 corosync[1670]: [QUORUM] Members[2]: 1 2<br>
Oct 8 23:31:42 devrp1 corosync[1670]: [QUORUM] Members[2]: 1 2<br>
Oct 8 23:31:42 devrp1 corosync[1670]: [CPG ] chosen downlist: sender r(0) ip(10.172.214.51) ; members(old:1 left:0)<br>
Oct 8 23:31:42 devrp1 corosync[1670]: [MAIN ] Completed service synchronization, ready to provide service.<br>
Oct 8 23:31:42 devrp1 crmd[2350]: notice: crm_update_peer_state: cman_event_callback: Node devrp2[2] - state is now lost (was member)<br>
Oct 8 23:31:42 devrp1 crmd[2350]: warning: reap_dead_nodes: Our DC node (devrp2) left the cluster<br>
Oct 8 23:31:42 devrp1 crmd[2350]: notice: cman_event_callback: Membership 428: quorum acquired<br>
Oct 8 23:31:42 devrp1 crmd[2350]: notice: crm_update_peer_state: cman_event_callback: Node devrp2[2] - state is now member (was lost)<br>
Oct 8 23:31:42 devrp1 crmd[2350]: notice: do_state_transition: State transition S_NOT_DC -> S_ELECTION [ input=I_ELECTION cause=C_FSA_INTERNAL origin=reap_dead_nodes ]<br>
Oct 8 23:31:42 devrp1 corosync[1670]: cman killed by node 2 because we were killed by cman_tool or other application<br>
Oct 8 23:31:42 devrp1 pacemakerd[2339]: error: pcmk_cpg_dispatch: Connection to the CPG API failed: Library error (2)<br>
Oct 8 23:31:42 devrp1 stonith-ng[2346]: error: pcmk_cpg_dispatch: Connection to the CPG API failed: Library error (2)<br>
Oct 8 23:31:42 devrp1 crmd[2350]: error: pcmk_cpg_dispatch: Connection to the CPG API failed: Library error (2)<br>
Oct 8 23:31:42 devrp1 crmd[2350]: error: crmd_cs_destroy: connection terminated<br>
Oct 8 23:31:43 devrp1 fenced[1726]: cluster is down, exiting<br>
Oct 8 23:31:43 devrp1 fenced[1726]: daemon cpg_dispatch error 2<br>
Oct 8 23:31:43 devrp1 attrd[2348]: error: pcmk_cpg_dispatch: Connection to the CPG API failed: Library error (2)<br>
Oct 8 23:31:43 devrp1 attrd[2348]: crit: attrd_cs_destroy: Lost connection to Corosync service!<br>
Oct 8 23:31:43 devrp1 attrd[2348]: notice: main: Exiting...<br>
Oct 8 23:31:43 devrp1 attrd[2348]: notice: main: Disconnecting client 0x18cf240, pid=2350...<br>
Oct 8 23:31:43 devrp1 pacemakerd[2339]: error: mcp_cpg_destroy: Connection destroyed<br>
Oct 8 23:31:43 devrp1 cib[2345]: error: pcmk_cpg_dispatch: Connection to the CPG API failed: Library error (2)<br>
Oct 8 23:31:43 devrp1 cib[2345]: error: cib_cs_destroy: Corosync connection lost! Exiting.<br>
Oct 8 23:31:43 devrp1 stonith-ng[2346]: error: stonith_peer_cs_destroy: Corosync connection terminated<br>
Oct 8 23:31:43 devrp1 dlm_controld[1752]: cluster is down, exiting<br>
Oct 8 23:31:43 devrp1 dlm_controld[1752]: daemon cpg_dispatch error 2<br>
Oct 8 23:31:43 devrp1 gfs_controld[1801]: cluster is down, exiting<br>
Oct 8 23:31:43 devrp1 crmd[2350]: notice: crmd_exit: Forcing immediate exit: Link has been severed (67)<br>
Oct 8 23:31:43 devrp1 attrd[2348]: error: attrd_cib_connection_destroy: Connection to the CIB terminated...<br>
Oct 8 23:31:43 devrp1 lrmd[2347]: warning: qb_ipcs_event_sendv: new_event_notification (2347-2350-6): Bad file descriptor (9)<br>
Oct 8 23:31:43 devrp1 lrmd[2347]: warning: send_client_notify: Notification of client crmd/94e94935-2221-434d-8a6f-90eba4ede55b failed<br>
Oct 8 23:31:43 devrp1 lrmd[2347]: warning: send_client_notify: Notification of client crmd/94e94935-2221-434d-8a6f-90eba4ede55b failed<br>
<br>
<br>
This is on devrp2:<br>
Oct 8 23:31:26 devrp2 kernel: IN=eth0 OUT= MAC=00:50:56:a6:3a:5d:00:00:00:00:00:00:08:00 SRC=10.0.0.0 DST=224.0.0.1 LEN=36 TOS=0x00 PREC=0x00 TTL=1 ID=0 PROTO=2<br>
Oct 8 23:31:42 devrp2 crmd[2361]: notice: cman_event_callback: Membership 428: quorum lost<br>
Oct 8 23:31:43 devrp2 fenced[1747]: telling cman to remove nodeid 1 from cluster<br>
Oct 8 23:31:45 devrp2 corosync[1685]: [TOTEM ] A processor failed, forming new configuration.<br>
Oct 8 23:31:45 devrp2 corosync[1685]: [CMAN ] quorum lost, blocking activity<br>
Oct 8 23:31:45 devrp2 corosync[1685]: [QUORUM] This node is within the non-primary component and will NOT provide any services.<br>
Oct 8 23:31:45 devrp2 corosync[1685]: [QUORUM] Members[1]: 2<br>
Oct 8 23:31:45 devrp2 corosync[1685]: [TOTEM ] A processor joined or left the membership and a new membership was formed.<br>
Oct 8 23:31:45 devrp2 corosync[1685]: [CMAN ] quorum regained, resuming activity<br>
Oct 8 23:31:45 devrp2 corosync[1685]: [QUORUM] This node is within the primary component and will provide service.<br>
Oct 8 23:31:45 devrp2 corosync[1685]: [QUORUM] Members[2]: 1 2<br>
Oct 8 23:31:45 devrp2 corosync[1685]: [QUORUM] Members[2]: 1 2<br>
Oct 8 23:31:45 devrp2 corosync[1685]: [CPG ] chosen downlist: sender r(0) ip(10.172.214.52) ; members(old:2 left:1)<br>
Oct 8 23:31:45 devrp2 corosync[1685]: [MAIN ] Completed service synchronization, ready to provide service.<br>
Oct 8 23:31:45 devrp2 fenced[1747]: receive_start 1:3 add node with started_count 1<br>
Oct 8 23:31:45 devrp2 crmd[2361]: notice: cman_event_callback: Membership 428: quorum acquired<br>
Oct 8 23:31:45 devrp2 crmd[2361]: notice: crm_update_peer_state: cman_event_callback: Node devrp1[1] - state is now lost (was member)<br>
Oct 8 23:31:45 devrp2 crmd[2361]: warning: match_down_event: No match for shutdown action on devrp1<br>
Oct 8 23:31:45 devrp2 crmd[2361]: notice: peer_update_callback: Stonith/shutdown of devrp1 not matched<br>
Oct 8 23:31:49 devrp2 crmd[2361]: notice: do_state_transition: State transition S_IDLE -> S_INTEGRATION [ input=I_NODE_JOIN cause=C_FSA_INTERNAL origin=check_join_state ]<br>
Oct 8 23:31:49 devrp2 crmd[2361]: warning: match_down_event: No match for shutdown action on devrp1<br>
Oct 8 23:31:49 devrp2 crmd[2361]: notice: peer_update_callback: Stonith/shutdown of devrp1 not matched<br>
Oct 8 23:31:49 devrp2 crmd[2361]: notice: do_election_count_vote: Election 2 (current: 2, owner: devrp1): Processed vote from devrp1 (Peer is not part of our cluster)<br>
Oct 8 23:31:49 devrp2 pengine[2360]: notice: unpack_config: On loss of CCM Quorum: Ignore<br>
Oct 8 23:31:49 devrp2 kernel: dlm: closing connection to node 1<br>
Oct 8 23:31:49 devrp2 attrd[2359]: notice: attrd_local_callback: Sending full refresh (origin=crmd)<br>
Oct 8 23:31:49 devrp2 attrd[2359]: notice: attrd_trigger_update: Sending flush op to all hosts for: last-failure-ybrpstat (1412658023)<br>
Oct 8 23:31:49 devrp2 dlm_controld[1773]: /sys/kernel/config/dlm/cluster/comms/1: rmdir failed: 2<br>
Oct 8 23:31:49 devrp2 attrd[2359]: notice: attrd_trigger_update: Sending flush op to all hosts for: probe_complete (true)<br>
Oct 8 23:31:49 devrp2 pengine[2360]: notice: LogActions: Start ybrpip#011(devrp2)<br>
Oct 8 23:31:49 devrp2 pengine[2360]: notice: process_pe_message: Calculated Transition 1061: /var/lib/pacemaker/pengine/pe-input-2282.bz2<br>
Oct 8 23:31:50 devrp2 pengine[2360]: notice: unpack_config: On loss of CCM Quorum: Ignore<br>
Oct 8 23:31:50 devrp2 pengine[2360]: notice: LogActions: Start ybrpip#011(devrp2)<br>
Oct 8 23:31:50 devrp2 pengine[2360]: notice: process_pe_message: Calculated Transition 1062: /var/lib/pacemaker/pengine/pe-input-2283.bz2<br>
Oct 8 23:31:50 devrp2 crmd[2361]: notice: te_rsc_command: Initiating action 5: start ybrpip_start_0 on devrp2 (local)<br>
Oct 8 23:31:51 devrp2 IPaddr2(ybrpip)[30442]: INFO: Adding inet address 10.172.214.50/24 with broadcast address 10.172.214.255 to device eth0<br>
Oct 8 23:31:51 devrp2 IPaddr2(ybrpip)[30442]: INFO: Bringing device eth0 up<br>
Oct 8 23:31:51 devrp2 IPaddr2(ybrpip)[30442]: INFO: /usr/libexec/heartbeat/send_arp -i 200 -r 5 -p /var/run/resource-agents/send_arp-10.172.214.50 eth0 10.172.214.50 auto not_used not_used<br>
Oct 8 23:31:51 devrp2 crmd[2361]: notice: process_lrm_event: LRM operation ybrpip_start_0 (call=995, rc=0, cib-update=1494, confirmed=true) ok<br>
Oct 8 23:31:51 devrp2 crmd[2361]: notice: te_rsc_command: Initiating action 6: monitor ybrpip_monitor_5000 on devrp2 (local)<br>
Oct 8 23:31:51 devrp2 crmd[2361]: notice: process_lrm_event: LRM operation ybrpip_monitor_5000 (call=998, rc=0, cib-update=1495, confirmed=false) ok<br>
Oct 8 23:31:51 devrp2 crmd[2361]: notice: run_graph: Transition 1062 (Complete=2, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-2283.bz2): Complete<br>
Oct 8 23:31:51 devrp2 crmd[2361]: notice: do_state_transition: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]<br>
Oct 8 23:31:52 devrp2 ntpd[2099]: Listen normally on 6 eth0 10.172.214.50 UDP 123<br>
Oct 8 23:31:52 devrp2 ntpd[2099]: peers refreshed<br>
Oct 8 23:31:52 devrp2 corosync[1685]: [TOTEM ] A processor failed, forming new configuration.<br>
Oct 8 23:31:54 devrp2 corosync[1685]: [CMAN ] quorum lost, blocking activity<br>
Oct 8 23:31:54 devrp2 corosync[1685]: [QUORUM] This node is within the non-primary component and will NOT provide any services.<br>
Oct 8 23:31:54 devrp2 corosync[1685]: [QUORUM] Members[1]: 2<br>
Oct 8 23:31:54 devrp2 corosync[1685]: [TOTEM ] A processor joined or left the membership and a new membership was formed.<br>
Oct 8 23:31:54 devrp2 crmd[2361]: notice: cman_event_callback: Membership 432: quorum lost<br>
Oct 8 23:31:54 devrp2 corosync[1685]: [CPG ] chosen downlist: sender r(0) ip(10.172.214.52) ; members(old:2 left:1)<br>
Oct 8 23:31:54 devrp2 corosync[1685]: [MAIN ] Completed service synchronization, ready to provide service.<br>
Oct 8 23:31:54 devrp2 crmd[2361]: warning: match_down_event: No match for shutdown action on devrp1<br>
Oct 8 23:31:54 devrp2 crmd[2361]: notice: peer_update_callback: Stonith/shutdown of devrp1 not matched<br>
Oct 8 23:31:54 devrp2 crmd[2361]: notice: do_state_transition: State transition S_IDLE -> S_INTEGRATION [ input=I_NODE_JOIN cause=C_FSA_INTERNAL origin=check_join_state ]<br>
Oct 8 23:31:54 devrp2 attrd[2359]: notice: attrd_local_callback: Sending full refresh (origin=crmd)<br>
Oct 8 23:31:54 devrp2 attrd[2359]: notice: attrd_trigger_update: Sending flush op to all hosts for: last-failure-ybrpstat (1412658023)<br>
Oct 8 23:31:55 devrp2 attrd[2359]: notice: attrd_trigger_update: Sending flush op to all hosts for: probe_complete (true)<br>
Oct 8 23:31:55 devrp2 pengine[2360]: notice: unpack_config: On loss of CCM Quorum: Ignore<br>
Oct 8 23:31:55 devrp2 pengine[2360]: notice: process_pe_message: Calculated Transition 1063: /var/lib/pacemaker/pengine/pe-input-2284.bz2<br>
Oct 8 23:31:55 devrp2 crmd[2361]: notice: run_graph: Transition 1063 (Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-2284.bz2): Complete<br>
Oct 8 23:31:55 devrp2 crmd[2361]: notice: do_state_transition: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]<br>
Oct 8 23:31:59 devrp2 kernel: IN=eth0 OUT= MAC=00:50:56:a6:3a:5d:00:00:00:00:00:00:08:00 SRC=10.0.0.0 DST=224.0.0.1 LEN=36 TOS=0x00 PREC=0x00 TTL=1 ID=0 PROTO=2<br>
Oct 8 23:36:52 devrp2 kernel: IN=eth0 OUT= MAC=00:50:56:a6:3a:5d:00:00:00:00:00:00:08:00 SRC=10.0.0.0 DST=224.0.0.1 LEN=36 TOS=0x00 PREC=0x00 TTL=1 ID=0 PROTO=2<br>
Oct 8 23:46:55 devrp2 crmd[2361]: notice: do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_TIMER_POPPED origin=crm_timer_popped ]<br>
Oct 8 23:46:55 devrp2 pengine[2360]: notice: unpack_config: On loss of CCM Quorum: Ignore<br>
Oct 8 23:46:55 devrp2 pengine[2360]: notice: process_pe_message: Calculated Transition 1064: /var/lib/pacemaker/pengine/pe-input-2284.bz2<br>
Oct 8 23:46:55 devrp2 crmd[2361]: notice: run_graph: Transition 1064 (Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-2284.bz2): Complete<br>
Oct 8 23:46:55 devrp2 crmd[2361]: notice: do_state_transition: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]<br>
<br>
_______________________________________________<br>
Pacemaker mailing list: <a href="mailto:Pacemaker@oss.clusterlabs.org">Pacemaker@oss.clusterlabs.org</a><br>
<a href="http://oss.clusterlabs.org/mailman/listinfo/pacemaker" target="_blank">http://oss.clusterlabs.org/mailman/listinfo/pacemaker</a><br>
<br>
Project Home: <a href="http://www.clusterlabs.org" target="_blank">http://www.clusterlabs.org</a><br>
Getting started: <a href="http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf" target="_blank">http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf</a><br>
Bugs: <a href="http://bugs.clusterlabs.org" target="_blank">http://bugs.clusterlabs.org</a><br>
</blockquote></div>