[Pacemaker] help deciphering output

Alexandre alxgomz at gmail.com
Thu Oct 9 02:06:20 EDT 2014


I have seen this behavior on several virtualsed environments. when vm
backup starts, the VM actually freezes for a (short?) Period of time.I
guess it then no more responding to the other cluster nodes thus triggering
unexpected fail over and/or fencing.I have this kind of behavior on VMware
env using veam backup, as well promox (+ u don't what backup tool)
That's actually an interesting topic I never though about rising here.
How can we avoid that? Increasing timeout? I am afraid we would have to
reach unacceptable high timeout values and am not even sure that would fix
the pb.
I think not all VM snapshots strategy would trigger that PV, do you guys
have any feedback to provide on the backup/snapshot method best suits
corosync clusters?

Regards
Le 9 oct. 2014 01:24, "Alex Samad - Yieldbroker" <Alex.Samad at yieldbroker.com>
a écrit :

> One of my nodes died in a 2 node cluster
>
> I gather something went wrong, and it fenced/killed itself. But I am not
> sure what happened.
>
> I think maybe around that time the VM backups happened and snap of the VM
> could have happened
>
> But there is nothing for me to put my finger on
>
> Output from messages around that time
>
> This is on devrp1
> Oct  8 23:31:38 devrp1 corosync[1670]:   [TOTEM ] A processor failed,
> forming new configuration.
> Oct  8 23:31:40 devrp1 corosync[1670]:   [CMAN  ] quorum lost, blocking
> activity
> Oct  8 23:31:40 devrp1 corosync[1670]:   [QUORUM] This node is within the
> non-primary component and will NOT provide any services.
> Oct  8 23:31:40 devrp1 corosync[1670]:   [QUORUM] Members[1]: 1
> Oct  8 23:31:40 devrp1 corosync[1670]:   [TOTEM ] A processor joined or
> left the membership and a new membership was formed.
> Oct  8 23:31:40 devrp1 corosync[1670]:   [CPG   ] chosen downlist: sender
> r(0) ip(10.172.214.51) ; members(old:2 left:1)
> Oct  8 23:31:40 devrp1 corosync[1670]:   [MAIN  ] Completed service
> synchronization, ready to provide service.
> Oct  8 23:31:41 devrp1 kernel: dlm: closing connection to node 2
> Oct  8 23:31:42 devrp1 crmd[2350]:   notice: cman_event_callback:
> Membership 424: quorum lost
> Oct  8 23:31:42 devrp1 corosync[1670]:   [TOTEM ] A processor joined or
> left the membership and a new membership was formed.
> Oct  8 23:31:42 devrp1 corosync[1670]:   [CMAN  ] quorum regained,
> resuming activity
> Oct  8 23:31:42 devrp1 corosync[1670]:   [QUORUM] This node is within the
> primary component and will provide service.
> Oct  8 23:31:42 devrp1 corosync[1670]:   [QUORUM] Members[2]: 1 2
> Oct  8 23:31:42 devrp1 corosync[1670]:   [QUORUM] Members[2]: 1 2
> Oct  8 23:31:42 devrp1 corosync[1670]:   [CPG   ] chosen downlist: sender
> r(0) ip(10.172.214.51) ; members(old:1 left:0)
> Oct  8 23:31:42 devrp1 corosync[1670]:   [MAIN  ] Completed service
> synchronization, ready to provide service.
> Oct  8 23:31:42 devrp1 crmd[2350]:   notice: crm_update_peer_state:
> cman_event_callback: Node devrp2[2] - state is now lost (was member)
> Oct  8 23:31:42 devrp1 crmd[2350]:  warning: reap_dead_nodes: Our DC node
> (devrp2) left the cluster
> Oct  8 23:31:42 devrp1 crmd[2350]:   notice: cman_event_callback:
> Membership 428: quorum acquired
> Oct  8 23:31:42 devrp1 crmd[2350]:   notice: crm_update_peer_state:
> cman_event_callback: Node devrp2[2] - state is now member (was lost)
> Oct  8 23:31:42 devrp1 crmd[2350]:   notice: do_state_transition: State
> transition S_NOT_DC -> S_ELECTION [ input=I_ELECTION cause=C_FSA_INTERNAL
> origin=reap_dead_nodes ]
> Oct  8 23:31:42 devrp1 corosync[1670]: cman killed by node 2 because we
> were killed by cman_tool or other application
> Oct  8 23:31:42 devrp1 pacemakerd[2339]:    error: pcmk_cpg_dispatch:
> Connection to the CPG API failed: Library error (2)
> Oct  8 23:31:42 devrp1 stonith-ng[2346]:    error: pcmk_cpg_dispatch:
> Connection to the CPG API failed: Library error (2)
> Oct  8 23:31:42 devrp1 crmd[2350]:    error: pcmk_cpg_dispatch: Connection
> to the CPG API failed: Library error (2)
> Oct  8 23:31:42 devrp1 crmd[2350]:    error: crmd_cs_destroy: connection
> terminated
> Oct  8 23:31:43 devrp1 fenced[1726]: cluster is down, exiting
> Oct  8 23:31:43 devrp1 fenced[1726]: daemon cpg_dispatch error 2
> Oct  8 23:31:43 devrp1 attrd[2348]:    error: pcmk_cpg_dispatch:
> Connection to the CPG API failed: Library error (2)
> Oct  8 23:31:43 devrp1 attrd[2348]:     crit: attrd_cs_destroy: Lost
> connection to Corosync service!
> Oct  8 23:31:43 devrp1 attrd[2348]:   notice: main: Exiting...
> Oct  8 23:31:43 devrp1 attrd[2348]:   notice: main: Disconnecting client
> 0x18cf240, pid=2350...
> Oct  8 23:31:43 devrp1 pacemakerd[2339]:    error: mcp_cpg_destroy:
> Connection destroyed
> Oct  8 23:31:43 devrp1 cib[2345]:    error: pcmk_cpg_dispatch: Connection
> to the CPG API failed: Library error (2)
> Oct  8 23:31:43 devrp1 cib[2345]:    error: cib_cs_destroy: Corosync
> connection lost!  Exiting.
> Oct  8 23:31:43 devrp1 stonith-ng[2346]:    error:
> stonith_peer_cs_destroy: Corosync connection terminated
> Oct  8 23:31:43 devrp1 dlm_controld[1752]: cluster is down, exiting
> Oct  8 23:31:43 devrp1 dlm_controld[1752]: daemon cpg_dispatch error 2
> Oct  8 23:31:43 devrp1 gfs_controld[1801]: cluster is down, exiting
> Oct  8 23:31:43 devrp1 crmd[2350]:   notice: crmd_exit: Forcing immediate
> exit: Link has been severed (67)
> Oct  8 23:31:43 devrp1 attrd[2348]:    error:
> attrd_cib_connection_destroy: Connection to the CIB terminated...
> Oct  8 23:31:43 devrp1 lrmd[2347]:  warning: qb_ipcs_event_sendv:
> new_event_notification (2347-2350-6): Bad file descriptor (9)
> Oct  8 23:31:43 devrp1 lrmd[2347]:  warning: send_client_notify:
> Notification of client crmd/94e94935-2221-434d-8a6f-90eba4ede55b failed
> Oct  8 23:31:43 devrp1 lrmd[2347]:  warning: send_client_notify:
> Notification of client crmd/94e94935-2221-434d-8a6f-90eba4ede55b failed
>
>
> Devrp2
> Oct  8 23:31:26 devrp2 kernel: IN=eth0 OUT=
> MAC=00:50:56:a6:3a:5d:00:00:00:00:00:00:08:00 SRC=10.0.0.0 DST=224.0.0.1
> LEN=36 TOS=0x00 PREC=0x00 TTL=1 ID=0 PROTO=2
> Oct  8 23:31:42 devrp2 crmd[2361]:   notice: cman_event_callback:
> Membership 428: quorum lost
> Oct  8 23:31:43 devrp2 fenced[1747]: telling cman to remove nodeid 1 from
> cluster
> Oct  8 23:31:45 devrp2 corosync[1685]:   [TOTEM ] A processor failed,
> forming new configuration.
> Oct  8 23:31:45 devrp2 corosync[1685]:   [CMAN  ] quorum lost, blocking
> activity
> Oct  8 23:31:45 devrp2 corosync[1685]:   [QUORUM] This node is within the
> non-primary component and will NOT provide any services.
> Oct  8 23:31:45 devrp2 corosync[1685]:   [QUORUM] Members[1]: 2
> Oct  8 23:31:45 devrp2 corosync[1685]:   [TOTEM ] A processor joined or
> left the membership and a new membership was formed.
> Oct  8 23:31:45 devrp2 corosync[1685]:   [CMAN  ] quorum regained,
> resuming activity
> Oct  8 23:31:45 devrp2 corosync[1685]:   [QUORUM] This node is within the
> primary component and will provide service.
> Oct  8 23:31:45 devrp2 corosync[1685]:   [QUORUM] Members[2]: 1 2
> Oct  8 23:31:45 devrp2 corosync[1685]:   [QUORUM] Members[2]: 1 2
> Oct  8 23:31:45 devrp2 corosync[1685]:   [CPG   ] chosen downlist: sender
> r(0) ip(10.172.214.52) ; members(old:2 left:1)
> Oct  8 23:31:45 devrp2 corosync[1685]:   [MAIN  ] Completed service
> synchronization, ready to provide service.
> Oct  8 23:31:45 devrp2 fenced[1747]: receive_start 1:3 add node with
> started_count 1
> Oct  8 23:31:45 devrp2 crmd[2361]:   notice: cman_event_callback:
> Membership 428: quorum acquired
> Oct  8 23:31:45 devrp2 crmd[2361]:   notice: crm_update_peer_state:
> cman_event_callback: Node devrp1[1] - state is now lost (was member)
> Oct  8 23:31:45 devrp2 crmd[2361]:  warning: match_down_event: No match
> for shutdown action on devrp1
> Oct  8 23:31:45 devrp2 crmd[2361]:   notice: peer_update_callback:
> Stonith/shutdown of devrp1 not matched
> Oct  8 23:31:49 devrp2 crmd[2361]:   notice: do_state_transition: State
> transition S_IDLE -> S_INTEGRATION [ input=I_NODE_JOIN cause=C_FSA_INTERNAL
> origin=check_join_state ]
> Oct  8 23:31:49 devrp2 crmd[2361]:  warning: match_down_event: No match
> for shutdown action on devrp1
> Oct  8 23:31:49 devrp2 crmd[2361]:   notice: peer_update_callback:
> Stonith/shutdown of devrp1 not matched
> Oct  8 23:31:49 devrp2 crmd[2361]:   notice: do_election_count_vote:
> Election 2 (current: 2, owner: devrp1): Processed vote from devrp1 (Peer is
> not part of our cluster)
> Oct  8 23:31:49 devrp2 pengine[2360]:   notice: unpack_config: On loss of
> CCM Quorum: Ignore
> Oct  8 23:31:49 devrp2 kernel: dlm: closing connection to node 1
> Oct  8 23:31:49 devrp2 attrd[2359]:   notice: attrd_local_callback:
> Sending full refresh (origin=crmd)
> Oct  8 23:31:49 devrp2 attrd[2359]:   notice: attrd_trigger_update:
> Sending flush op to all hosts for: last-failure-ybrpstat (1412658023)
> Oct  8 23:31:49 devrp2 dlm_controld[1773]:
> /sys/kernel/config/dlm/cluster/comms/1: rmdir failed: 2
> Oct  8 23:31:49 devrp2 attrd[2359]:   notice: attrd_trigger_update:
> Sending flush op to all hosts for: probe_complete (true)
> Oct  8 23:31:49 devrp2 pengine[2360]:   notice: LogActions: Start
>  ybrpip#011(devrp2)
> Oct  8 23:31:49 devrp2 pengine[2360]:   notice: process_pe_message:
> Calculated Transition 1061: /var/lib/pacemaker/pengine/pe-input-2282.bz2
> Oct  8 23:31:50 devrp2 pengine[2360]:   notice: unpack_config: On loss of
> CCM Quorum: Ignore
> Oct  8 23:31:50 devrp2 pengine[2360]:   notice: LogActions: Start
>  ybrpip#011(devrp2)
> Oct  8 23:31:50 devrp2 pengine[2360]:   notice: process_pe_message:
> Calculated Transition 1062: /var/lib/pacemaker/pengine/pe-input-2283.bz2
> Oct  8 23:31:50 devrp2 crmd[2361]:   notice: te_rsc_command: Initiating
> action 5: start ybrpip_start_0 on devrp2 (local)
> Oct  8 23:31:51 devrp2 IPaddr2(ybrpip)[30442]: INFO: Adding inet address
> 10.172.214.50/24 with broadcast address 10.172.214.255 to device eth0
> Oct  8 23:31:51 devrp2 IPaddr2(ybrpip)[30442]: INFO: Bringing device eth0
> up
> Oct  8 23:31:51 devrp2 IPaddr2(ybrpip)[30442]: INFO:
> /usr/libexec/heartbeat/send_arp -i 200 -r 5 -p
> /var/run/resource-agents/send_arp-10.172.214.50 eth0 10.172.214.50 auto
> not_used not_used
> Oct  8 23:31:51 devrp2 crmd[2361]:   notice: process_lrm_event: LRM
> operation ybrpip_start_0 (call=995, rc=0, cib-update=1494, confirmed=true)
> ok
> Oct  8 23:31:51 devrp2 crmd[2361]:   notice: te_rsc_command: Initiating
> action 6: monitor ybrpip_monitor_5000 on devrp2 (local)
> Oct  8 23:31:51 devrp2 crmd[2361]:   notice: process_lrm_event: LRM
> operation ybrpip_monitor_5000 (call=998, rc=0, cib-update=1495,
> confirmed=false) ok
> Oct  8 23:31:51 devrp2 crmd[2361]:   notice: run_graph: Transition 1062
> (Complete=2, Pending=0, Fired=0, Skipped=0, Incomplete=0,
> Source=/var/lib/pacemaker/pengine/pe-input-2283.bz2): Complete
> Oct  8 23:31:51 devrp2 crmd[2361]:   notice: do_state_transition: State
> transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS
> cause=C_FSA_INTERNAL origin=notify_crmd ]
> Oct  8 23:31:52 devrp2 ntpd[2099]: Listen normally on 6 eth0 10.172.214.50
> UDP 123
> Oct  8 23:31:52 devrp2 ntpd[2099]: peers refreshed
> Oct  8 23:31:52 devrp2 corosync[1685]:   [TOTEM ] A processor failed,
> forming new configuration.
> Oct  8 23:31:54 devrp2 corosync[1685]:   [CMAN  ] quorum lost, blocking
> activity
> Oct  8 23:31:54 devrp2 corosync[1685]:   [QUORUM] This node is within the
> non-primary component and will NOT provide any services.
> Oct  8 23:31:54 devrp2 corosync[1685]:   [QUORUM] Members[1]: 2
> Oct  8 23:31:54 devrp2 corosync[1685]:   [TOTEM ] A processor joined or
> left the membership and a new membership was formed.
> Oct  8 23:31:54 devrp2 crmd[2361]:   notice: cman_event_callback:
> Membership 432: quorum lost
> Oct  8 23:31:54 devrp2 corosync[1685]:   [CPG   ] chosen downlist: sender
> r(0) ip(10.172.214.52) ; members(old:2 left:1)
> Oct  8 23:31:54 devrp2 corosync[1685]:   [MAIN  ] Completed service
> synchronization, ready to provide service.
> Oct  8 23:31:54 devrp2 crmd[2361]:  warning: match_down_event: No match
> for shutdown action on devrp1
> Oct  8 23:31:54 devrp2 crmd[2361]:   notice: peer_update_callback:
> Stonith/shutdown of devrp1 not matched
> Oct  8 23:31:54 devrp2 crmd[2361]:   notice: do_state_transition: State
> transition S_IDLE -> S_INTEGRATION [ input=I_NODE_JOIN cause=C_FSA_INTERNAL
> origin=check_join_state ]
> Oct  8 23:31:54 devrp2 attrd[2359]:   notice: attrd_local_callback:
> Sending full refresh (origin=crmd)
> Oct  8 23:31:54 devrp2 attrd[2359]:   notice: attrd_trigger_update:
> Sending flush op to all hosts for: last-failure-ybrpstat (1412658023)
> Oct  8 23:31:55 devrp2 attrd[2359]:   notice: attrd_trigger_update:
> Sending flush op to all hosts for: probe_complete (true)
> Oct  8 23:31:55 devrp2 pengine[2360]:   notice: unpack_config: On loss of
> CCM Quorum: Ignore
> Oct  8 23:31:55 devrp2 pengine[2360]:   notice: process_pe_message:
> Calculated Transition 1063: /var/lib/pacemaker/pengine/pe-input-2284.bz2
> Oct  8 23:31:55 devrp2 crmd[2361]:   notice: run_graph: Transition 1063
> (Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0,
> Source=/var/lib/pacemaker/pengine/pe-input-2284.bz2): Complete
> Oct  8 23:31:55 devrp2 crmd[2361]:   notice: do_state_transition: State
> transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS
> cause=C_FSA_INTERNAL origin=notify_crmd ]
> Oct  8 23:31:59 devrp2 kernel: IN=eth0 OUT=
> MAC=00:50:56:a6:3a:5d:00:00:00:00:00:00:08:00 SRC=10.0.0.0 DST=224.0.0.1
> LEN=36 TOS=0x00 PREC=0x00 TTL=1 ID=0 PROTO=2
> Oct  8 23:36:52 devrp2 kernel: IN=eth0 OUT=
> MAC=00:50:56:a6:3a:5d:00:00:00:00:00:00:08:00 SRC=10.0.0.0 DST=224.0.0.1
> LEN=36 TOS=0x00 PREC=0x00 TTL=1 ID=0 PROTO=2
> Oct  8 23:46:55 devrp2 crmd[2361]:   notice: do_state_transition: State
> transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_TIMER_POPPED
> origin=crm_timer_popped ]
> Oct  8 23:46:55 devrp2 pengine[2360]:   notice: unpack_config: On loss of
> CCM Quorum: Ignore
> Oct  8 23:46:55 devrp2 pengine[2360]:   notice: process_pe_message:
> Calculated Transition 1064: /var/lib/pacemaker/pengine/pe-input-2284.bz2
> Oct  8 23:46:55 devrp2 crmd[2361]:   notice: run_graph: Transition 1064
> (Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0,
> Source=/var/lib/pacemaker/pengine/pe-input-2284.bz2): Complete
> Oct  8 23:46:55 devrp2 crmd[2361]:   notice: do_state_transition: State
> transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS
> cause=C_FSA_INTERNAL origin=notify_crmd ]
>
> _______________________________________________
> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://lists.clusterlabs.org/pipermail/pacemaker/attachments/20141009/d8ec3bbf/attachment-0003.html>


More information about the Pacemaker mailing list