[Pacemaker] More Diagnosis help

Alex Samad - Yieldbroker Alex.Samad at yieldbroker.com
Fri Oct 31 16:07:31 EDT 2014


Hi

I had another node die.


Everything is looking good. I am guessing corosync tried to talk to the other node and that failed.

Nov  1 00:08:48 demorp2 ntpd[2461]: peers refreshed
Nov  1 00:08:51 demorp2 corosync[2039]:   [TOTEM ] A processor joined or left the membership and a new membership was formed.
Nov  1 00:08:51 demorp2 corosync[2039]:   [CPG   ] chosen downlist: sender r(0) ip(10.172.218.52) ; members(old:1 left:0)
Nov  1 00:08:51 demorp2 corosync[2039]:   [MAIN  ] Completed service synchronization, ready to provide service.
Nov  1 00:09:05 demorp2 corosync[2039]:   [TOTEM ] A processor joined or left the membership and a new membership was formed.
Nov  1 00:09:05 demorp2 corosync[2039]:   [CMAN  ] quorum regained, resuming activity
Nov  1 00:09:05 demorp2 corosync[2039]:   [QUORUM] This node is within the primary component and will provide service.
Nov  1 00:09:05 demorp2 corosync[2039]:   [QUORUM] Members[2]: 1 2
Nov  1 00:09:05 demorp2 corosync[2039]:   [QUORUM] Members[2]: 1 2
Nov  1 00:09:05 demorp2 crmd[2725]:   notice: cman_event_callback: Membership 320: quorum acquired
Nov  1 00:09:05 demorp2 crmd[2725]:   notice: crm_update_peer_state: cman_event_callback: Node demorp1[1] - state is now member (was lost)
Nov  1 00:09:05 demorp2 corosync[2039]:   [CPG   ] chosen downlist: sender r(0) ip(10.172.218.52) ; members(old:1 left:0)
Nov  1 00:09:05 demorp2 corosync[2039]:   [MAIN  ] Completed service synchronization, ready to provide service.
Nov  1 00:09:05 demorp2 crmd[2725]:   notice: do_state_transition: State transition S_IDLE -> S_INTEGRATION [ input=I_NODE_JOIN cause=C_FSA_INTERNAL origin=peer_update_callback ]
Nov  1 00:09:06 demorp2 corosync[2039]: cman killed by node 1 because we were killed by cman_tool or other application
Nov  1 00:09:06 demorp2 attrd[2723]:    error: pcmk_cpg_dispatch: Connection to the CPG API failed: Library error (2)
Nov  1 00:09:06 demorp2 attrd[2723]:     crit: attrd_cs_destroy: Lost connection to Corosync service!
Nov  1 00:09:06 demorp2 attrd[2723]:   notice: main: Exiting...
Nov  1 00:09:06 demorp2 attrd[2723]:   notice: main: Disconnecting client 0xdc3020, pid=2725...
Nov  1 00:09:06 demorp2 pacemakerd[2712]:    error: pcmk_cpg_dispatch: Connection to the CPG API failed: Library error (2)
Nov  1 00:09:06 demorp2 pacemakerd[2712]:    error: mcp_cpg_destroy: Connection destroyed
Nov  1 00:09:06 demorp2 stonith-ng[2721]:    error: pcmk_cpg_dispatch: Connection to the CPG API failed: Library error (2)
Nov  1 00:09:06 demorp2 crmd[2725]:    error: pcmk_cpg_dispatch: Connection to the CPG API failed: Library error (2)
Nov  1 00:09:06 demorp2 crmd[2725]:    error: crmd_cs_destroy: connection terminated
Nov  1 00:09:06 demorp2 gfs_controld[2173]: cluster is down, exiting
Nov  1 00:09:06 demorp2 gfs_controld[2173]: daemon cpg_dispatch error 2
Nov  1 00:09:06 demorp2 attrd[2723]:    error: attrd_cib_connection_destroy: Connection to the CIB terminated...
Nov  1 00:09:06 demorp2 fenced[2098]: cluster is down, exiting
Nov  1 00:09:06 demorp2 fenced[2098]: daemon cpg_dispatch error 2
Nov  1 00:09:06 demorp2 dlm_controld[2124]: cluster is down, exiting
Nov  1 00:09:06 demorp2 dlm_controld[2124]: daemon cpg_dispatch error 2
Nov  1 00:09:06 demorp2 stonith-ng[2721]:    error: stonith_peer_cs_destroy: Corosync connection terminated
Nov  1 00:09:06 demorp2 cib[2720]:  warning: qb_ipcs_event_sendv: new_event_notification (2720-2721-11): Broken pipe (32)
Nov  1 00:09:06 demorp2 cib[2720]:  warning: cib_notify_send_one: Notification of client crmd/4c1076bf-8a95-4f77-b866-e1bbf5e2ceda failed
Nov  1 00:09:06 demorp2 cib[2720]:    error: pcmk_cpg_dispatch: Connection to the CPG API failed: Library error (2)
Nov  1 00:09:06 demorp2 cib[2720]:    error: cib_cs_destroy: Corosync connection lost!  Exiting.
Nov  1 00:09:06 demorp2 crmd[2725]:   notice: crmd_exit: Forcing immediate exit: Link has been severed (67)
Nov  1 00:09:06 demorp2 lrmd[2722]:  warning: qb_ipcs_event_sendv: new_event_notification (2722-2725-6): Bad file descriptor (9)
Nov  1 00:09:06 demorp2 lrmd[2722]:  warning: send_client_notify: Notification of client crmd/3598d3e2-600a-4f15-aae2-e087437d6213 failed
Nov  1 00:09:06 demorp2 lrmd[2722]:  warning: send_client_notify: Notification of client crmd/3598d3e2-600a-4f15-aae2-e087437d6213 failed
Nov  1 00:09:08 demorp2 kernel: dlm: closing connection to node 1


The other node

It looks to me like VMware took too long to give this VM a time slice, and corosync responded by killing one node.


Nov  1 00:08:50 demorp1 lrmd[2433]:  warning: child_timeout_callback: ybrpstat_monitor_5000 process (PID 32026) timed out
Nov  1 00:08:50 demorp1 lrmd[2433]:  warning: operation_finished: ybrpstat_monitor_5000:32026 - timed out after 20000ms
Nov  1 00:08:51 demorp1 crmd[2436]:    error: process_lrm_event: LRM operation ybrpstat_monitor_5000 (17) Timed Out (timeout=20000ms)
Nov  1 00:08:52 demorp1 crmd[2436]:   notice: process_lrm_event: demorp1-ybrpstat_monitor_5000:17 [ Service running for 18 hours 8 minutes 30 seconds.\n ]
Nov  1 00:08:53 demorp1 lrmd[2433]:  warning: child_timeout_callback: ybrpip_monitor_5000 process (PID 32033) timed out
Nov  1 00:08:53 demorp1 lrmd[2433]:  warning: operation_finished: ybrpip_monitor_5000:32033 - timed out after 20000ms
Nov  1 00:08:53 demorp1 crmd[2436]:    error: process_lrm_event: LRM operation ybrpip_monitor_5000 (22) Timed Out (timeout=20000ms)
Nov  1 00:09:05 demorp1 corosync[1748]:   [MAIN  ] Corosync main process was not scheduled for 16241.7002 ms (threshold is 8000.0000 ms). Consider token timeout increase.
Nov  1 00:09:05 demorp1 corosync[1748]:   [TOTEM ] A processor failed, forming new configuration.
Nov  1 00:09:05 demorp1 corosync[1748]:   [TOTEM ] Process pause detected for 15555 ms, flushing membership messages.
Nov  1 00:09:05 demorp1 corosync[1748]:   [MAIN  ] Corosync main process was not scheduled for 15555.0029 ms (threshold is 8000.0000 ms). Consider token timeout increase.
Nov  1 00:09:05 demorp1 corosync[1748]:   [CMAN  ] quorum lost, blocking activity
Nov  1 00:09:05 demorp1 corosync[1748]:   [QUORUM] This node is within the non-primary component and will NOT provide any services.
Nov  1 00:09:05 demorp1 corosync[1748]:   [QUORUM] Members[1]: 1
Nov  1 00:09:05 demorp1 corosync[1748]:   [TOTEM ] A processor joined or left the membership and a new membership was formed.
Nov  1 00:09:05 demorp1 corosync[1748]:   [CMAN  ] quorum regained, resuming activity
Nov  1 00:09:05 demorp1 corosync[1748]:   [QUORUM] This node is within the primary component and will provide service.
Nov  1 00:09:05 demorp1 corosync[1748]:   [QUORUM] Members[2]: 1 2
Nov  1 00:09:05 demorp1 corosync[1748]:   [QUORUM] Members[2]: 1 2
Nov  1 00:09:05 demorp1 corosync[1748]:   [CPG   ] chosen downlist: sender r(0) ip(10.172.218.51) ; members(old:2 left:1)
Nov  1 00:09:05 demorp1 corosync[1748]:   [MAIN  ] Completed service synchronization, ready to provide service.
Nov  1 00:09:05 demorp1 crmd[2436]:   notice: process_lrm_event: LRM operation ybrpip_monitor_5000 (call=22, rc=0, cib-update=17, confirmed=false) ok
Nov  1 00:09:05 demorp1 crmd[2436]:   notice: peer_update_callback: Our peer on the DC is dead
Nov  1 00:09:05 demorp1 crmd[2436]:   notice: cman_event_callback: Membership 320: quorum lost
Nov  1 00:09:05 demorp1 crmd[2436]:   notice: cman_event_callback: Membership 320: quorum acquired
Nov  1 00:09:05 demorp1 crmd[2436]:   notice: do_state_transition: State transition S_NOT_DC -> S_ELECTION [ input=I_ELECTION cause=C_CRMD_STATUS_CALLBACK origin=peer_update_callback ]
Nov  1 00:09:05 demorp1 crmd[2436]:   notice: process_lrm_event: LRM operation ybrpstat_monitor_5000 (call=17, rc=0, cib-update=18, confirmed=false) ok
Nov  1 00:09:06 demorp1 crmd[2436]:  warning: do_log: FSA: Input I_JOIN_OFFER from route_message() received in state S_ELECTION
Nov  1 00:09:06 demorp1 crmd[2436]:   notice: do_state_transition: State transition S_ELECTION -> S_PENDING [ input=I_PENDING cause=C_FSA_INTERNAL origin=do_election_count_vote ]
Nov  1 00:09:06 demorp1 fenced[1822]: telling cman to remove nodeid 2 from cluster
Nov  1 00:09:06 demorp1 fenced[1822]: receive_start 2:3 add node with started_count 1
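
To check how often and for how long the VM is being starved, the "was not scheduled for" warnings can be pulled out of the log. This is only a rough sketch: the function name find_pauses and the regular expression are my own, written against the two corosync lines above, and the message wording may differ in other corosync versions.

```python
import re

# Matches corosync's scheduling-pause warning as it appears in the log above, e.g.
# "Corosync main process was not scheduled for 16241.7002 ms (threshold is 8000.0000 ms)"
# The pattern is an assumption based on those lines only.
PAUSE_RE = re.compile(
    r"not scheduled for (?P<paused>[\d.]+) ms \(threshold is (?P<threshold>[\d.]+) ms\)"
)


def find_pauses(lines):
    """Return a list of (paused_ms, threshold_ms) tuples, one per warning found."""
    pauses = []
    for line in lines:
        match = PAUSE_RE.search(line)
        if match:
            pauses.append(
                (float(match.group("paused")), float(match.group("threshold")))
            )
    return pauses


# Sample input: three lines copied from the demorp1 log above.
sample = [
    "Nov  1 00:09:05 demorp1 corosync[1748]:   [MAIN  ] Corosync main process was not scheduled for 16241.7002 ms (threshold is 8000.0000 ms). Consider token timeout increase.",
    "Nov  1 00:09:05 demorp1 corosync[1748]:   [TOTEM ] A processor failed, forming new configuration.",
    "Nov  1 00:09:05 demorp1 corosync[1748]:   [MAIN  ] Corosync main process was not scheduled for 15555.0029 ms (threshold is 8000.0000 ms). Consider token timeout increase.",
]

for paused, threshold in find_pauses(sample):
    # Show each pause and how far it exceeded the threshold corosync reported.
    print(f"paused {paused:.0f} ms, {paused / threshold:.1f}x the threshold")
```

To run it over a whole log, pass an open file handle for the syslog file to find_pauses in place of sample. Both pauses here are roughly double the 8000 ms threshold that corosync reports.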






