[ClusterLabs] Antw: Failover caused by internal error?
Ulrich Windl
Ulrich.Windl at rz.uni-regensburg.de
Fri Nov 25 11:53:49 UTC 2016
Hi!
We see these retransmit lists with multipath as well. I suspect that multicast communication is broken; since moving to udpu, at least those messages are gone. In your case I suspect it was high network load that triggered the problems.
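In case it helps: switching corosync from multicast to unicast is only a change of the transport in the totem section. Below is a minimal sketch of /etc/corosync/corosync.conf for a two-node, two-ring setup; the addresses and node IDs are made up for illustration, adjust them to your networks:

totem {
        version: 2
        transport: udpu          # UDP unicast instead of multicast
        rrp_mode: passive        # keep the second (redundant) ring
        interface {
                ringnumber: 0
                bindnetaddr: 192.168.1.0
        }
        interface {
                ringnumber: 1
                bindnetaddr: 10.0.0.0
        }
}

nodelist {
        node {
                nodeid: 1
                ring0_addr: 192.168.1.11
                ring1_addr: 10.0.0.11
        }
        node {
                nodeid: 2
                ring0_addr: 192.168.1.12
                ring1_addr: 10.0.0.12
        }
}

After restarting corosync on both nodes you can watch the ring status with "corosync-cfgtool -s"; a ring that stays FAULTY and does not recover usually points to a real network problem rather than just load.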
Ulrich
>>> "Sven Moeller" <smoeller at nichthelfer.de> schrieb am 25.11.2016 um 12:23 in
Nachricht <2b1-58381f80-b-377afb00 at 69899261>:
> Hi,
>
> today we encountered a failover on our NFS cluster. Our first suspicion was a
> hardware outage, but it was not. The failing node has been fenced (rebooted) and the
> failover went as expected. So far so good. But while digging through the logs of the
> failed node I found error messages saying that lrmd was not responding, that crmd
> could not recover from an internal error, and a generic Pacemaker error (201). See
> the logs below.
>
> It seems that one of the two corosync rings was flapping. But that shouldn't be
> the cause of such behavior, should it?
>
> The cluster is running on openSUSE 13.2; the following packages are installed:
>
> # rpm -qa | grep -Ei "(cluster|pacemaker|coro)"
> pacemaker-1.1.12.git20140904.266d5c2-1.5.x86_64
> cluster-glue-1.0.12-14.2.1.x86_64
> corosync-2.3.4-1.2.x86_64
> pacemaker-cts-1.1.12.git20140904.266d5c2-1.5.x86_64
> libpacemaker3-1.1.12.git20140904.266d5c2-1.5.x86_64
> libcorosync4-2.3.4-1.2.x86_64
> pacemaker-cli-1.1.12.git20140904.266d5c2-1.5.x86_64
>
> Storage devices are connected via fibre channel using multipath.
>
> Regards,
> Sven
>
> 2016-11-25T10:42:49.499255+01:00 nfs2 systemd[1]: Reloading.
> 2016-11-25T10:42:56.333353+01:00 nfs2 corosync[30260]: [TOTEM ] Marking
> ringid 1 interface 10.x.x.x FAULTY
> 2016-11-25T10:42:57.334657+01:00 nfs2 corosync[30260]: [TOTEM ]
> Automatically recovered ring 1
> 2016-11-25T10:43:39.507268+01:00 nfs2 crmd[7661]: notice: process_lrm_event:
> Operation NFS-Server_monitor_30000: unknown error (node=nfs2, call=103, rc=1,
> cib-update=54, confirmed=false)
> 2016-11-25T10:43:39.521944+01:00 nfs2 crmd[7661]: error: crm_ipc_read:
> Connection to lrmd failed
> 2016-11-25T10:43:39.524644+01:00 nfs2 crmd[7661]: error:
> mainloop_gio_callback: Connection to lrmd[0x1128200] closed (I/O
> condition=17)
> 2016-11-25T10:43:39.525093+01:00 nfs2 pacemakerd[30267]: error:
> pcmk_child_exit: Child process lrmd (7660) exited: Operation not permitted
> (1)
> 2016-11-25T10:43:39.525554+01:00 nfs2 pacemakerd[30267]: notice:
> pcmk_process_exit: Respawning failed child process: lrmd
> 2016-11-25T10:43:39.525956+01:00 nfs2 crmd[7661]: crit:
> lrm_connection_destroy: LRM Connection failed
> 2016-11-25T10:43:39.526383+01:00 nfs2 crmd[7661]: error: do_log: FSA: Input
> I_ERROR from lrm_connection_destroy() received in state S_NOT_DC
> 2016-11-25T10:43:39.526784+01:00 nfs2 crmd[7661]: notice:
> do_state_transition: State transition S_NOT_DC -> S_RECOVERY [ input=I_ERROR
> cause=C_FSA_INTERNAL origin=lrm_connection_destroy ]
> 2016-11-25T10:43:39.527186+01:00 nfs2 crmd[7661]: warning: do_recover:
> Fast-tracking shutdown in response to errors
> 2016-11-25T10:43:39.527569+01:00 nfs2 crmd[7661]: error: do_log: FSA: Input
> I_TERMINATE from do_recover() received in state S_RECOVERY
> 2016-11-25T10:43:39.527952+01:00 nfs2 crmd[7661]: error:
> lrm_state_verify_stopped: 1 resources were active at shutdown.
> 2016-11-25T10:43:39.528330+01:00 nfs2 crmd[7661]: notice: do_lrm_control:
> Disconnected from the LRM
> 2016-11-25T10:43:39.528732+01:00 nfs2 crmd[7661]: notice:
> terminate_cs_connection: Disconnecting from Corosync
> 2016-11-25T10:43:39.547847+01:00 nfs2 lrmd[29607]: notice: crm_add_logfile:
> Additional logging available in /var/log/pacemaker.log
> 2016-11-25T10:43:39.637693+01:00 nfs2 corosync[30260]: [TOTEM ] Retransmit
> List: 7c0
> 2016-11-25T10:43:39.638403+01:00 nfs2 corosync[30260]: [TOTEM ] Retransmit
> List: 7c0
> 2016-11-25T10:43:39.641012+01:00 nfs2 crmd[7661]: error: crmd_fast_exit:
> Could not recover from internal error
> 2016-11-25T10:43:39.649180+01:00 nfs2 corosync[30260]: [TOTEM ] Retransmit
> List: 7c4
> 2016-11-25T10:43:39.649926+01:00 nfs2 corosync[30260]: [TOTEM ] Retransmit
> List: 7c4
> 2016-11-25T10:43:39.651809+01:00 nfs2 corosync[30260]: [TOTEM ] Retransmit
> List: 7c9
> 2016-11-25T10:43:39.652751+01:00 nfs2 corosync[30260]: [TOTEM ] Retransmit
> List: 7c9
> 2016-11-25T10:43:39.659130+01:00 nfs2 pacemakerd[30267]: error:
> pcmk_child_exit: Child process crmd (7661) exited: Generic Pacemaker error
> (201)
> 2016-11-25T10:43:39.660663+01:00 nfs2 pacemakerd[30267]: notice:
> pcmk_process_exit: Respawning failed child process: crmd
> 2016-11-25T10:43:39.661114+01:00 nfs2 corosync[30260]: [TOTEM ] Retransmit
> List: 7ca
> 2016-11-25T10:43:39.662825+01:00 nfs2 corosync[30260]: [TOTEM ] Retransmit
> List: 7cb
> 2016-11-25T10:43:39.672065+01:00 nfs2 crmd[29609]: notice: crm_add_logfile:
> Additional logging available in /var/log/pacemaker.log
> 2016-11-25T10:43:39.673427+01:00 nfs2 crmd[29609]: notice: main: CRM Git
> Version: 1.1.12.git20140904.266d5c2
> 2016-11-25T10:43:39.684597+01:00 nfs2 crmd[29609]: notice:
> crm_cluster_connect: Connecting to cluster infrastructure: corosync
> 2016-11-25T10:43:39.703718+01:00 nfs2 crmd[29609]: notice: get_node_name:
> Could not obtain a node name for corosync nodeid 168230914
> 2016-11-25T10:43:39.713944+01:00 nfs2 crmd[29609]: notice: get_node_name:
> Defaulting to uname -n for the local corosync node name
> 2016-11-25T10:43:39.724509+01:00 nfs2 stonithd[30270]: notice:
> can_fence_host_with_device: fence_myself can fence (reboot) nfs2: static-list
> 2016-11-25T10:43:39.725039+01:00 nfs2 stonithd[30270]: notice:
> can_fence_host_with_device: fence_ilo_nfs2 can fence (reboot) nfs2:
> static-list
> 2016-11-25T10:43:39.736032+01:00 nfs2 crmd[29609]: notice:
> cluster_connect_quorum: Quorum acquired
> 2016-11-25T10:43:39.755308+01:00 nfs2 corosync[30260]: [TOTEM ] Retransmit
> List: 7d1
> 2016-11-25T10:43:39.760087+01:00 nfs2 corosync[30260]: [TOTEM ] Retransmit
> List: 7d1
> 2016-11-25T10:43:39.774955+01:00 nfs2 stonithd[30270]: notice:
> unpack_config: On loss of CCM Quorum: Ignore
> 2016-11-25T10:43:39.775593+01:00 nfs2 crmd[29609]: notice: get_node_name:
> Could not obtain a node name for corosync nodeid 168230913
> 2016-11-25T10:43:39.787102+01:00 nfs2 corosync[30260]: [TOTEM ] Retransmit
> List: 7d7
> 2016-11-25T10:43:39.798237+01:00 nfs2 crmd[29609]: notice: get_node_name:
> Could not obtain a node name for corosync nodeid 168230913
> 2016-11-25T10:43:39.798733+01:00 nfs2 crmd[29609]: notice:
> crm_update_peer_state: pcmk_quorum_notification: Node (null)[168230913] -
> state is now member (was (null))
> 2016-11-25T10:43:39.799152+01:00 nfs2 crmd[29609]: notice:
> crm_update_peer_state: pcmk_quorum_notification: Node nfs2[168230914] - state
> is now member (was (null))
> 2016-11-25T10:43:39.808466+01:00 nfs2 crmd[29609]: notice: get_node_name:
> Defaulting to uname -n for the local corosync node name
> 2016-11-25T10:43:39.808965+01:00 nfs2 crmd[29609]: notice: do_started: The
> local CRM is operational
> 2016-11-25T10:43:39.809379+01:00 nfs2 crmd[29609]: notice:
> do_state_transition: State transition S_STARTING -> S_PENDING [ input=I_PENDING
> cause=C_FSA_INTERNAL origin=do_started ]
> 2016-11-25T10:43:39.812583+01:00 nfs2 corosync[30260]: [TOTEM ] Retransmit
> List: 7d8
> 2016-11-25T10:43:40.818555+01:00 nfs2 crmd[29609]: notice: get_node_name:
> Could not obtain a node name for corosync nodeid 168230913
> 2016-11-25T10:43:41.999078+01:00 nfs2 corosync[30260]: [TOTEM ] Retransmit
> List: 7dd
> 2016-11-25T10:43:41.999866+01:00 nfs2 corosync[30260]: [TOTEM ] Retransmit
> List: 7dd
> 2016-11-25T10:43:44.796487+01:00 nfs2 corosync[30260]: [TOTEM ] Retransmit
> List: 7e4
> 2016-11-25T10:43:44.797387+01:00 nfs2 corosync[30260]: [TOTEM ] Retransmit
> List: 7e4
> 2016-11-25T10:44:09.899591+01:00 nfs2 corosync[30260]: [TOTEM ] Marking
> ringid 1 interface 10.x.x.x FAULTY
> 2016-11-25T10:44:09.920643+01:00 nfs2 crmd[29609]: notice:
> do_state_transition: State transition S_PENDING -> S_NOT_DC [ input=I_NOT_DC
> cause=C_HA_MESSAGE origin=do_cl_join_finalize_respond ]
> 2016-11-25T10:44:10.900436+01:00 nfs2 corosync[30260]: [TOTEM ]
> Automatically recovered ring 1
> 2016-11-25T10:44:14.965750+01:00 nfs2 corosync[30260]: [TOTEM ] Retransmit
> List: 82c
> 2016-11-25T10:44:14.967052+01:00 nfs2 corosync[30260]: [TOTEM ] Retransmit
> List: 82c
> 2016-11-25T10:44:40.821197+01:00 nfs2 stonithd[30270]: notice:
> can_fence_host_with_device: fence_myself can fence (reboot) nfs2: static-list
> 2016-11-25T10:44:40.823821+01:00 nfs2 stonithd[30270]: notice:
> can_fence_host_with_device: fence_ilo_nfs2 can fence (reboot) nfs2:
> static-list
>
>
> _______________________________________________
> Users mailing list: Users at clusterlabs.org
> http://clusterlabs.org/mailman/listinfo/users
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org