[ClusterLabs] Re: Failover caused by internal error?
Ulrich Windl
Ulrich.Windl at rz.uni-regensburg.de
Fri Nov 25 12:53:49 CET 2016
Hi!
We see these retransmit lists with multipath as well. I suspect that multicast communication is broken; since moving to udpu, at least those messages are gone. In your case I suspect it was high network load that triggered the problems.
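
For reference, a rough sketch of the corosync.conf changes for udpu might look like the following; the cluster name, bindnetaddr and ring0_addr values below are only placeholders and have to match your own setup:

  totem {
      version: 2
      cluster_name: nfs-cluster        # placeholder name
      transport: udpu                  # unicast UDP instead of multicast
      interface {
          ringnumber: 0
          bindnetaddr: 10.x.x.0        # network address of ring 0 (placeholder)
      }
  }
  nodelist {
      node {
          ring0_addr: 10.x.x.1         # first node (placeholder)
          nodeid: 1
      }
      node {
          ring0_addr: 10.x.x.2         # second node (placeholder)
          nodeid: 2
      }
  }

With transport: udpu corosync sends unicast packets to each address listed in the nodelist, so no multicast routing is involved.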
Ulrich
>>> "Sven Moeller" <smoeller at nichthelfer.de> schrieb am 25.11.2016 um 12:23 in
Nachricht <2b1-58381f80-b-377afb00 at 69899261>:
> Hi,
>
> today we encountered a failover on our NFS cluster. Our first suspicion was a
> hardware outage. It was not. The failing node has been fenced (reboot), and the
> failover went as expected. So far so good. But while digging through the logs of
> the failed node I found error messages saying that lrmd was not responding, that
> crmd could not recover from an internal error, and a generic Pacemaker error
> (201). See the logs below.
>
> It seems that one of the two corosync rings was flapping. But that shouldn't be
> the cause of such behavior, should it?
>
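[Side note: the state of both rings can be checked directly with corosync-cfgtool, and a redundant ring that stays FAULTY can also be re-enabled by hand; both options are standard corosync-cfgtool flags:

  # corosync-cfgtool -s    # print the current status of all configured rings
  # corosync-cfgtool -r    # re-enable redundant rings that are marked FAULTY

Your log already shows "Automatically recovered ring 1", so the ring itself did come back on its own.]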
> The Cluster is running on openSUSE 13.2, following packages are installed:
>
> # rpm -qa | grep -Ei "(cluster|pacemaker|coro)"
> pacemaker-1.1.12.git20140904.266d5c2-1.5.x86_64
> cluster-glue-1.0.12-14.2.1.x86_64
> corosync-2.3.4-1.2.x86_64
> pacemaker-cts-1.1.12.git20140904.266d5c2-1.5.x86_64
> libpacemaker3-1.1.12.git20140904.266d5c2-1.5.x86_64
> libcorosync4-2.3.4-1.2.x86_64
> pacemaker-cli-1.1.12.git20140904.266d5c2-1.5.x86_64
>
> Storage devices are connected via fibre channel using multipath.
>
> Regards,
> Sven
>
> 2016-11-25T10:42:49.499255+01:00 nfs2 systemd[1]: Reloading.
> 2016-11-25T10:42:56.333353+01:00 nfs2 corosync[30260]: [TOTEM ] Marking
> ringid 1 interface 10.x.x.x FAULTY
> 2016-11-25T10:42:57.334657+01:00 nfs2 corosync[30260]: [TOTEM ]
> Automatically recovered ring 1
> 2016-11-25T10:43:39.507268+01:00 nfs2 crmd[7661]: notice: process_lrm_event:
> Operation NFS-Server_monitor_30000: unknown error (node=nfs2, call=103, rc=1,
> cib-update=54, confirmed=false)
> 2016-11-25T10:43:39.521944+01:00 nfs2 crmd[7661]: error: crm_ipc_read:
> Connection to lrmd failed
> 2016-11-25T10:43:39.524644+01:00 nfs2 crmd[7661]: error:
> mainloop_gio_callback: Connection to lrmd[0x1128200] closed (I/O
> condition=17)
> 2016-11-25T10:43:39.525093+01:00 nfs2 pacemakerd[30267]: error:
> pcmk_child_exit: Child process lrmd (7660) exited: Operation not permitted
> (1)
> 2016-11-25T10:43:39.525554+01:00 nfs2 pacemakerd[30267]: notice:
> pcmk_process_exit: Respawning failed child process: lrmd
> 2016-11-25T10:43:39.525956+01:00 nfs2 crmd[7661]: crit:
> lrm_connection_destroy: LRM Connection failed
> 2016-11-25T10:43:39.526383+01:00 nfs2 crmd[7661]: error: do_log: FSA: Input
> I_ERROR from lrm_connection_destroy() received in state S_NOT_DC
> 2016-11-25T10:43:39.526784+01:00 nfs2 crmd[7661]: notice:
> do_state_transition: State transition S_NOT_DC -> S_RECOVERY [ input=I_ERROR
> cause=C_FSA_INTERNAL origin=lrm_connection_destroy ]
> 2016-11-25T10:43:39.527186+01:00 nfs2 crmd[7661]: warning: do_recover:
> Fast-tracking shutdown in response to errors
> 2016-11-25T10:43:39.527569+01:00 nfs2 crmd[7661]: error: do_log: FSA: Input
> I_TERMINATE from do_recover() received in state S_RECOVERY
> 2016-11-25T10:43:39.527952+01:00 nfs2 crmd[7661]: error:
> lrm_state_verify_stopped: 1 resources were active at shutdown.
> 2016-11-25T10:43:39.528330+01:00 nfs2 crmd[7661]: notice: do_lrm_control:
> Disconnected from the LRM
> 2016-11-25T10:43:39.528732+01:00 nfs2 crmd[7661]: notice:
> terminate_cs_connection: Disconnecting from Corosync
> 2016-11-25T10:43:39.547847+01:00 nfs2 lrmd[29607]: notice: crm_add_logfile:
> Additional logging available in /var/log/pacemaker.log
> 2016-11-25T10:43:39.637693+01:00 nfs2 corosync[30260]: [TOTEM ] Retransmit
> List: 7c0
> 2016-11-25T10:43:39.638403+01:00 nfs2 corosync[30260]: [TOTEM ] Retransmit
> List: 7c0
> 2016-11-25T10:43:39.641012+01:00 nfs2 crmd[7661]: error: crmd_fast_exit:
> Could not recover from internal error
> 2016-11-25T10:43:39.649180+01:00 nfs2 corosync[30260]: [TOTEM ] Retransmit
> List: 7c4
> 2016-11-25T10:43:39.649926+01:00 nfs2 corosync[30260]: [TOTEM ] Retransmit
> List: 7c4
> 2016-11-25T10:43:39.651809+01:00 nfs2 corosync[30260]: [TOTEM ] Retransmit
> List: 7c9
> 2016-11-25T10:43:39.652751+01:00 nfs2 corosync[30260]: [TOTEM ] Retransmit
> List: 7c9
> 2016-11-25T10:43:39.659130+01:00 nfs2 pacemakerd[30267]: error:
> pcmk_child_exit: Child process crmd (7661) exited: Generic Pacemaker error
> (201)
> 2016-11-25T10:43:39.660663+01:00 nfs2 pacemakerd[30267]: notice:
> pcmk_process_exit: Respawning failed child process: crmd
> 2016-11-25T10:43:39.661114+01:00 nfs2 corosync[30260]: [TOTEM ] Retransmit
> List: 7ca
> 2016-11-25T10:43:39.662825+01:00 nfs2 corosync[30260]: [TOTEM ] Retransmit
> List: 7cb
> 2016-11-25T10:43:39.672065+01:00 nfs2 crmd[29609]: notice: crm_add_logfile:
> Additional logging available in /var/log/pacemaker.log
> 2016-11-25T10:43:39.673427+01:00 nfs2 crmd[29609]: notice: main: CRM Git
> Version: 1.1.12.git20140904.266d5c2
> 2016-11-25T10:43:39.684597+01:00 nfs2 crmd[29609]: notice:
> crm_cluster_connect: Connecting to cluster infrastructure: corosync
> 2016-11-25T10:43:39.703718+01:00 nfs2 crmd[29609]: notice: get_node_name:
> Could not obtain a node name for corosync nodeid 168230914
> 2016-11-25T10:43:39.713944+01:00 nfs2 crmd[29609]: notice: get_node_name:
> Defaulting to uname -n for the local corosync node name
> 2016-11-25T10:43:39.724509+01:00 nfs2 stonithd[30270]: notice:
> can_fence_host_with_device: fence_myself can fence (reboot) nfs2: static-list
> 2016-11-25T10:43:39.725039+01:00 nfs2 stonithd[30270]: notice:
> can_fence_host_with_device: fence_ilo_nfs2 can fence (reboot) nfs2:
> static-list
> 2016-11-25T10:43:39.736032+01:00 nfs2 crmd[29609]: notice:
> cluster_connect_quorum: Quorum acquired
> 2016-11-25T10:43:39.755308+01:00 nfs2 corosync[30260]: [TOTEM ] Retransmit
> List: 7d1
> 2016-11-25T10:43:39.760087+01:00 nfs2 corosync[30260]: [TOTEM ] Retransmit
> List: 7d1
> 2016-11-25T10:43:39.774955+01:00 nfs2 stonithd[30270]: notice:
> unpack_config: On loss of CCM Quorum: Ignore
> 2016-11-25T10:43:39.775593+01:00 nfs2 crmd[29609]: notice: get_node_name:
> Could not obtain a node name for corosync nodeid 168230913
> 2016-11-25T10:43:39.787102+01:00 nfs2 corosync[30260]: [TOTEM ] Retransmit
> List: 7d7
> 2016-11-25T10:43:39.798237+01:00 nfs2 crmd[29609]: notice: get_node_name:
> Could not obtain a node name for corosync nodeid 168230913
> 2016-11-25T10:43:39.798733+01:00 nfs2 crmd[29609]: notice:
> crm_update_peer_state: pcmk_quorum_notification: Node (null)[168230913] -
> state is now member (was (null))
> 2016-11-25T10:43:39.799152+01:00 nfs2 crmd[29609]: notice:
> crm_update_peer_state: pcmk_quorum_notification: Node nfs2[168230914] - state
> is now member (was (null))
> 2016-11-25T10:43:39.808466+01:00 nfs2 crmd[29609]: notice: get_node_name:
> Defaulting to uname -n for the local corosync node name
> 2016-11-25T10:43:39.808965+01:00 nfs2 crmd[29609]: notice: do_started: The
> local CRM is operational
> 2016-11-25T10:43:39.809379+01:00 nfs2 crmd[29609]: notice:
> do_state_transition: State transition S_STARTING -> S_PENDING [ input=I_PENDING
> cause=C_FSA_INTERNAL origin=do_started ]
> 2016-11-25T10:43:39.812583+01:00 nfs2 corosync[30260]: [TOTEM ] Retransmit
> List: 7d8
> 2016-11-25T10:43:40.818555+01:00 nfs2 crmd[29609]: notice: get_node_name:
> Could not obtain a node name for corosync nodeid 168230913
> 2016-11-25T10:43:41.999078+01:00 nfs2 corosync[30260]: [TOTEM ] Retransmit
> List: 7dd
> 2016-11-25T10:43:41.999866+01:00 nfs2 corosync[30260]: [TOTEM ] Retransmit
> List: 7dd
> 2016-11-25T10:43:44.796487+01:00 nfs2 corosync[30260]: [TOTEM ] Retransmit
> List: 7e4
> 2016-11-25T10:43:44.797387+01:00 nfs2 corosync[30260]: [TOTEM ] Retransmit
> List: 7e4
> 2016-11-25T10:44:09.899591+01:00 nfs2 corosync[30260]: [TOTEM ] Marking
> ringid 1 interface 10.x.x.x FAULTY
> 2016-11-25T10:44:09.920643+01:00 nfs2 crmd[29609]: notice:
> do_state_transition: State transition S_PENDING -> S_NOT_DC [ input=I_NOT_DC
> cause=C_HA_MESSAGE origin=do_cl_join_finalize_respond ]
> 2016-11-25T10:44:10.900436+01:00 nfs2 corosync[30260]: [TOTEM ]
> Automatically recovered ring 1
> 2016-11-25T10:44:14.965750+01:00 nfs2 corosync[30260]: [TOTEM ] Retransmit
> List: 82c
> 2016-11-25T10:44:14.967052+01:00 nfs2 corosync[30260]: [TOTEM ] Retransmit
> List: 82c
> 2016-11-25T10:44:40.821197+01:00 nfs2 stonithd[30270]: notice:
> can_fence_host_with_device: fence_myself can fence (reboot) nfs2: static-list
> 2016-11-25T10:44:40.823821+01:00 nfs2 stonithd[30270]: notice:
> can_fence_host_with_device: fence_ilo_nfs2 can fence (reboot) nfs2:
> static-list
>
>