[ClusterLabs] Failover caused by internal error?

Sven Moeller smoeller at nichthelfer.de
Fri Nov 25 06:23:55 EST 2016


Hi,

today we encountered a failover on our NFS cluster. The first suspicion was a hardware outage, but that turned out not to be the case. The failing node was fenced (rebooted), and the failover itself went as expected. So far, so good. But while digging through the logs of the failed node I found error messages saying that lrmd was not responding, that crmd could not recover from an internal error, and a generic Pacemaker error (201). See the logs below.

It looks like one of the two corosync rings was flapping, but that alone shouldn't cause this kind of behavior, should it?
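
In case it helps with the diagnosis: the ring state can be inspected on a node with corosync-cfgtool (part of the standard corosync 2.x tooling), and a lingering FAULTY marker can be cleared manually:

# corosync-cfgtool -s    (print the status of each configured ring on the local node)
# corosync-cfgtool -r    (re-enable rings that have been marked FAULTY)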

The cluster is running on openSUSE 13.2 with the following packages installed:

# rpm -qa | grep -Ei "(cluster|pacemaker|coro)"
pacemaker-1.1.12.git20140904.266d5c2-1.5.x86_64
cluster-glue-1.0.12-14.2.1.x86_64
corosync-2.3.4-1.2.x86_64
pacemaker-cts-1.1.12.git20140904.266d5c2-1.5.x86_64
libpacemaker3-1.1.12.git20140904.266d5c2-1.5.x86_64
libcorosync4-2.3.4-1.2.x86_64
pacemaker-cli-1.1.12.git20140904.266d5c2-1.5.x86_64

Storage devices are connected via fibre channel using multipath.
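
For completeness, the state of the individual paths can be checked with the multipath tools, e.g.:

# multipath -ll    (list multipath devices and the status of every path)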

Regards,
Sven

2016-11-25T10:42:49.499255+01:00 nfs2 systemd[1]: Reloading.
2016-11-25T10:42:56.333353+01:00 nfs2 corosync[30260]:   [TOTEM ] Marking ringid 1 interface 10.x.x.x FAULTY
2016-11-25T10:42:57.334657+01:00 nfs2 corosync[30260]:   [TOTEM ] Automatically recovered ring 1
2016-11-25T10:43:39.507268+01:00 nfs2 crmd[7661]:   notice: process_lrm_event: Operation NFS-Server_monitor_30000: unknown error (node=nfs2, call=103, rc=1, cib-update=54, confirmed=false)
2016-11-25T10:43:39.521944+01:00 nfs2 crmd[7661]:    error: crm_ipc_read: Connection to lrmd failed
2016-11-25T10:43:39.524644+01:00 nfs2 crmd[7661]:    error: mainloop_gio_callback: Connection to lrmd[0x1128200] closed (I/O condition=17)
2016-11-25T10:43:39.525093+01:00 nfs2 pacemakerd[30267]:    error: pcmk_child_exit: Child process lrmd (7660) exited: Operation not permitted (1)
2016-11-25T10:43:39.525554+01:00 nfs2 pacemakerd[30267]:   notice: pcmk_process_exit: Respawning failed child process: lrmd
2016-11-25T10:43:39.525956+01:00 nfs2 crmd[7661]:     crit: lrm_connection_destroy: LRM Connection failed
2016-11-25T10:43:39.526383+01:00 nfs2 crmd[7661]:    error: do_log: FSA: Input I_ERROR from lrm_connection_destroy() received in state S_NOT_DC
2016-11-25T10:43:39.526784+01:00 nfs2 crmd[7661]:   notice: do_state_transition: State transition S_NOT_DC -> S_RECOVERY [ input=I_ERROR cause=C_FSA_INTERNAL origin=lrm_connection_destroy ]
2016-11-25T10:43:39.527186+01:00 nfs2 crmd[7661]:  warning: do_recover: Fast-tracking shutdown in response to errors
2016-11-25T10:43:39.527569+01:00 nfs2 crmd[7661]:    error: do_log: FSA: Input I_TERMINATE from do_recover() received in state S_RECOVERY
2016-11-25T10:43:39.527952+01:00 nfs2 crmd[7661]:    error: lrm_state_verify_stopped: 1 resources were active at shutdown.
2016-11-25T10:43:39.528330+01:00 nfs2 crmd[7661]:   notice: do_lrm_control: Disconnected from the LRM
2016-11-25T10:43:39.528732+01:00 nfs2 crmd[7661]:   notice: terminate_cs_connection: Disconnecting from Corosync
2016-11-25T10:43:39.547847+01:00 nfs2 lrmd[29607]:   notice: crm_add_logfile: Additional logging available in /var/log/pacemaker.log
2016-11-25T10:43:39.637693+01:00 nfs2 corosync[30260]:   [TOTEM ] Retransmit List: 7c0
2016-11-25T10:43:39.638403+01:00 nfs2 corosync[30260]:   [TOTEM ] Retransmit List: 7c0
2016-11-25T10:43:39.641012+01:00 nfs2 crmd[7661]:    error: crmd_fast_exit: Could not recover from internal error
2016-11-25T10:43:39.649180+01:00 nfs2 corosync[30260]:   [TOTEM ] Retransmit List: 7c4
2016-11-25T10:43:39.649926+01:00 nfs2 corosync[30260]:   [TOTEM ] Retransmit List: 7c4
2016-11-25T10:43:39.651809+01:00 nfs2 corosync[30260]:   [TOTEM ] Retransmit List: 7c9
2016-11-25T10:43:39.652751+01:00 nfs2 corosync[30260]:   [TOTEM ] Retransmit List: 7c9
2016-11-25T10:43:39.659130+01:00 nfs2 pacemakerd[30267]:    error: pcmk_child_exit: Child process crmd (7661) exited: Generic Pacemaker error (201)
2016-11-25T10:43:39.660663+01:00 nfs2 pacemakerd[30267]:   notice: pcmk_process_exit: Respawning failed child process: crmd
2016-11-25T10:43:39.661114+01:00 nfs2 corosync[30260]:   [TOTEM ] Retransmit List: 7ca
2016-11-25T10:43:39.662825+01:00 nfs2 corosync[30260]:   [TOTEM ] Retransmit List: 7cb
2016-11-25T10:43:39.672065+01:00 nfs2 crmd[29609]:   notice: crm_add_logfile: Additional logging available in /var/log/pacemaker.log
2016-11-25T10:43:39.673427+01:00 nfs2 crmd[29609]:   notice: main: CRM Git Version: 1.1.12.git20140904.266d5c2
2016-11-25T10:43:39.684597+01:00 nfs2 crmd[29609]:   notice: crm_cluster_connect: Connecting to cluster infrastructure: corosync
2016-11-25T10:43:39.703718+01:00 nfs2 crmd[29609]:   notice: get_node_name: Could not obtain a node name for corosync nodeid 168230914
2016-11-25T10:43:39.713944+01:00 nfs2 crmd[29609]:   notice: get_node_name: Defaulting to uname -n for the local corosync node name
2016-11-25T10:43:39.724509+01:00 nfs2 stonithd[30270]:   notice: can_fence_host_with_device: fence_myself can fence (reboot) nfs2: static-list
2016-11-25T10:43:39.725039+01:00 nfs2 stonithd[30270]:   notice: can_fence_host_with_device: fence_ilo_nfs2 can fence (reboot) nfs2: static-list
2016-11-25T10:43:39.736032+01:00 nfs2 crmd[29609]:   notice: cluster_connect_quorum: Quorum acquired
2016-11-25T10:43:39.755308+01:00 nfs2 corosync[30260]:   [TOTEM ] Retransmit List: 7d1
2016-11-25T10:43:39.760087+01:00 nfs2 corosync[30260]:   [TOTEM ] Retransmit List: 7d1
2016-11-25T10:43:39.774955+01:00 nfs2 stonithd[30270]:   notice: unpack_config: On loss of CCM Quorum: Ignore
2016-11-25T10:43:39.775593+01:00 nfs2 crmd[29609]:   notice: get_node_name: Could not obtain a node name for corosync nodeid 168230913
2016-11-25T10:43:39.787102+01:00 nfs2 corosync[30260]:   [TOTEM ] Retransmit List: 7d7
2016-11-25T10:43:39.798237+01:00 nfs2 crmd[29609]:   notice: get_node_name: Could not obtain a node name for corosync nodeid 168230913
2016-11-25T10:43:39.798733+01:00 nfs2 crmd[29609]:   notice: crm_update_peer_state: pcmk_quorum_notification: Node (null)[168230913] - state is now member (was (null))
2016-11-25T10:43:39.799152+01:00 nfs2 crmd[29609]:   notice: crm_update_peer_state: pcmk_quorum_notification: Node nfs2[168230914] - state is now member (was (null))
2016-11-25T10:43:39.808466+01:00 nfs2 crmd[29609]:   notice: get_node_name: Defaulting to uname -n for the local corosync node name
2016-11-25T10:43:39.808965+01:00 nfs2 crmd[29609]:   notice: do_started: The local CRM is operational
2016-11-25T10:43:39.809379+01:00 nfs2 crmd[29609]:   notice: do_state_transition: State transition S_STARTING -> S_PENDING [ input=I_PENDING cause=C_FSA_INTERNAL origin=do_started ]
2016-11-25T10:43:39.812583+01:00 nfs2 corosync[30260]:   [TOTEM ] Retransmit List: 7d8
2016-11-25T10:43:40.818555+01:00 nfs2 crmd[29609]:   notice: get_node_name: Could not obtain a node name for corosync nodeid 168230913
2016-11-25T10:43:41.999078+01:00 nfs2 corosync[30260]:   [TOTEM ] Retransmit List: 7dd
2016-11-25T10:43:41.999866+01:00 nfs2 corosync[30260]:   [TOTEM ] Retransmit List: 7dd
2016-11-25T10:43:44.796487+01:00 nfs2 corosync[30260]:   [TOTEM ] Retransmit List: 7e4
2016-11-25T10:43:44.797387+01:00 nfs2 corosync[30260]:   [TOTEM ] Retransmit List: 7e4
2016-11-25T10:44:09.899591+01:00 nfs2 corosync[30260]:   [TOTEM ] Marking ringid 1 interface 10.x.x.x FAULTY
2016-11-25T10:44:09.920643+01:00 nfs2 crmd[29609]:   notice: do_state_transition: State transition S_PENDING -> S_NOT_DC [ input=I_NOT_DC cause=C_HA_MESSAGE origin=do_cl_join_finalize_respond ]
2016-11-25T10:44:10.900436+01:00 nfs2 corosync[30260]:   [TOTEM ] Automatically recovered ring 1
2016-11-25T10:44:14.965750+01:00 nfs2 corosync[30260]:   [TOTEM ] Retransmit List: 82c
2016-11-25T10:44:14.967052+01:00 nfs2 corosync[30260]:   [TOTEM ] Retransmit List: 82c
2016-11-25T10:44:40.821197+01:00 nfs2 stonithd[30270]:   notice: can_fence_host_with_device: fence_myself can fence (reboot) nfs2: static-list
2016-11-25T10:44:40.823821+01:00 nfs2 stonithd[30270]:   notice: can_fence_host_with_device: fence_ilo_nfs2 can fence (reboot) nfs2: static-list




