[ClusterLabs] node is always offline

刘明 wslium at 126.com
Tue Aug 16 01:35:17 UTC 2016


Hi all,
I am using Pacemaker/Corosync and iSCSI to build a highly available server. Everything worked well at first, but two days ago errors started to appear.

When I start one node, it always shows as offline:

Last updated: Mon Aug 15 17:31:54 2016
Last change: Mon Aug 15 16:34:30 2016 via crmd on node0
Current DC: NONE
1 Nodes configured
0 Resources configured

Node node0 (1): UNCLEAN (offline)
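
The status above is from crm_mon. For completeness, membership and quorum as Corosync sees them can be re-checked with the standard tools, e.g.:

    # One-shot cluster status (same view as above)
    crm_mon -1

    # Quorum state and member list from Corosync's side
    corosync-quorumtool -s

    # Status of each configured ring/interface
    corosync-cfgtool -s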

In the log /var/log/messages I see:
Aug 15 09:25:04 node0 kernel: connection1:0: detected conn error (1020)
Aug 15 09:25:04 node0 iscsid: Kernel reported iSCSI connection 1:0 error (1020 - ISCSI_ERR_TCP_CONN_CLOSE: TCP connection closed) state (3)
Aug 15 09:25:07 node0 iscsid: connection1:0 is operational after recovery (1 attempts)
Aug 15 09:25:09 node0 kernel: connection1:0: detected conn error (1020)
Aug 15 09:25:10 node0 iscsid: Kernel reported iSCSI connection 1:0 error (1020 - ISCSI_ERR_TCP_CONN_CLOSE: TCP connection closed) state (3)
Aug 15 09:25:12 node0 iscsid: connection1:0 is operational after recovery (1 attempts)
Aug 15 09:25:15 node0 kernel: connection1:0: detected conn error (1020)
Aug 15 09:25:15 node0 iscsid: Kernel reported iSCSI connection 1:0 error (1020 - ISCSI_ERR_TCP_CONN_CLOSE: TCP connection closed) state (3)
Aug 15 09:25:18 node0 iscsid: connection1:0 is operational after recovery (1 attempts)
Aug 15 09:25:20 node0 kernel: connection1:0: detected conn error (1020)
Aug 15 09:25:20 node0 iscsid: Kernel reported iSCSI connection 1:0 error (1020 - ISCSI_ERR_TCP_CONN_CLOSE: TCP connection closed) state (3)
Aug 15 09:25:23 node0 iscsid: connection1:0 is operational after recovery (1 attempts)
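
If it helps with diagnosis, the flapping session can be inspected with open-iscsi's usual tooling, e.g.:

    # List active iSCSI sessions and their current state
    iscsiadm -m session

    # Detailed view: connection state, timeouts, attached SCSI devices
    iscsiadm -m session -P 3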

That looks like an iSCSI error. I then stopped iSCSI and restarted Corosync, but the node is still offline as before. Roughly, the restart sequence was the following (a sketch; exact service names depend on the distro and init system):
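
    # Stop the iSCSI initiator so the flapping session is out of the picture
    service iscsi stop

    # Restart the cluster stack; adjust for systemd (systemctl) where applicable
    service pacemaker stop      # assumption: Pacemaker restarted together with Corosync
    service corosync restart
    service pacemaker start

The log after the restart is as follows: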

Aug 15 17:32:04 node0 crmd[7208]: notice: lrm_state_verify_stopped: Stopped 0 recurring operations at shutdown (0 ops remaining)
Aug 15 17:32:04 node0 crmd[7208]: notice: do_lrm_control: Disconnected from the LRM
Aug 15 17:32:04 node0 crmd[7208]: notice: terminate_cs_connection: Disconnecting from Corosync
Aug 15 17:32:04 node0 crmd[7208]: error: crmd_fast_exit: Could not recover from internal error
Aug 15 17:32:04 node0 pacemakerd[7100]: error: pcmk_child_exit: Child process crmd (7208) exited: Generic Pacemaker error (201)
Aug 15 17:32:04 node0 pacemakerd[7100]: notice: pcmk_process_exit: Respawning failed child process: crmd
Aug 15 17:32:04 node0 crmd[7209]: notice: crm_add_logfile: Additional logging available in /var/log/pacemaker.log
Aug 15 17:32:04 node0 crmd[7209]: notice: main: CRM Git Version: 368c726
Aug 15 17:32:05 node0 crmd[7209]: notice: crm_cluster_connect: Connecting to cluster infrastructure: corosync
Aug 15 17:32:05 node0 crmd[7209]: notice: cluster_connect_quorum: Quorum acquired
Aug 15 17:32:05 node0 crmd[7209]: notice: crm_update_peer_state: pcmk_quorum_notification: Node node0[1] - state is now member (was (null))
Aug 15 17:32:05 node0 crmd[7209]: notice: crm_update_peer_state: pcmk_quorum_notification: Node node0[1] - state is now lost (was member)
Aug 15 17:32:05 node0 crmd[7209]: error: reap_dead_nodes: We're not part of the cluster anymore
Aug 15 17:32:05 node0 crmd[7209]: error: do_log: FSA: Input I_ERROR from reap_dead_nodes() received in state S_STARTING
Aug 15 17:32:05 node0 crmd[7209]: notice: do_state_transition: State transition S_STARTING -> S_RECOVERY [ input=I_ERROR cause=C_FSA_INTERNAL origin=reap_dead_nodes ]
Aug 15 17:32:05 node0 crmd[7209]: warning: do_recover: Fast-tracking shutdown in response to errors
Aug 15 17:32:05 node0 crmd[7209]: error: do_started: Start cancelled... S_RECOVERY
Aug 15 17:32:05 node0 crmd[7209]: error: do_log: FSA: Input I_TERMINATE from do_recover() received in state S_RECOVERY
Aug 15 17:32:05 node0 crmd[7209]: notice: lrm_state_verify_stopped: Stopped 0 recurring operations at shutdown (0 ops remaining)
Aug 15 17:32:05 node0 crmd[7209]: notice: do_lrm_control: Disconnected from the LRM
Aug 15 17:32:05 node0 crmd[7209]: notice: terminate_cs_connection: Disconnecting from Corosync
Aug 15 17:32:05 node0 crmd[7209]: error: crmd_fast_exit: Could not recover from internal error
Aug 15 17:32:05 node0 pacemakerd[7100]: error: pcmk_child_exit: Child process crmd (7209) exited: Generic Pacemaker error (201)
Aug 15 17:32:05 node0 pacemakerd[7100]: error: pcmk_process_exit: Child respawn count exceeded by crmd
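
From the messages above, reap_dead_nodes fires because crmd sees node0 itself drop out of the membership ("state is now lost") immediately after joining. One mismatch worth checking (a guess on my part) is whether the node ID and name Corosync assigns match what is recorded in the CIB:

    # Node ID and membership from the live Corosync/Pacemaker view
    corosync-quorumtool -s
    crm_node -l

    # Node entries stored in the CIB, for comparison
    cibadmin -Q -o nodes

Any hints on what would make the node lose its own membership right after joining?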