[ClusterLabs] node is always offline

Ken Gaillot kgaillot at redhat.com
Tue Aug 16 15:07:20 UTC 2016


On 08/15/2016 08:35 PM, 刘明 wrote:
> Hi all,
> I am using pacemaker/corosync and iSCSI to build a highly available server.
> At first it worked well, but two days ago an error appeared.
> 
> When one node is started, it always shows as offline.
> 
> Last updated: Mon Aug 15 17:31:54 2016
> Last change: Mon Aug 15 16:34:30 2016 via crmd on node0
> Current DC: NONE
> 1 Nodes configured
> 0 Resources configured
> 
> Node node0 (1): UNCLEAN (offline)
> 
> In the log /var/log/message:
> Aug 15 09:25:04 node0 kernel: connection1:0: detected conn error (1020)
> Aug 15 09:25:04 node0 iscsid: Kernel reported iSCSI connection 1:0 error
> (1020 - ISCSI_ERR_TCP_CONN_CLOSE: TCP connection closed) state (3)
> Aug 15 09:25:07 node0 iscsid: connection1:0 is operational after
> recovery (1 attempts)
> Aug 15 09:25:09 node0 kernel: connection1:0: detected conn error (1020)
> Aug 15 09:25:10 node0 iscsid: Kernel reported iSCSI connection 1:0 error
> (1020 - ISCSI_ERR_TCP_CONN_CLOSE: TCP connection closed) state (3)
> Aug 15 09:25:12 node0 iscsid: connection1:0 is operational after
> recovery (1 attempts)
> Aug 15 09:25:15 node0 kernel: connection1:0: detected conn error (1020)
> Aug 15 09:25:15 node0 iscsid: Kernel reported iSCSI connection 1:0 error
> (1020 - ISCSI_ERR_TCP_CONN_CLOSE: TCP connection closed) state (3)
> Aug 15 09:25:18 node0 iscsid: connection1:0 is operational after
> recovery (1 attempts)
> Aug 15 09:25:20 node0 kernel: connection1:0: detected conn error (1020)
> Aug 15 09:25:20 node0 iscsid: Kernel reported iSCSI connection 1:0 error
> (1020 - ISCSI_ERR_TCP_CONN_CLOSE: TCP connection closed) state (3)
> Aug 15 09:25:23 node0 iscsid: connection1:0 is operational after
> recovery (1 attempts)

I'm not very familiar with iSCSI, but I believe the usual setup is to
have the cluster manage iSCSI as cluster resources. In that case there
should be no iSCSI activity at boot time, yet the iSCSI errors above,
combined with 0 resources configured in the cluster, suggest that there is.

See the ocf:heartbeat:iSCSILogicalUnit and iSCSITarget resource agents.
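
For example (untested, and the IQN, device path, and resource names
below are just placeholders for whatever your setup uses), something
along these lines with pcs would put a target and LUN under cluster
control:

    pcs resource create iscsi-target ocf:heartbeat:iSCSITarget \
        iqn="iqn.2016-08.com.example:storage" \
        op monitor interval=30s
    pcs resource create iscsi-lun0 ocf:heartbeat:iSCSILogicalUnit \
        target_iqn="iqn.2016-08.com.example:storage" lun=0 \
        path="/dev/vg_cluster/lv_iscsi" \
        op monitor interval=30s
    # keep the LUN on the same node as its target, and start it afterwards
    pcs constraint colocation add iscsi-lun0 with iscsi-target INFINITY
    pcs constraint order iscsi-target then iscsi-lun0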

> 
> That looks like an iSCSI error. Then I stopped iscsi and restarted
> corosync, but the node is still offline as before, and the log is as
> follows:
> 
> Aug 15 17:32:04 node0 crmd[7208]: notice: lrm_state_verify_stopped:
> Stopped 0 recurring operations at shutdown (0 ops remaining)
> Aug 15 17:32:04 node0 crmd[7208]: notice: do_lrm_control: Disconnected
> from the LRM
> Aug 15 17:32:04 node0 crmd[7208]: notice: terminate_cs_connection:
> Disconnecting from Corosync

Did you restart corosync while pacemaker was still running? Pacemaker
can't recover from that, so I'm guessing that's why you're seeing these
messages.
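
If that's what happened, restart the whole stack rather than corosync by
itself, e.g. with pcs (adjust if you use different tooling):

    pcs cluster stop     # stops pacemaker first, then corosync, on this node
    pcs cluster start    # starts corosync first, then pacemaker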

My suggestion would be to stop the cluster, get iSCSI working without
any errors at boot, then disable it at boot and add the iSCSI resources
to the cluster.
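
Roughly, assuming a systemd-based system with pcs (service and package
names may differ on your distribution), that sequence might look like:

    pcs cluster stop                  # take pacemaker/corosync down on this node
    # ...fix iSCSI so it comes up cleanly, and verify with a reboot...
    systemctl disable iscsi iscsid    # keep iSCSI from starting at boot
    pcs cluster start
    # then create the iSCSI resources, e.g. with the pcs commands above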

> Aug 15 17:32:04 node0 crmd[7208]: error: crmd_fast_exit: Could not
> recover from internal error
> Aug 15 17:32:04 node0 pacemakerd[7100]: error: pcmk_child_exit: Child
> process crmd (7208) exited: Generic Pacemaker error (201)
> Aug 15 17:32:04 node0 pacemakerd[7100]: notice: pcmk_process_exit:
> Respawning failed child process: crmd
> Aug 15 17:32:04 node0 crmd[7209]: notice: crm_add_logfile: Additional
> logging available in /var/log/pacemaker.log
> Aug 15 17:32:04 node0 crmd[7209]: notice: main: CRM Git Version: 368c726
> Aug 15 17:32:05 node0 crmd[7209]: notice: crm_cluster_connect:
> Connecting to cluster infrastructure: corosync
> Aug 15 17:32:05 node0 crmd[7209]: notice: cluster_connect_quorum: Quorum
> acquired
> Aug 15 17:32:05 node0 crmd[7209]: notice: crm_update_peer_state:
> pcmk_quorum_notification: Node node0[1] - state is now member (was (null))
> Aug 15 17:32:05 node0 crmd[7209]: notice: crm_update_peer_state:
> pcmk_quorum_notification: Node node0[1] - state is now lost (was member)
> Aug 15 17:32:05 node0 crmd[7209]: error: reap_dead_nodes: We're not part
> of the cluster anymore
> Aug 15 17:32:05 node0 crmd[7209]: error: do_log: FSA: Input I_ERROR from
> reap_dead_nodes() received in state S_STARTING
> Aug 15 17:32:05 node0 crmd[7209]: notice: do_state_transition: State
> transition S_STARTING -> S_RECOVERY [ input=I_ERROR cause=C_FSA_INTERNAL origin=reap_dead_nodes ]
> Aug 15 17:32:05 node0 crmd[7209]: warning: do_recover: Fast-tracking
> shutdown in response to errors
> Aug 15 17:32:05 node0 crmd[7209]: error: do_started: Start cancelled...
> S_RECOVERY
> Aug 15 17:32:05 node0 crmd[7209]: error: do_log: FSA: Input I_TERMINATE
> from do_recover() received in state S_RECOVERY
> Aug 15 17:32:05 node0 crmd[7209]: notice: lrm_state_verify_stopped:
> Stopped 0 recurring operations at shutdown (0 ops remaining)
> Aug 15 17:32:05 node0 crmd[7209]: notice: do_lrm_control: Disconnected
> from the LRM
> Aug 15 17:32:05 node0 crmd[7209]: notice: terminate_cs_connection:
> Disconnecting from Corosync
> Aug 15 17:32:05 node0 crmd[7209]: error: crmd_fast_exit: Could not
> recover from internal error
> Aug 15 17:32:05 node0 pacemakerd[7100]: error: pcmk_child_exit: Child
> process crmd (7209) exited: Generic Pacemaker error (201)
> Aug 15 17:32:05 node0 pacemakerd[7100]: error: pcmk_process_exit: Child
> respawn count exceeded by crmd



