[ClusterLabs] node is always offline

Li Junliang lijunliang_dna at aliyun.com
Tue Aug 16 03:45:21 UTC 2016


Maybe you should attach the Pacemaker log file (/var/log/pacemaker.log).
BTW, is your network running well?
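For example, something like this (a minimal sketch; replace <peer-address>
with the address of the other node on your cluster interconnect):

    # Show the status of the Corosync rings on this node
    corosync-cfgtool -s

    # Show quorum and membership as Corosync sees it
    corosync-quorumtool -s

    # Check basic reachability of the peer over the interconnect
    ping -c 3 <peer-address>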

On Tue, 2016-08-16 at 09:35 +0800, Liu Ming wrote:
> Hi all,
> I am using Pacemaker/Corosync and iSCSI to build a highly available
> server.
> At the beginning it worked very well, but two days ago some errors
> appeared.
> 
> When one node is started, it is always offline.
> 
> Last updated: Mon Aug 15 17:31:54 2016
> Last change: Mon Aug 15 16:34:30 2016 via crmd on node0
> Current DC: NONE
> 1 Nodes configured
> 0 Resources configured
> 
> Node node0 (1): UNCLEAN (offline)
> 
> In the log /var/log/messages:
> Aug 15 09:25:04 node0 kernel: connection1:0: detected conn error (1020)
> Aug 15 09:25:04 node0 iscsid: Kernel reported iSCSI connection 1:0 error (1020 - ISCSI_ERR_TCP_CONN_CLOSE: TCP connection closed) state (3)
> Aug 15 09:25:07 node0 iscsid: connection1:0 is operational after recovery (1 attempts)
> Aug 15 09:25:09 node0 kernel: connection1:0: detected conn error (1020)
> Aug 15 09:25:10 node0 iscsid: Kernel reported iSCSI connection 1:0 error (1020 - ISCSI_ERR_TCP_CONN_CLOSE: TCP connection closed) state (3)
> Aug 15 09:25:12 node0 iscsid: connection1:0 is operational after recovery (1 attempts)
> Aug 15 09:25:15 node0 kernel: connection1:0: detected conn error (1020)
> Aug 15 09:25:15 node0 iscsid: Kernel reported iSCSI connection 1:0 error (1020 - ISCSI_ERR_TCP_CONN_CLOSE: TCP connection closed) state (3)
> Aug 15 09:25:18 node0 iscsid: connection1:0 is operational after recovery (1 attempts)
> Aug 15 09:25:20 node0 kernel: connection1:0: detected conn error (1020)
> Aug 15 09:25:20 node0 iscsid: Kernel reported iSCSI connection 1:0 error (1020 - ISCSI_ERR_TCP_CONN_CLOSE: TCP connection closed) state (3)
> Aug 15 09:25:23 node0 iscsid: connection1:0 is operational after recovery (1 attempts)
> 
> That looks like an iSCSI error. Then I stopped iSCSI and restarted
> Corosync, but the node is still offline as before, and the log is as
> follows:
> 
> Aug 15 17:32:04 node0 crmd[7208]: notice: lrm_state_verify_stopped:
> Stopped 0 recurring operations at shutdown (0 ops remaining)
> Aug 15 17:32:04 node0 crmd[7208]: notice: do_lrm_control:
> Disconnected from the LRM
> Aug 15 17:32:04 node0 crmd[7208]: notice: terminate_cs_connection:
> Disconnecting from Corosync
> Aug 15 17:32:04 node0 crmd[7208]: error: crmd_fast_exit: Could not
> recover from internal error
> Aug 15 17:32:04 node0 pacemakerd[7100]: error: pcmk_child_exit: Child
> process crmd (7208) exited: Generic Pacemaker error (201)
> Aug 15 17:32:04 node0 pacemakerd[7100]: notice: pcmk_process_exit:
> Respawning failed child process: crmd
> Aug 15 17:32:04 node0 crmd[7209]: notice: crm_add_logfile: Additional
> logging available in /var/log/pacemaker.log
> Aug 15 17:32:04 node0 crmd[7209]: notice: main: CRM Git Version:
> 368c726
> Aug 15 17:32:05 node0 crmd[7209]: notice: crm_cluster_connect:
> Connecting to cluster infrastructure: corosync
> Aug 15 17:32:05 node0 crmd[7209]: notice: cluster_connect_quorum:
> Quorum acquired
> Aug 15 17:32:05 node0 crmd[7209]: notice: crm_update_peer_state:
> pcmk_quorum_notification: Node node0[1] - state is now member (was (null))
> Aug 15 17:32:05 node0 crmd[7209]: notice: crm_update_peer_state:
> pcmk_quorum_notification: Node node0[1] - state is now lost (was member)
> Aug 15 17:32:05 node0 crmd[7209]: error: reap_dead_nodes: We're not
> part of the cluster anymore
> Aug 15 17:32:05 node0 crmd[7209]: error: do_log: FSA: Input I_ERROR
> from reap_dead_nodes() received in state S_STARTING
> Aug 15 17:32:05 node0 crmd[7209]: notice: do_state_transition: State
> transition S_STARTING -> S_RECOVERY [ input=I_ERROR cause=C_FSA_INTERNAL origin=reap_dead_nodes ]
> Aug 15 17:32:05 node0 crmd[7209]: warning: do_recover: Fast-tracking
> shutdown in response to errors
> Aug 15 17:32:05 node0 crmd[7209]: error: do_started: Start
> cancelled... S_RECOVERY
> Aug 15 17:32:05 node0 crmd[7209]: error: do_log: FSA: Input
> I_TERMINATE from do_recover() received in state S_RECOVERY
> Aug 15 17:32:05 node0 crmd[7209]: notice: lrm_state_verify_stopped:
> Stopped 0 recurring operations at shutdown (0 ops remaining)
> Aug 15 17:32:05 node0 crmd[7209]: notice: do_lrm_control:
> Disconnected from the LRM
> Aug 15 17:32:05 node0 crmd[7209]: notice: terminate_cs_connection:
> Disconnecting from Corosync
> Aug 15 17:32:05 node0 crmd[7209]: error: crmd_fast_exit: Could not
> recover from internal error
> Aug 15 17:32:05 node0 pacemakerd[7100]: error: pcmk_child_exit: Child
> process crmd (7209) exited: Generic Pacemaker error (201)
> Aug 15 17:32:05 node0 pacemakerd[7100]: error: pcmk_process_exit:
> Child respawn count exceeded by crmd
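The "reap_dead_nodes: We're not part of the cluster anymore" error means
crmd saw its own node join the Corosync membership and then drop out of it
again immediately (member -> lost in your log), so Corosync membership is
the place to look; the repeated ISCSI_ERR_TCP_CONN_CLOSE resets point in
the same direction, i.e. an unstable network. Something like the following
might help narrow it down (a sketch assuming a Corosync 2.x setup; paths
and key names may differ on your distribution):

    # Dump the runtime membership list that Corosync currently holds
    corosync-cmapctl | grep members

    # Compare it against the configured nodelist
    grep -A 10 'nodelist' /etc/corosync/corosync.conf

    # Once Pacemaker is running, take a one-shot snapshot of cluster status
    crm_mon -1

A node ID or node name in corosync.conf that no longer matches what the
running Corosync reports is one common cause of this pattern and is worth
ruling out first.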



