[ClusterLabs] corosync service not automatically started
Václav Mach
machv at cesnet.cz
Tue Oct 10 04:35:17 EDT 2017
Hello,
I'm working on a two-node cluster. The nodes are r1nren (r1) and r2nren
(r2). There are some resources configured at the moment, but I don't
think they are relevant to this problem.
Both nodes are virtual servers running on VMware, and both run Debian
Stretch; I'm using Corosync and Pacemaker for the cluster.
The complete list of versions used is below:
root at r2nren:~# uname -a
Linux r2nren.et.cesnet.cz 4.9.0-3-amd64 #1 SMP Debian 4.9.30-2+deb9u3 (2017-08-06) x86_64 GNU/Linux
root at r2nren:~# dpkg -l | grep corosync
ii  corosync                   2.4.2-3   amd64  cluster engine daemon and utilities
ii  libcorosync-common4:amd64  2.4.2-3   amd64  cluster engine common library
root at r2nren:~# dpkg -l | grep pacemaker
ii  crmsh                      2.3.2-4   all    CRM shell for the pacemaker cluster manager
ii  pacemaker                  1.1.16-1  amd64  cluster resource manager
ii  pacemaker-cli-utils        1.1.16-1  amd64  cluster resource manager command line utilities
ii  pacemaker-common           1.1.16-1  all    cluster resource manager common files
ii  pacemaker-resource-agents  1.1.16-1  all    cluster resource manager general resource agents
root at r1nren:~# uname -a
Linux r1nren.et.cesnet.cz 4.9.0-3-amd64 #1 SMP Debian 4.9.30-2+deb9u3 (2017-08-06) x86_64 GNU/Linux
root at r1nren:~# dpkg -l | grep corosync
ii  corosync                   2.4.2-3   amd64  cluster engine daemon and utilities
ii  libcorosync-common4:amd64  2.4.2-3   amd64  cluster engine common library
root at r1nren:~# dpkg -l | grep pacemaker
ii  crmsh                      2.3.2-4   all    CRM shell for the pacemaker cluster manager
ii  pacemaker                  1.1.16-1  amd64  cluster resource manager
ii  pacemaker-cli-utils        1.1.16-1  amd64  cluster resource manager command line utilities
ii  pacemaker-common           1.1.16-1  all    cluster resource manager common files
ii  pacemaker-resource-agents  1.1.16-1  all    cluster resource manager general resource agents
When the cluster is operating normally, the state is:
root at r2nren:~# crm status
Stack: corosync
Current DC: r1nren.et.cesnet.cz (version 1.1.16-94ff4df) - partition with quorum
Last updated: Tue Oct 10 10:12:22 2017
Last change: Mon Oct 9 13:09:59 2017 by root via crm_attribute on r1nren.et.cesnet.cz
2 nodes configured
8 resources configured
Online: [ r1nren.et.cesnet.cz r2nren.et.cesnet.cz ]
Full list of resources:
 Clone Set: clone_ping_gw [ping_gw]
     Started: [ r1nren.et.cesnet.cz r2nren.et.cesnet.cz ]
 Resource Group: group_eduroam.cz
     standby_ip    (ocf::heartbeat:IPaddr2):    Started r1nren.et.cesnet.cz
     offline_file  (systemd:offline_file):      Started r1nren.et.cesnet.cz
     racoon        (systemd:racoon):            Started r1nren.et.cesnet.cz
     radiator      (systemd:radiator):          Started r1nren.et.cesnet.cz
     eduroam_ping  (systemd:eduroam_ping):      Started r1nren.et.cesnet.cz
     mailto        (ocf::heartbeat:MailTo):     Started r1nren.et.cesnet.cz
I've discovered that if I reboot either node, whether with the 'reboot'
command from a terminal or from the VMware web interface, everything
works fine: the rebooting node leaves the cluster and rejoins it again.
The problem appears when I shut the machine down from the VMware web
interface (a guest OS shutdown, not a forced power-off) and then start
it again. The machine is unable to join the cluster; Pacemaker and
Corosync are not running. Pacemaker reports that it failed on a
dependency, which is obviously Corosync.
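As far as I can tell, that dependency comes from the packaged unit
files; if I'm reading systemd right, something like this should confirm
that pacemaker.service pulls in corosync.service (the exact output of
course depends on the Debian units):

root at r1nren:~# systemctl list-dependencies pacemaker.service | grep corosync
  ├─corosync.service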
Corosync reports:
root at r1nren:~# crm status
ERROR: status: crm_mon (rc=107): Connection to cluster failed: Transport endpoint is not connected
root at r1nren:~# service corosync status
● corosync.service - Corosync Cluster Engine
   Loaded: loaded (/lib/systemd/system/corosync.service; enabled; vendor preset: enabled)
   Active: failed (Result: signal) since Tue 2017-10-10 10:27:10 CEST; 1min 10s ago
     Docs: man:corosync
           man:corosync.conf
           man:corosync_overview
  Process: 709 ExecStart=/usr/sbin/corosync -f $COROSYNC_OPTIONS (code=killed, signal=ABRT)
 Main PID: 709 (code=killed, signal=ABRT)

Oct 10 10:27:05 r1nren.et.cesnet.cz corosync[709]: [QB ] Denied connection, is not ready (709-1337-18)
Oct 10 10:27:06 r1nren.et.cesnet.cz corosync[709]: [QB ] Denied connection, is not ready (709-1337-18)
Oct 10 10:27:07 r1nren.et.cesnet.cz corosync[709]: [QB ] Denied connection, is not ready (709-1337-18)
Oct 10 10:27:08 r1nren.et.cesnet.cz corosync[709]: [QB ] Denied connection, is not ready (709-1337-18)
Oct 10 10:27:09 r1nren.et.cesnet.cz corosync[709]: [QB ] Denied connection, is not ready (709-1337-18)
Oct 10 10:27:10 r1nren.et.cesnet.cz corosync[709]: corosync: votequorum.c:2065: message_handler_req_exec_votequorum_nodeinfo: Assertion `sender_node != NULL' failed.
Oct 10 10:27:10 r1nren.et.cesnet.cz systemd[1]: corosync.service: Main process exited, code=killed, status=6/ABRT
Oct 10 10:27:10 r1nren.et.cesnet.cz systemd[1]: Failed to start Corosync Cluster Engine.
Oct 10 10:27:10 r1nren.et.cesnet.cz systemd[1]: corosync.service: Unit entered failed state.
Oct 10 10:27:10 r1nren.et.cesnet.cz systemd[1]: corosync.service: Failed with result 'signal'.
root at r1nren:~# journalctl -u corosync
Oct 10 10:26:58 r1nren.et.cesnet.cz systemd[1]: Starting Corosync Cluster Engine...
Oct 10 10:26:58 r1nren.et.cesnet.cz corosync[709]: [MAIN ] Corosync Cluster Engine ('2.4.2'): started and ready to provide service.
Oct 10 10:26:58 r1nren.et.cesnet.cz corosync[709]: [MAIN ] Corosync built-in features: dbus rdma monitoring watchdog augeas systemd upstart xmlconf qdevices qnetd snm
Oct 10 10:26:58 r1nren.et.cesnet.cz corosync[709]: [TOTEM ] Initializing transport (UDP/IP Unicast).
Oct 10 10:26:58 r1nren.et.cesnet.cz corosync[709]: [TOTEM ] Initializing transmit/receive security (NSS) crypto: aes256 hash: sha256
Oct 10 10:26:59 r1nren.et.cesnet.cz corosync[709]: [TOTEM ] The network interface is down.
Oct 10 10:26:59 r1nren.et.cesnet.cz corosync[709]: [SERV ] Service engine loaded: corosync configuration map access [0]
Oct 10 10:26:59 r1nren.et.cesnet.cz corosync[709]: [QB ] server name: cmap
Oct 10 10:26:59 r1nren.et.cesnet.cz corosync[709]: [SERV ] Service engine loaded: corosync configuration service [1]
Oct 10 10:26:59 r1nren.et.cesnet.cz corosync[709]: [QB ] server name: cfg
Oct 10 10:26:59 r1nren.et.cesnet.cz corosync[709]: [SERV ] Service engine loaded: corosync cluster closed process group service v1.01 [2]
Oct 10 10:26:59 r1nren.et.cesnet.cz corosync[709]: [QB ] server name: cpg
Oct 10 10:26:59 r1nren.et.cesnet.cz corosync[709]: [SERV ] Service engine loaded: corosync profile loading service [4]
Oct 10 10:26:59 r1nren.et.cesnet.cz corosync[709]: [SERV ] Service engine loaded: corosync resource monitoring service [6]
Oct 10 10:26:59 r1nren.et.cesnet.cz corosync[709]: [WD ] No Watchdog /dev/watchdog, try modprobe <a watchdog>
Oct 10 10:26:59 r1nren.et.cesnet.cz corosync[709]: [WD ] resource load_15min missing a recovery key.
Oct 10 10:26:59 r1nren.et.cesnet.cz corosync[709]: [WD ] resource memory_used missing a recovery key.
Oct 10 10:26:59 r1nren.et.cesnet.cz corosync[709]: [WD ] no resources configured.
Oct 10 10:26:59 r1nren.et.cesnet.cz corosync[709]: [SERV ] Service engine loaded: corosync watchdog service [7]
Oct 10 10:26:59 r1nren.et.cesnet.cz corosync[709]: [QUORUM] Using quorum provider corosync_votequorum
Oct 10 10:26:59 r1nren.et.cesnet.cz corosync[709]: [VOTEQ ] Waiting for all cluster members. Current votes: 1 expected_votes: 2
Oct 10 10:26:59 r1nren.et.cesnet.cz corosync[709]: [SERV ] Service engine loaded: corosync vote quorum service v1.0 [5]
Oct 10 10:26:59 r1nren.et.cesnet.cz corosync[709]: [QB ] server name: votequorum
Oct 10 10:26:59 r1nren.et.cesnet.cz corosync[709]: [SERV ] Service engine loaded: corosync cluster quorum service v0.1 [3]
Oct 10 10:26:59 r1nren.et.cesnet.cz corosync[709]: [QB ] server name: quorum
Oct 10 10:26:59 r1nren.et.cesnet.cz corosync[709]: [TOTEM ] adding new UDPU member {78.128.211.51}
Oct 10 10:26:59 r1nren.et.cesnet.cz corosync[709]: [TOTEM ] adding new UDPU member {78.128.211.52}
Oct 10 10:27:00 r1nren.et.cesnet.cz corosync[709]: [QB ] Denied connection, is not ready (709-1337-16)
Oct 10 10:27:01 r1nren.et.cesnet.cz corosync[709]: [QB ] Denied connection, is not ready (709-1337-16)
Oct 10 10:27:02 r1nren.et.cesnet.cz corosync[709]: [QB ] Denied connection, is not ready (709-1337-16)
Oct 10 10:27:03 r1nren.et.cesnet.cz corosync[709]: [TOTEM ] The network interface [78.128.211.51] is now up.
Oct 10 10:27:03 r1nren.et.cesnet.cz corosync[709]: [TOTEM ] adding new UDPU member {78.128.211.51}
Oct 10 10:27:03 r1nren.et.cesnet.cz corosync[709]: [TOTEM ] adding new UDPU member {78.128.211.52}
Oct 10 10:27:03 r1nren.et.cesnet.cz corosync[709]: [QB ] Denied connection, is not ready (709-1337-18)
Oct 10 10:27:04 r1nren.et.cesnet.cz corosync[709]: [QB ] Denied connection, is not ready (709-1337-18)
Oct 10 10:27:05 r1nren.et.cesnet.cz corosync[709]: [QB ] Denied connection, is not ready (709-1337-18)
Oct 10 10:27:06 r1nren.et.cesnet.cz corosync[709]: [QB ] Denied connection, is not ready (709-1337-18)
Oct 10 10:27:07 r1nren.et.cesnet.cz corosync[709]: [QB ] Denied connection, is not ready (709-1337-18)
Oct 10 10:27:08 r1nren.et.cesnet.cz corosync[709]: [QB ] Denied connection, is not ready (709-1337-18)
Oct 10 10:27:09 r1nren.et.cesnet.cz corosync[709]: [QB ] Denied connection, is not ready (709-1337-18)
Oct 10 10:27:10 r1nren.et.cesnet.cz corosync[709]: corosync: votequorum.c:2065: message_handler_req_exec_votequorum_nodeinfo: Assertion `sender_node != NULL' failed.
Oct 10 10:27:10 r1nren.et.cesnet.cz systemd[1]: corosync.service: Main process exited, code=killed, status=6/ABRT
Oct 10 10:27:10 r1nren.et.cesnet.cz systemd[1]: Failed to start Corosync Cluster Engine.
Oct 10 10:27:10 r1nren.et.cesnet.cz systemd[1]: corosync.service: Unit entered failed state.
Oct 10 10:27:10 r1nren.et.cesnet.cz systemd[1]: corosync.service: Failed with result 'signal'.
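What strikes me in the journal is the "[TOTEM ] The network interface
is down." line: after a cold start corosync apparently comes up before
the interface has its address, and then aborts in votequorum shortly
after the interface appears. If that is really the cause, I wonder
whether a systemd drop-in that delays corosync until the network is
online would work around it. A minimal sketch of what I have in mind
(untested; it assumes network-online.target actually works on these
machines):

root at r1nren:~# mkdir -p /etc/systemd/system/corosync.service.d
root at r1nren:~# cat > /etc/systemd/system/corosync.service.d/network-online.conf <<'EOF'
[Unit]
Wants=network-online.target
After=network-online.target
EOF
root at r1nren:~# systemctl daemon-reload

I haven't tried this yet, since I'd first like to understand why a
plain reboot behaves differently.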
Corosync configuration:
root at r1nren:~# cat /etc/corosync/corosync.conf
totem {
        version: 2
        transport: udpu
        cluster_name: eduroam.cz
        token: 3000
        token_retransmits_before_loss_const: 10
        clear_node_high_bit: yes
        crypto_cipher: aes256
        crypto_hash: sha256
        interface {
                ringnumber: 0
                bindnetaddr: 78.128.211.51
                ttl: 1
        }
}

logging {
        fileline: off
        to_stderr: no
        to_logfile: no
        to_syslog: yes
        syslog_facility: daemon
        debug: off
        timestamp: on
        logger_subsys {
                subsys: QUORUM
                debug: off
        }
}

quorum {
        provider: corosync_votequorum
        expected_votes: 2
        two_node: 1
}

nodelist {
        node {
                ring0_addr: 78.128.211.51
        }
        node {
                ring0_addr: 78.128.211.52
        }
}
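One thing I'm not sure about: the nodelist entries have no explicit
nodeid, so (if I understand the corosync.conf man page correctly) the
node IDs are derived from the ring0 IPv4 addresses. If explicit IDs
would make debugging easier, I assume the nodelist would look like
this (the numbers are just illustrative):

nodelist {
        node {
                ring0_addr: 78.128.211.51
                nodeid: 1
        }
        node {
                ring0_addr: 78.128.211.52
                nodeid: 2
        }
}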
Let me know if I can provide any more information about this (are there
any corosync logs?).
View from r2:
root at r2nren:~# crm status
Stack: corosync
Current DC: r2nren.et.cesnet.cz (version 1.1.16-94ff4df) - partition with quorum
Last updated: Tue Oct 10 10:29:45 2017
Last change: Tue Oct 10 10:25:32 2017 by root via crm_attribute on r1nren.et.cesnet.cz
2 nodes configured
8 resources configured
Online: [ r2nren.et.cesnet.cz ]
OFFLINE: [ r1nren.et.cesnet.cz ]
Full list of resources:
 Clone Set: clone_ping_gw [ping_gw]
     Started: [ r2nren.et.cesnet.cz ]
     Stopped: [ r1nren.et.cesnet.cz ]
 Resource Group: group_eduroam.cz
     standby_ip    (ocf::heartbeat:IPaddr2):    Started r2nren.et.cesnet.cz
     offline_file  (systemd:offline_file):      Started r2nren.et.cesnet.cz
     racoon        (systemd:racoon):            Started r2nren.et.cesnet.cz
     radiator      (systemd:radiator):          Started r2nren.et.cesnet.cz
     eduroam_ping  (systemd:eduroam_ping):      Started r2nren.et.cesnet.cz
     mailto        (ocf::heartbeat:MailTo):     Started r2nren.et.cesnet.cz
What could be the cause of the problem I'm encountering?
Thanks for your help.
Regards,
Vaclav
--
Václav Mach
CESNET, z.s.p.o.
www.cesnet.cz