[ClusterLabs] corosync service not automatically started
Václav Mach
machv at cesnet.cz
Tue Oct 10 04:35:17 EDT 2017
Hello,
I'm working on a two-node cluster. The nodes are r1nren (r1) and r2nren
(r2). There are some resources configured at the moment, but I don't
think they are relevant to this problem.
Both nodes are virtual servers running on VMware, and both run Debian
Stretch; I'm using Corosync and Pacemaker for the cluster.
The complete list of versions used is below:
root at r2nren:~# uname -a
Linux r2nren.et.cesnet.cz 4.9.0-3-amd64 #1 SMP Debian 4.9.30-2+deb9u3 (2017-08-06) x86_64 GNU/Linux
root at r2nren:~# dpkg -l | grep corosync
ii  corosync                   2.4.2-3   amd64  cluster engine daemon and utilities
ii  libcorosync-common4:amd64  2.4.2-3   amd64  cluster engine common library
root at r2nren:~# dpkg -l | grep pacemaker
ii  crmsh                      2.3.2-4   all    CRM shell for the pacemaker cluster manager
ii  pacemaker                  1.1.16-1  amd64  cluster resource manager
ii  pacemaker-cli-utils        1.1.16-1  amd64  cluster resource manager command line utilities
ii  pacemaker-common           1.1.16-1  all    cluster resource manager common files
ii  pacemaker-resource-agents  1.1.16-1  all    cluster resource manager general resource agents
root at r1nren:~# uname -a
Linux r1nren.et.cesnet.cz 4.9.0-3-amd64 #1 SMP Debian 4.9.30-2+deb9u3 (2017-08-06) x86_64 GNU/Linux
root at r1nren:~# dpkg -l | grep corosync
ii  corosync                   2.4.2-3   amd64  cluster engine daemon and utilities
ii  libcorosync-common4:amd64  2.4.2-3   amd64  cluster engine common library
root at r1nren:~# dpkg -l | grep pacemaker
ii  crmsh                      2.3.2-4   all    CRM shell for the pacemaker cluster manager
ii  pacemaker                  1.1.16-1  amd64  cluster resource manager
ii  pacemaker-cli-utils        1.1.16-1  amd64  cluster resource manager command line utilities
ii  pacemaker-common           1.1.16-1  all    cluster resource manager common files
ii  pacemaker-resource-agents  1.1.16-1  all    cluster resource manager general resource agents
When the cluster is operating normally, the state is:
root at r2nren:~# crm status
Stack: corosync
Current DC: r1nren.et.cesnet.cz (version 1.1.16-94ff4df) - partition with quorum
Last updated: Tue Oct 10 10:12:22 2017
Last change: Mon Oct 9 13:09:59 2017 by root via crm_attribute on r1nren.et.cesnet.cz
2 nodes configured
8 resources configured
Online: [ r1nren.et.cesnet.cz r2nren.et.cesnet.cz ]
Full list of resources:
 Clone Set: clone_ping_gw [ping_gw]
     Started: [ r1nren.et.cesnet.cz r2nren.et.cesnet.cz ]
 Resource Group: group_eduroam.cz
     standby_ip    (ocf::heartbeat:IPaddr2):    Started r1nren.et.cesnet.cz
     offline_file  (systemd:offline_file):      Started r1nren.et.cesnet.cz
     racoon        (systemd:racoon):            Started r1nren.et.cesnet.cz
     radiator      (systemd:radiator):          Started r1nren.et.cesnet.cz
     eduroam_ping  (systemd:eduroam_ping):      Started r1nren.et.cesnet.cz
     mailto        (ocf::heartbeat:MailTo):     Started r1nren.et.cesnet.cz
I've discovered that if I reboot either node, whether with the 'reboot'
command from a terminal or from the VMware web interface, everything
works fine: the rebooting node leaves the cluster and rejoins it again.
The problem appears when I shut the machine down from the VMware web
interface (a guest OS shutdown, not a forced power-off) and then start
it again. The machine is unable to join the cluster; Pacemaker and
Corosync are not running. Pacemaker reports that it failed on a
dependency, which is obviously Corosync.
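As far as I can tell, that dependency comes from the packaged unit
files; if I'm reading systemd right, something like this should confirm
that pacemaker.service pulls in corosync.service (the exact output of
course depends on the Debian units):

root at r1nren:~# systemctl list-dependencies pacemaker.service | grep corosync
  ├─corosync.service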
Corosync reports:
root at r1nren:~# crm status
ERROR: status: crm_mon (rc=107): Connection to cluster failed: Transport endpoint is not connected
root at r1nren:~# service corosync status
● corosync.service - Corosync Cluster Engine
   Loaded: loaded (/lib/systemd/system/corosync.service; enabled; vendor preset: enabled)
   Active: failed (Result: signal) since Tue 2017-10-10 10:27:10 CEST; 1min 10s ago
     Docs: man:corosync
           man:corosync.conf
           man:corosync_overview
  Process: 709 ExecStart=/usr/sbin/corosync -f $COROSYNC_OPTIONS (code=killed, signal=ABRT)
 Main PID: 709 (code=killed, signal=ABRT)

Oct 10 10:27:05 r1nren.et.cesnet.cz corosync[709]: [QB ] Denied connection, is not ready (709-1337-18)
Oct 10 10:27:06 r1nren.et.cesnet.cz corosync[709]: [QB ] Denied connection, is not ready (709-1337-18)
Oct 10 10:27:07 r1nren.et.cesnet.cz corosync[709]: [QB ] Denied connection, is not ready (709-1337-18)
Oct 10 10:27:08 r1nren.et.cesnet.cz corosync[709]: [QB ] Denied connection, is not ready (709-1337-18)
Oct 10 10:27:09 r1nren.et.cesnet.cz corosync[709]: [QB ] Denied connection, is not ready (709-1337-18)
Oct 10 10:27:10 r1nren.et.cesnet.cz corosync[709]: corosync: votequorum.c:2065: message_handler_req_exec_votequorum_nodeinfo: Assertion `sender_node != NULL' failed.
Oct 10 10:27:10 r1nren.et.cesnet.cz systemd[1]: corosync.service: Main process exited, code=killed, status=6/ABRT
Oct 10 10:27:10 r1nren.et.cesnet.cz systemd[1]: Failed to start Corosync Cluster Engine.
Oct 10 10:27:10 r1nren.et.cesnet.cz systemd[1]: corosync.service: Unit entered failed state.
Oct 10 10:27:10 r1nren.et.cesnet.cz systemd[1]: corosync.service: Failed with result 'signal'.
root at r1nren:~# journalctl -u corosync
Oct 10 10:26:58 r1nren.et.cesnet.cz systemd[1]: Starting Corosync Cluster Engine...
Oct 10 10:26:58 r1nren.et.cesnet.cz corosync[709]: [MAIN ] Corosync Cluster Engine ('2.4.2'): started and ready to provide service.
Oct 10 10:26:58 r1nren.et.cesnet.cz corosync[709]: [MAIN ] Corosync built-in features: dbus rdma monitoring watchdog augeas systemd upstart xmlconf qdevices qnetd snm
Oct 10 10:26:58 r1nren.et.cesnet.cz corosync[709]: [TOTEM ] Initializing transport (UDP/IP Unicast).
Oct 10 10:26:58 r1nren.et.cesnet.cz corosync[709]: [TOTEM ] Initializing transmit/receive security (NSS) crypto: aes256 hash: sha256
Oct 10 10:26:59 r1nren.et.cesnet.cz corosync[709]: [TOTEM ] The network interface is down.
Oct 10 10:26:59 r1nren.et.cesnet.cz corosync[709]: [SERV ] Service engine loaded: corosync configuration map access [0]
Oct 10 10:26:59 r1nren.et.cesnet.cz corosync[709]: [QB ] server name: cmap
Oct 10 10:26:59 r1nren.et.cesnet.cz corosync[709]: [SERV ] Service engine loaded: corosync configuration service [1]
Oct 10 10:26:59 r1nren.et.cesnet.cz corosync[709]: [QB ] server name: cfg
Oct 10 10:26:59 r1nren.et.cesnet.cz corosync[709]: [SERV ] Service engine loaded: corosync cluster closed process group service v1.01 [2]
Oct 10 10:26:59 r1nren.et.cesnet.cz corosync[709]: [QB ] server name: cpg
Oct 10 10:26:59 r1nren.et.cesnet.cz corosync[709]: [SERV ] Service engine loaded: corosync profile loading service [4]
Oct 10 10:26:59 r1nren.et.cesnet.cz corosync[709]: [SERV ] Service engine loaded: corosync resource monitoring service [6]
Oct 10 10:26:59 r1nren.et.cesnet.cz corosync[709]: [WD ] No Watchdog /dev/watchdog, try modprobe <a watchdog>
Oct 10 10:26:59 r1nren.et.cesnet.cz corosync[709]: [WD ] resource load_15min missing a recovery key.
Oct 10 10:26:59 r1nren.et.cesnet.cz corosync[709]: [WD ] resource memory_used missing a recovery key.
Oct 10 10:26:59 r1nren.et.cesnet.cz corosync[709]: [WD ] no resources configured.
Oct 10 10:26:59 r1nren.et.cesnet.cz corosync[709]: [SERV ] Service engine loaded: corosync watchdog service [7]
Oct 10 10:26:59 r1nren.et.cesnet.cz corosync[709]: [QUORUM] Using quorum provider corosync_votequorum
Oct 10 10:26:59 r1nren.et.cesnet.cz corosync[709]: [VOTEQ ] Waiting for all cluster members. Current votes: 1 expected_votes: 2
Oct 10 10:26:59 r1nren.et.cesnet.cz corosync[709]: [SERV ] Service engine loaded: corosync vote quorum service v1.0 [5]
Oct 10 10:26:59 r1nren.et.cesnet.cz corosync[709]: [QB ] server name: votequorum
Oct 10 10:26:59 r1nren.et.cesnet.cz corosync[709]: [SERV ] Service engine loaded: corosync cluster quorum service v0.1 [3]
Oct 10 10:26:59 r1nren.et.cesnet.cz corosync[709]: [QB ] server name: quorum
Oct 10 10:26:59 r1nren.et.cesnet.cz corosync[709]: [TOTEM ] adding new UDPU member {78.128.211.51}
Oct 10 10:26:59 r1nren.et.cesnet.cz corosync[709]: [TOTEM ] adding new UDPU member {78.128.211.52}
Oct 10 10:27:00 r1nren.et.cesnet.cz corosync[709]: [QB ] Denied connection, is not ready (709-1337-16)
Oct 10 10:27:01 r1nren.et.cesnet.cz corosync[709]: [QB ] Denied connection, is not ready (709-1337-16)
Oct 10 10:27:02 r1nren.et.cesnet.cz corosync[709]: [QB ] Denied connection, is not ready (709-1337-16)
Oct 10 10:27:03 r1nren.et.cesnet.cz corosync[709]: [TOTEM ] The network interface [78.128.211.51] is now up.
Oct 10 10:27:03 r1nren.et.cesnet.cz corosync[709]: [TOTEM ] adding new UDPU member {78.128.211.51}
Oct 10 10:27:03 r1nren.et.cesnet.cz corosync[709]: [TOTEM ] adding new UDPU member {78.128.211.52}
Oct 10 10:27:03 r1nren.et.cesnet.cz corosync[709]: [QB ] Denied connection, is not ready (709-1337-18)
Oct 10 10:27:04 r1nren.et.cesnet.cz corosync[709]: [QB ] Denied connection, is not ready (709-1337-18)
Oct 10 10:27:05 r1nren.et.cesnet.cz corosync[709]: [QB ] Denied connection, is not ready (709-1337-18)
Oct 10 10:27:06 r1nren.et.cesnet.cz corosync[709]: [QB ] Denied connection, is not ready (709-1337-18)
Oct 10 10:27:07 r1nren.et.cesnet.cz corosync[709]: [QB ] Denied connection, is not ready (709-1337-18)
Oct 10 10:27:08 r1nren.et.cesnet.cz corosync[709]: [QB ] Denied connection, is not ready (709-1337-18)
Oct 10 10:27:09 r1nren.et.cesnet.cz corosync[709]: [QB ] Denied connection, is not ready (709-1337-18)
Oct 10 10:27:10 r1nren.et.cesnet.cz corosync[709]: corosync: votequorum.c:2065: message_handler_req_exec_votequorum_nodeinfo: Assertion `sender_node != NULL' failed.
Oct 10 10:27:10 r1nren.et.cesnet.cz systemd[1]: corosync.service: Main process exited, code=killed, status=6/ABRT
Oct 10 10:27:10 r1nren.et.cesnet.cz systemd[1]: Failed to start Corosync Cluster Engine.
Oct 10 10:27:10 r1nren.et.cesnet.cz systemd[1]: corosync.service: Unit entered failed state.
Oct 10 10:27:10 r1nren.et.cesnet.cz systemd[1]: corosync.service: Failed with result 'signal'.
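What strikes me in the journal is the "[TOTEM ] The network interface
is down." line: after a cold start corosync apparently comes up before
the interface has its address, and then aborts in votequorum shortly
after the interface appears. If that is really the cause, I wonder
whether a systemd drop-in that delays corosync until the network is
online would work around it. A minimal sketch of what I have in mind
(untested; it assumes network-online.target actually works on these
machines):

root at r1nren:~# mkdir -p /etc/systemd/system/corosync.service.d
root at r1nren:~# cat > /etc/systemd/system/corosync.service.d/network-online.conf <<'EOF'
[Unit]
Wants=network-online.target
After=network-online.target
EOF
root at r1nren:~# systemctl daemon-reload

I haven't tried this yet, since I'd first like to understand why a
plain reboot behaves differently.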
Corosync configuration:
root at r1nren:~# cat /etc/corosync/corosync.conf
totem {
        version: 2
        transport: udpu
        cluster_name: eduroam.cz
        token: 3000
        token_retransmits_before_loss_const: 10
        clear_node_high_bit: yes
        crypto_cipher: aes256
        crypto_hash: sha256
        interface {
                ringnumber: 0
                bindnetaddr: 78.128.211.51
                ttl: 1
        }
}

logging {
        fileline: off
        to_stderr: no
        to_logfile: no
        to_syslog: yes
        syslog_facility: daemon
        debug: off
        timestamp: on
        logger_subsys {
                subsys: QUORUM
                debug: off
        }
}

quorum {
        provider: corosync_votequorum
        expected_votes: 2
        two_node: 1
}

nodelist {
        node {
                ring0_addr: 78.128.211.51
        }
        node {
                ring0_addr: 78.128.211.52
        }
}
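One thing I'm not sure about: the nodelist entries have no explicit
nodeid, so (if I understand the corosync.conf man page correctly) the
node IDs are derived from the ring0 IPv4 addresses. If explicit IDs
would make debugging easier, I assume the nodelist would look like
this (the numbers are just illustrative):

nodelist {
        node {
                ring0_addr: 78.128.211.51
                nodeid: 1
        }
        node {
                ring0_addr: 78.128.211.52
                nodeid: 2
        }
}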
Let me know if I can provide any more information about this (are there
any corosync logs?).
View from r2:
root at r2nren:~# crm status
Stack: corosync
Current DC: r2nren.et.cesnet.cz (version 1.1.16-94ff4df) - partition with quorum
Last updated: Tue Oct 10 10:29:45 2017
Last change: Tue Oct 10 10:25:32 2017 by root via crm_attribute on r1nren.et.cesnet.cz
2 nodes configured
8 resources configured
Online: [ r2nren.et.cesnet.cz ]
OFFLINE: [ r1nren.et.cesnet.cz ]
Full list of resources:
 Clone Set: clone_ping_gw [ping_gw]
     Started: [ r2nren.et.cesnet.cz ]
     Stopped: [ r1nren.et.cesnet.cz ]
 Resource Group: group_eduroam.cz
     standby_ip    (ocf::heartbeat:IPaddr2):    Started r2nren.et.cesnet.cz
     offline_file  (systemd:offline_file):      Started r2nren.et.cesnet.cz
     racoon        (systemd:racoon):            Started r2nren.et.cesnet.cz
     radiator      (systemd:radiator):          Started r2nren.et.cesnet.cz
     eduroam_ping  (systemd:eduroam_ping):      Started r2nren.et.cesnet.cz
     mailto        (ocf::heartbeat:MailTo):     Started r2nren.et.cesnet.cz
What could be the cause of the problem I'm encountering?
Thanks for your help.
Regards,
Vaclav
--
Václav Mach
CESNET, z.s.p.o.
www.cesnet.cz