[ClusterLabs] corosync won't start after node failure
Murat Inal
mrt_nl at hotmail.com
Wed Sep 11 21:27:47 UTC 2024
Hello Ken,
I think I have resolved the problem on my own.
Yes, right after boot, corosync fails to come up. The problem appears to
be related to name resolution. I ran corosync in the foreground under
strace: corosync froze, and the strace output looked suspicious, with
many calls that resemble name resolution.
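
For anyone who wants to repeat this kind of trace, here is a minimal
sketch. The function names and the default log path are only
illustrative, not necessarily the exact commands used here. Counting
port-53 calls assumes lookups go to a resolver listening on port 53
(such as the systemd-resolved stub).

```shell
# Run corosync in the foreground under strace, logging network-related
# system calls (with timestamps) to a file. Interrupt with Ctrl-C once
# corosync hangs. The default log path is just an example.
trace_corosync() {
    out=${1:-/tmp/corosync.strace}
    sudo strace -f -tt -e trace=network -o "$out" corosync -f
}

# Count the calls in a strace log that target port 53 (DNS).
count_dns_calls() {
    grep -c -F 'sin_port=htons(53)' "$1"
}
```

Typical use would be "trace_corosync /tmp/corosync.strace" followed by
"count_dns_calls /tmp/corosync.strace". A large count while corosync is
stuck points toward name resolution.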
In my failing cluster, I am running a containerized BIND9 for regular name
resolution services. Both nodes run systemd-resolved for local name
resolution. Below are the relevant directives of resolved.conf:
DNS=10.1.5.30
#DNS=1.2.3.4
#FallbackDNS=
10.1.5.30/29 is the virtual IP address on which BIND9 can be queried.
This VIP and the BIND9 container are managed by pacemaker, so after a
reboot, the node does NOT have the VIP and there is NO container running.
When I changed the directives to:
#DNS=10.1.5.30
DNS=1.2.3.4
#FallbackDNS=
corosync starts without problems and a successful cluster launch follows.
1.2.3.4 is a dummy address. The node does NOT have a default route before
cluster launch, so it obviously does NOT receive any replies to its name
queries while corosync is coming up. However, both nodes do have a valid
address after a reboot, 10.1.5.25/29 and 10.1.5.26/29 respectively, and
the 10.1.5.24/29 subnet is directly attached on both nodes.
One last finding: I monitored LOCAL name resolution while corosync
starts ("sudo resolvectl monitor"). The monitor immediately displayed PTR
queries for ALL LOCAL IP addresses of the node.
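
The same PTR lookups can be reproduced outside corosync. Below is a
minimal sketch, assuming the "ip", "getent" and "timeout" tools are
available; the function names and the 10-second limit are arbitrary
choices made for illustration.

```shell
# Print the IPv4 addresses found in "ip -o -4 addr show" output (stdin).
# Field 4 of each line is ADDRESS/PREFIXLEN.
list_ipv4() {
    awk '{ split($4, a, "/"); print a[1] }'
}

# For each address on stdin, time a reverse (PTR) lookup made through the
# libc resolver. rc=124 means the lookup did not finish within 10 s.
time_ptr_lookups() {
    while read -r addr; do
        start=$(date +%s)
        rc=0
        timeout 10 getent hosts "$addr" >/dev/null || rc=$?
        end=$(date +%s)
        echo "$addr rc=$rc seconds=$((end - start))"
    done
}
```

Running "ip -o -4 addr show | list_ipv4 | time_ptr_lookups" right after
a reboot, before the cluster is started, should show whether reverse
lookups for the node's own addresses return quickly or stall.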
Based on the above, my conclusion is that name resolution goes wrong
when systemd-resolved points at the VIP address that does not exist yet.
In my first message, I mentioned that I was only able to recover corosync
by REINSTALLING it from the repo. To reinstall, I was manually setting
the default route and a name server address (8.8.8.8) so that "apt
reinstall corosync" could work. Hence, I was unintentionally configuring
a DNS server for systemd-resolved. So it was NOT about reinstalling
corosync, but about letting systemd-resolved use some non-local name
server address.
I have been using corosync/pacemaker in production for a couple of
years, probably since Ubuntu Server release 21.10, and never encountered
such a problem until now. I wrote an ansible playbook to toggle
systemd-resolved's DNS directive; however, I think this glitch SHOULD NOT
exist.
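
The toggle itself is small. Here is a minimal shell sketch of the idea
(not the playbook mentioned above). The function name is made up, and it
assumes GNU sed and that the DNS= lines sit directly in the given
resolved.conf under [Resolve].

```shell
# Make ADDR the only active DNS= entry in a resolved.conf-style FILE.
# Usage: set_resolved_dns FILE ADDR
set_resolved_dns() {
    file=$1
    addr=$2
    # Escape dots so sed matches the address literally.
    pat=$(printf '%s' "$addr" | sed 's/\./\\./g')
    # Comment out every active DNS= line (FallbackDNS= is not touched).
    sed -i 's/^DNS=/#DNS=/' "$file"
    if grep -q -x -F "#DNS=$addr" "$file"; then
        # Re-activate the existing line for this address.
        sed -i "s/^#DNS=$pat\$/DNS=$addr/" "$file"
    else
        # No such line yet: append one at the end of the file.
        printf 'DNS=%s\n' "$addr" >> "$file"
    fi
}
```

For example, "set_resolved_dns /etc/systemd/resolved.conf 1.2.3.4"
before cluster start and "set_resolved_dns /etc/systemd/resolved.conf
10.1.5.30" afterwards, each followed by "sudo systemctl restart
systemd-resolved" so that the change takes effect.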
I will be glad if I receive comments on the above.
Regards,
On 8/20/24 21:55, Ken Gaillot wrote:
> On Mon, 2024-08-19 at 12:58 +0300, Murat Inal wrote:
>> [Resending the below due to message format problem]
>>
>>
>> Dear List,
>>
>> I have been running two different 3-node clusters for some time. I
>> am
>> having a fatal problem with corosync: After a node failure, rebooted
>> node does NOT start corosync.
>>
>> Clusters;
>>
>> * All nodes are running Ubuntu Server 24.04
>> * corosync is 3.1.7
>> * corosync-qdevice is 3.0.3
>> * pacemaker is 2.1.6
>> * The third node at both clusters is a quorum device. Cluster is on
>> ffsplit algorithm.
>> * All nodes are baremetal & attached to a dedicated kronosnet
>> network.
>> * STONITH is enabled in one of the clusters and disabled for the
>> other.
>>
>> corosync & pacemaker service starts (systemd) are disabled. I am
>> starting any cluster with the command pcs cluster start.
>>
>> corosync NEVER starts AFTER a node failure (node is rebooted). There
> Do you mean that the first time you run "pcs cluster start" after a
> node reboot, corosync does not come up completely?
>
> Try adding "debug: on" to the logging section of
> /etc/corosync/corosync.conf
>
>> is
>> nothing in /var/log/corosync/corosync.log, service freezes as:
>>
>> Aug 01 12:54:56 [3193] charon corosync notice [MAIN ] Corosync
>> Cluster
>> Engine 3.1.7 starting up
>> Aug 01 12:54:56 [3193] charon corosync info [MAIN ] Corosync
>> built-in features: dbus monitoring watchdog augeas systemd xmlconf
>> vqsim
>> nozzle snmp pie relro bindnow
>>
>> corosync never starts kronosnet. I checked kronosnet interfaces, all
>> OK,
>> there is IP connectivity in between. If I do corosync -t, it is the
>> same
>> freeze.
>>
>> I could ONLY manage to start corosync by reinstalling it: apt
>> reinstall
>> corosync ; pcs cluster start.
>>
>> The above issue repeated itself at least 5-6 times. I do NOT see
>> anything in syslog either. I will be glad if you lead me on how to
>> solve
>> this.
>>
>> Thanks,
>>
>> Murat
>>