[ClusterLabs] corosync won't start after node failure

Wed Sep 11 21:27:47 UTC 2024

Hello Ken,

I think I have resolved the problem on my own.

Yes, right after boot, corosync fails to come up. The problem appears to
be related to name resolution. I ran corosync in the foreground under
strace: corosync froze, and the trace looked suspicious, with many calls
that resembled name resolution.

In the failing cluster, I run a containerized BIND9 for regular name
resolution. Both nodes run systemd-resolved for the local host's own
name resolution. Below are the relevant directives of resolved.conf:

DNS=10.1.5.30
#DNS=1.2.3.4
#FallbackDNS=

10.1.5.30/29 is the virtual IP address (VIP) on which BIND9 can be
queried. The VIP and the BIND9 container are both managed by pacemaker,
so after a reboot the node does NOT have the VIP and there is NO
container running.

When I changed the directives to:

#DNS=10.1.5.30
DNS=1.2.3.4
#FallbackDNS=

corosync runs perfectly and a successful cluster launch follows. 1.2.3.4
is a bogus address. The node does NOT have a default route before the
cluster launch. Obviously, the node does NOT receive any replies to its
name queries while corosync is coming up. However, after a reboot the
nodes do have valid addresses, 10.1.5.25/29 and 10.1.5.26/29, and the
10.1.5.24/29 subnet is locally attached on both nodes.
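
The subnet arithmetic above can be checked with a short Python sketch.
It only confirms that the VIP lies inside the locally attached /29 while
1.2.3.4 does not. Whether that on-link versus off-link difference is what
changes the resolver's behaviour is an assumption, not something
established in this thread. The helper name on_link is illustrative.

```python
import ipaddress

# Addresses quoted in this message.
subnet = ipaddress.ip_network("10.1.5.24/29")
node_addrs = ["10.1.5.25", "10.1.5.26"]
vip = "10.1.5.30"      # DNS VIP, absent until pacemaker starts it
bogus = "1.2.3.4"      # placeholder DNS address, not on any local subnet


def on_link(addr: str) -> bool:
    """Return True if addr belongs to the locally attached /29."""
    return ipaddress.ip_address(addr) in subnet


# Usable host addresses of the /29 (network and broadcast excluded).
hosts = [str(h) for h in subnet.hosts()]

for addr in node_addrs + [vip, bogus]:
    print(f"{addr:<12} on-link: {on_link(addr)}")
```

The VIP is the last usable host address of the /29, so a query sent to it
stays on the local segment, whereas a query to 1.2.3.4 needs a route that
does not exist before the cluster is up.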

One last finding: I monitored LOCAL name resolution while corosync was
starting ("sudo resolvectl monitor"). The monitor immediately displayed
PTR queries for ALL LOCAL IP addresses of the node.
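
For readers who want to reproduce the observation, here is a minimal
Python sketch of what such PTR queries look like and how long a reverse
lookup takes on a node. The function names ptr_name and
timed_reverse_lookup are illustrative. Which component actually issues
the queries during corosync startup is not established here.

```python
import ipaddress
import socket
import time


def ptr_name(addr: str) -> str:
    """Return the DNS name that a PTR query for addr asks about."""
    return ipaddress.ip_address(addr).reverse_pointer


def timed_reverse_lookup(addr: str):
    """Do a reverse lookup through the system resolver.

    Returns (hostname or None, elapsed seconds). A long elapsed time
    suggests the resolver is waiting on an unreachable DNS server.
    """
    start = time.monotonic()
    try:
        name = socket.gethostbyaddr(addr)[0]
    except OSError:
        name = None
    return name, time.monotonic() - start


# Node addresses quoted in this message.
for addr in ("10.1.5.25", "10.1.5.26"):
    print(addr, "->", ptr_name(addr))
```

Running timed_reverse_lookup("10.1.5.25") on a freshly rebooted node,
once with each DNS= setting, would show whether the lookup itself is
what stalls.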

Based on the above, my conclusion is that something goes wrong with
name resolution when the configured DNS server is the not-yet-existing
VIP address. In my first message, I mentioned that I was only able to
recover corosync by REINSTALLING it from the repo. To reinstall, I was
manually setting the default route and a name server address (8.8.8.8)
so that "apt reinstall corosync" would work. Hence, I was
unintentionally configuring a DNS server for systemd-resolved. So it was
NOT reinstalling corosync that helped, but letting systemd-resolved use
a non-local name server address.

I have been using corosync/pacemaker in production for a couple of
years, probably since Ubuntu Server release 21.10, and never encountered
such a problem until now. I wrote an ansible playbook to toggle
systemd-resolved's DNS directive; however, I think this glitch SHOULD
NOT exist.
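
To illustrate the kind of toggle meant here, below is a minimal Python
sketch. It is NOT the ansible playbook itself. The function name
select_dns is hypothetical, and it operates only on the text of
resolved.conf. Writing the result back and restarting systemd-resolved
are left out.

```python
def select_dns(conf_text: str, active: str) -> str:
    """Enable the DNS= line whose value equals `active`.

    Every other DNS= line is commented out. Lines for other
    directives (for example FallbackDNS=) are left untouched.
    """
    out = []
    for line in conf_text.splitlines():
        stripped = line.lstrip("#").strip()
        if stripped.startswith("DNS="):
            value = stripped[len("DNS="):]
            line = f"DNS={value}" if value == active else f"#DNS={value}"
        out.append(line)
    return "\n".join(out) + "\n"


# The directives quoted earlier in this message.
before = "DNS=10.1.5.30\n#DNS=1.2.3.4\n#FallbackDNS=\n"
after = select_dns(before, "1.2.3.4")
print(after, end="")
```

Calling select_dns again with "10.1.5.30" restores the original
directives, so the same helper covers both directions of the toggle.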

I would be glad to receive comments on the above.

Regards,

On 8/20/24 21:55, Ken Gaillot wrote:
> On Mon, 2024-08-19 at 12:58 +0300, Murat Inal wrote:
>> [Resending the below due to message format problem]
>>
>>
>> Dear List,
>>
>> I have been running two different 3-node clusters for some time. I am
>> having a fatal problem with corosync: After a node failure, rebooted
>> node does NOT start corosync.
>>
>> Clusters;
>>
>>    * All nodes are running Ubuntu Server 24.04
>>    * corosync is 3.1.7
>>    * corosync-qdevice is 3.0.3
>>    * pacemaker is 2.1.6
>>    * The third node at both clusters is a quorum device. Cluster is on
>>      ffsplit algorithm.
>>    * All nodes are baremetal & attached to a dedicated kronosnet
>>      network.
>>    * STONITH is enabled in one of the clusters and disabled for the
>>      other.
>>
>> corosync & pacemaker service starts (systemd) are disabled. I am
>> starting any cluster with the command pcs cluster start.
>>
>> corosync NEVER starts AFTER a node failure (node is rebooted). There
> Do you mean that the first time you run "pcs cluster start" after a
> node reboot, corosync does not come up completely?
>
> Try adding "debug: on" to the logging section of
> /etc/corosync/corosync.conf
>
>> is
>> nothing in /var/log/corosync/corosync.log, service freezes as:
>>
>> Aug 01 12:54:56 [3193] charon corosync notice  [MAIN  ] Corosync Cluster Engine 3.1.7 starting up
>> Aug 01 12:54:56 [3193] charon corosync info    [MAIN  ] Corosync built-in features: dbus monitoring watchdog augeas systemd xmlconf vqsim nozzle snmp pie relro bindnow
>>
>> corosync never starts kronosnet. I checked kronosnet interfaces, all
>> OK, there is IP connectivity in between. If I do corosync -t, it is
>> the same freeze.
>>
>> I could ONLY manage to start corosync by reinstalling it: apt
>> reinstall corosync ; pcs cluster start.
>>
>> The above issue repeated itself at least 5-6 times. I do NOT see
>> anything in syslog either. I will be glad if you lead me on how to
>> solve this.
>>
>> Thanks,
>>
>> Murat
>>