[ClusterLabs] corosync won't start after node failure
Murat Inal
mrt_nl at hotmail.com
Sun Oct 6 19:25:21 UTC 2024
More progress on this issue:
I have noticed that starting corosync triggers PTR queries for all of
the node's local IP addresses. My production cluster node has many:
1. area0: 172.30.1.1/27
2. ctd: 10.1.5.16/31
3. dep: 10.1.4.2/24
4. docker0: 172.17.0.1/16
5. fast: 10.1.5.1/28
6. gst: 192.168.5.1/24
7. ha: 10.1.5.25/29
8. inet: 100.64.64.10/29
9. iscsi1: 10.1.8.195/28
10. iscsi2: 10.1.8.211/28
11. iscsi3: 10.1.8.227/28
12. knet: 10.1.5.33/28
13. lo0: 10.1.255.1/32
14. lo: 127.0.0.1/8
15. mgmt: 10.1.3.4/24
16. nfpeeringout: 10.1.102.64/31
I created entries in /etc/hosts for all of the above (a sketch follows
below). The corosync freeze NEVER happened after that. I have two
production clusters, 5 nodes in total; I did the same on the remaining
nodes. Not a single freeze.
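For reference, the entries are ordinary address-to-name mappings, one
line per address listed above. A minimal sketch (the per-interface
hostnames are placeholders, not my real naming):

172.30.1.1    charon-area0
10.1.5.25     charon-ha
10.1.5.33     charon-knet
10.1.8.195    charon-iscsi1

With these in place the reverse lookups are answered locally (assuming
the usual nsswitch.conf order with files first) instead of being sent
to a DNS server that is not reachable yet.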
BTW, I deleted the (bogus) DNS=1.2.3.4 entry from
/etc/systemd/resolved.conf. There is no "workaround" left in the
cluster configurations.
Based on the above, my guess is that corosync somehow hangs after an
accumulated run of PTR query timeouts. Please note that there is NO
name server reachable at the time of cluster launch, so these queries
never get a response.
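A quick way to observe the timeouts (assuming getent follows the same
NSS lookup path that corosync uses) is to time a reverse lookup before
and after adding the /etc/hosts entries:

time getent hosts 10.1.5.25

Without a hosts entry and without a reachable name server this should
stall until the resolver gives up; with the entry it returns
immediately.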
If you think this is a bug, please advise me on how to proceed with
filing a report.
Thanks,
On 9/12/24 00:27, Murat Inal wrote:
> Hello Ken,
>
> I think I have resolved the problem on my own.
>
> Yes, right after the boot, corosync fails to come up. The problem
> appears to be related to name resolution. I ran corosync in the
> foreground and captured a system call trace (commands below): corosync
> froze, and the strace output was suspicious, full of calls that looked
> like name resolution.
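>
> Roughly, from memory, the commands were (in two terminals):
>
> sudo corosync -f
> sudo strace -f -p $(pidof corosync)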
>
> In my failing cluster, I am running containerized BIND9 to provide
> the regular name resolution service. Both nodes run systemd-resolved
> for the local host's own name resolution. Below are the relevant
> directives of resolved.conf:
>
> DNS=10.1.5.30
> #DNS=1.2.3.4
> #FallbackDNS=
>
> 10.1.5.30/29 is the virtual IP address on the nodes at which BIND9
> can be queried. This VIP and the BIND9 container are managed by
> pacemaker (roughly as sketched below), so after a reboot the node does
> NOT have the VIP and there is NO container running.
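>
> The resources are essentially just a VIP plus a container, something
> along these lines (resource names and image are placeholders, not my
> exact configuration):
>
> pcs resource create dns-vip ocf:heartbeat:IPaddr2 ip=10.1.5.30 cidr_netmask=29
> pcs resource create bind9 ocf:heartbeat:docker image=<bind9-image> run_opts="--net=host"
> pcs constraint colocation add bind9 with dns-vip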
>
> When I changed the directives to:
>
> #DNS=10.1.5.30
> DNS=1.2.3.4
> #FallbackDNS=
>
> corosync runs perfectly and a successful cluster launch follows.
> 1.2.3.4 is a bogus address. The node does NOT have a default route
> before cluster launch, so obviously it does NOT receive any replies to
> its name queries while corosync is coming up. However, both nodes do
> have valid addresses, 10.1.5.25/29 and 10.1.5.26/29, after a reboot;
> the 10.1.5.24/29 subnet is locally attached on both nodes.
>
> The last discovery to mention is that I monitored LOCAL name
> resolution while corosync was starting ("sudo resolvectl monitor").
> The monitor immediately displayed PTR queries for ALL LOCAL IP
> addresses of the node.
>
> Based on the above, my conclusion is that something goes wrong with
> name resolution when it points at a non-existent VIP address. In my
> first message, I mentioned that I was only able to recover corosync by
> REINSTALLING it from the repo. In order to reinstall, I was manually
> setting the default route and a name server address (8.8.8.8) so that
> "apt reinstall corosync" would actually work (roughly the commands
> below). Hence, I was unintentionally configuring a reachable DNS
> server for systemd-resolved. So it was NOT the reinstall of corosync
> that fixed things, but letting systemd-resolved use some non-local
> name server address.
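>
> What I was effectively doing before each reinstall was something like
> this (gateway and interface names are placeholders):
>
> sudo ip route add default via <upstream-gateway>
> sudo resolvectl dns <uplink-interface> 8.8.8.8
> sudo apt reinstall corosync
>
> That, of course, also handed systemd-resolved a working name server.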
>
> I have been using corosync/pacemaker in production for a couple of
> years, probably since Ubuntu Server release 21.10, and never
> encountered such a problem until now. I wrote an Ansible playbook to
> toggle systemd-resolved's DNS directive, but I think this glitch
> SHOULD NOT exist.
>
> I will be glad to receive comments on the above.
>
> Regards,
>
>
> On 8/20/24 21:55, Ken Gaillot wrote:
>> On Mon, 2024-08-19 at 12:58 +0300, Murat Inal wrote:
>>> [Resending the below due to message format problem]
>>>
>>>
>>> Dear List,
>>>
>>> I have been running two different 3-node clusters for some time. I am
>>> having a fatal problem with corosync: after a node failure, the
>>> rebooted node does NOT start corosync.
>>>
>>> Clusters:
>>>
>>> * All nodes are running Ubuntu Server 24.04
>>> * corosync is 3.1.7
>>> * corosync-qdevice is 3.0.3
>>> * pacemaker is 2.1.6
>>> * The third node in both clusters is a quorum device. The clusters
>>> use the ffsplit algorithm.
>>> * All nodes are baremetal & attached to a dedicated kronosnet
>>> network.
>>> * STONITH is enabled in one of the clusters and disabled for the
>>> other.
>>>
>>> The corosync & pacemaker systemd services are disabled (not started
>>> at boot). I start each cluster with the command pcs cluster start.
>>>
>>> corosync NEVER starts AFTER a node failure (node is rebooted). There
>> Do you mean that the first time you run "pcs cluster start" after a
>> node reboot, corosync does not come up completely?
>>
>> Try adding "debug: on" to the logging section of
>> /etc/corosync/corosync.conf
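>> so that it ends up something like this (the logfile path being
>> whatever the nodes already use):
>>
>> logging {
>>     to_logfile: yes
>>     logfile: /var/log/corosync/corosync.log
>>     debug: on
>> }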
>>
>>> is
>>> nothing in /var/log/corosync/corosync.log, service freezes as:
>>>
>>> Aug 01 12:54:56 [3193] charon corosync notice [MAIN ] Corosync
>>> Cluster
>>> Engine 3.1.7 starting up
>>> Aug 01 12:54:56 [3193] charon corosync info [MAIN ] Corosync
>>> built-in features: dbus monitoring watchdog augeas systemd xmlconf
>>> vqsim
>>> nozzle snmp pie relro bindnow
>>>
>>> corosync never starts kronosnet. I checked kronosnet interfaces, all
>>> OK,
>>> there is IP connectivity in between. If I do corosync -t, it is the
>>> same
>>> freeze.
>>>
>>> I could ONLY manage to start corosync by reinstalling it: apt
>>> reinstall
>>> corosync ; pcs cluster start.
>>>
>>> The above issue repeated itself at least 5-6 times. I do NOT see
>>> anything in syslog either. I will be glad if you lead me on how to
>>> solve
>>> this.
>>>
>>> Thanks,
>>>
>>> Murat
>>>
> _______________________________________________
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users
>
> ClusterLabs home: https://www.clusterlabs.org/