[ClusterLabs] What/how to clean up when bootstrapping new cluster (or: I have a phantom node)
Ken Gaillot
kgaillot at redhat.com
Tue May 24 17:34:10 EDT 2022
On Tue, 2022-05-24 at 20:05 +0000, Andreas Hasenack wrote:
> Hi,
>
> I'm trying to find out the correct steps to start a
> corosync/pacemaker
> cluster right after installing its packages in Debian or Ubuntu.
>
> I'm not using crmsh or pcs on purpose, I really wanted to get this
> basic initial step working without those.
>
> Right after install, the default config has this nodelist:
> nodelist {
>     # Change/uncomment/add node sections to match cluster configuration
>
>     node {
>         # Hostname of the node
>         name: node1
>         # Cluster membership node identifier
>         nodeid: 1
>         # Address of first link
>         ring0_addr: 127.0.0.1
>         # When knet transport is used it's possible to define up to 8 links
>         #ring1_addr: 192.168.1.1
>     }
>     # ...
> }
>
>
> (full default pristine config:
> https://pastebin.ubuntu.com/p/htBkCvBWqr/)
>
> This results in a crm_mon output of:
>
> Cluster Summary:
> * Stack: corosync
> * Current DC: node1 (version 2.0.3-4b1f869f0f) - partition with quorum
> * Last updated: Tue May 24 19:57:05 2022
> * Last change: Tue May 24 19:56:59 2022 by hacluster via crmd on node1
> * 1 node configured
> * 0 resource instances configured
>
> Node List:
> * Online: [ node1 ]
>
> Active Resources:
> * No active resources
>
> I also tried with corosync 3.1.6 and pacemaker 2.1.2, btw.
>
> I then proceed to make changes to corosync.conf. I give it a real
> hostname, ring IP, and node id:
> nodelist {
>     # Change/uncomment/add node sections to match cluster configuration
>
>     node {
>         # Hostname of the node
>         name: f4
>         # Cluster membership node identifier
>         nodeid: 104
>         # Address of first link
>         ring0_addr: 10.226.63.102
>         # When knet transport is used it's possible to define up to 8 links
>         #ring1_addr: 192.168.1.1
>     }
>     # ...
> }
>
>
> Then I restart the services:
>
> systemctl restart pacemaker corosync
>
> But now I have this phantom "node1" in the cluster, and the cluster
> thinks it has two nodes:
>
> Cluster Summary:
> * Stack: corosync
> * Current DC: f4 (version 2.0.3-4b1f869f0f) - partition with quorum
> * Last updated: Tue May 24 19:59:56 2022
> * Last change: Tue May 24 19:59:22 2022 by hacluster via crmd on f4
> * 2 nodes configured
> * 0 resource instances configured
>
> Node List:
> * Node node1: UNCLEAN (offline)
> * Online: [ f4 ]
>
> Active Resources:
> * No active resources
>
>
> What is the cleanup step (or steps) that I'm missing? Or are there so
> many details that it's best to leave this to pcs/crmsh?
crm_node --remove node1

or just don't start pacemaker until corosync is correct. pcs/crmsh are
definitely much easier to use (especially as the number of nodes grows),
but if you're looking to learn the low-level details, there's nothing
wrong with that.
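As a sketch of that ordering (assuming the systemd unit names "corosync"
and "pacemaker" used on Debian/Ubuntu, and the stock config path; adjust
for your layout), the sequence would look something like:

```shell
# Keep pacemaker down while corosync.conf still describes the wrong node
systemctl stop pacemaker

# ... edit /etc/corosync/corosync.conf: set the real name, nodeid,
# and ring0_addr in the nodelist section ...

# Restart membership with the corrected nodelist, then start pacemaker,
# which now only ever sees the real node
systemctl restart corosync
systemctl start pacemaker

# If pacemaker already cached the phantom node in the CIB, purge it
crm_node --remove node1
```

These commands need a live cluster stack, so this is illustrative rather
than something to paste blindly.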
--
Ken Gaillot <kgaillot at redhat.com>