[ClusterLabs] Corosync/Pacemaker bug methinks! (was: pacemaker won't start because duplicate node but can't remove dupe node because pacemaker won't start)
JC
snafuxnj at yahoo.com
Wed Dec 18 15:11:18 EST 2019
Adding More Context
I've made sure that `/etc/hosts` and the hostname of each of the machines match. I forgot to mention that.
127.0.0.1 localhost
127.0.1.1 postgres
192.168.99.224 postgres-sb
192.168.99.223 region-ctrl-1
192.168.99.225 region-ctrl-2
192.168.7.224 postgres-sb
192.168.7.223 region-ctrl-1
192.168.7.225 region-ctrl-2
# The following lines are desirable for IPv6 capable hosts
::1 ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
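As a quick sanity check on the hosts file above, here is a minimal Python sketch that lists every name mapped to more than one address. The helper name `addresses_by_name` and the embedded `HOSTS_TEXT` copy are illustrative assumptions, not part of any cluster tooling. Which of several addresses a resolver actually returns for a repeated name depends on the resolver configuration, so treat the output as a hint about where to look rather than a diagnosis.

```python
from collections import defaultdict

# Copy of the /etc/hosts contents quoted above (illustrative input only).
HOSTS_TEXT = """\
127.0.0.1 localhost
127.0.1.1 postgres
192.168.99.224 postgres-sb
192.168.99.223 region-ctrl-1
192.168.99.225 region-ctrl-2
192.168.7.224 postgres-sb
192.168.7.223 region-ctrl-1
192.168.7.225 region-ctrl-2
# The following lines are desirable for IPv6 capable hosts
::1 ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
"""


def addresses_by_name(hosts_text):
    """Map each hostname to the addresses it appears with, in file order."""
    names = defaultdict(list)
    for line in hosts_text.splitlines():
        # Drop comments and surrounding whitespace, skip empty lines.
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        address, *hostnames = line.split()
        for name in hostnames:
            names[name].append(address)
    return dict(names)


mapping = addresses_by_name(HOSTS_TEXT)
for name, addresses in mapping.items():
    if len(addresses) > 1:
        print(name, addresses)
```

Each of the three cluster hostnames shows up on both the 192.168.99.x and the 192.168.7.x network, so it is worth confirming which of the two addresses `corosync` ends up binding to on each node.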
I decided to try to start from scratch. I ran `apt remove --purge` on `corosync*`, `pacemaker*`, `crmsh`, and `pcs`, then `rm -rf`ed `/etc/corosync`. I kept a copy of the `corosync.conf` on each machine.
I re-installed all the things on each of the machines. I copied my saved `corosync.conf` to `/etc/corosync/` and restarted `corosync` on all the machines.
I *STILL* get the same exact error. This has to be a bug in one of the components!
So it seems that `crm_get_peer` is failing to recognize that the host named `region-ctrl-2` is assigned nodeid 2 in `corosync.conf`. Node 2 then gets auto-assigned an ID of 1084777441. This is the part that doesn't make sense to me. The hostname of the machine is `region-ctrl-2`, set in `/etc/hostname` and `/etc/hosts` and confirmed using `uname -n`. The `corosync.conf` explicitly assigns an ID to the machine named `region-ctrl-2`, but something is apparently not picking up that assignment from `corosync` and is instead assigning the non-random ID 1084777441 to this host. How the freak do I fix this?
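The ID 1084777441 does not look arbitrary. The sketch below treats it as a 32-bit IPv4 address, on the assumption (not confirmed in this thread) that `corosync` derives a node ID from the ring 0 IPv4 address when no configured `nodeid` is matched, optionally with the highest bit cleared. The function name `candidate_addresses` is hypothetical; only the arithmetic is certain.

```python
import ipaddress

# Highest bit of a 32-bit value. Assumption: an auto-generated node ID may
# have this bit cleared, so both variants are decoded.
HIGH_BIT = 0x80000000


def candidate_addresses(nodeid):
    """Return the IPv4 addresses a 32-bit node ID could encode."""
    as_is = ipaddress.IPv4Address(nodeid)
    with_high_bit = ipaddress.IPv4Address(nodeid | HIGH_BIT)
    return str(as_is), str(with_high_bit)


print(candidate_addresses(1084777441))
# ('64.168.99.225', '192.168.99.225')
```

The second value, 192.168.99.225, is the address listed for `region-ctrl-2` in `/etc/hosts`. If the assumption holds, the ID is being derived from that address, which would suggest that the `nodelist` entry carrying `nodeid: 2` is not being matched to this host. A possible next step is to compare the address or name in that entry with the address `corosync` actually binds to.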