[ClusterLabs] pacemaker won't start because duplicate node but can't remove dupe node because pacemaker won't start
JC
snafuxnj at yahoo.com
Wed Dec 18 02:44:25 EST 2019
I've asked this question on Server Fault, and I'll re-ask the whole thing here for posterity's sake:
https://serverfault.com/questions/995981/pacemaker-wont-start-because-duplicate-node-but-cant-remove-dupe-node-because
OK! I'm really new to pacemaker/corosync, as in one day new.
Software: Ubuntu 18.04 LTS and the versions associated with that distro.
pacemakerd: 1.1.18
corosync: 2.4.3
I accidentally removed the nodes from my entire test cluster (3 nodes).
When I tried to bring everything back up using the `pcsd` GUI, that failed because the nodes were "wiped out". Cool.
So, I had a copy of the last `corosync.conf` from my "primary" node. I copied it to the other two nodes, fixed the `bindnetaddr` in each conf, and ran `pcs cluster start` on my "primary" node.
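For reference, the sequence looked roughly like this (<node2> and <node3> stand in for the other two hosts; scp is just how I happened to copy the file):

    # on the "primary" node, after putting my saved corosync.conf back in place
    scp /etc/corosync/corosync.conf <node2>:/etc/corosync/corosync.conf
    scp /etc/corosync/corosync.conf <node3>:/etc/corosync/corosync.conf
    # fixed bindnetaddr by hand on each node, then:
    pcs cluster start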
One of the nodes failed to come up. I took a look at the status of `pacemaker` on that node and saw the following error:
Dec 18 06:33:56 region-ctrl-2 crmd[1049]: crit: Nodes 1084777441 and 2 share the same name 'region-ctrl-2': shutting down
I tried running `crm_node -R --force 1084777441` on the machine where `pacemaker` won't start, but of course `pacemaker` isn't running, so I got a `crmd: connection refused (111)` error. I then ran the same command on one of the healthy nodes, which reported no errors, but the node never went away and `pacemaker` on the affected machine continued to show the same error.
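To be concrete, the removal attempts were along these lines (the ID is taken straight from the log line above):

    # on the broken node -- fails because crmd isn't running
    crm_node -R --force 1084777441
    # on a healthy node -- exits without complaint, but the stale entry never goes away
    crm_node -R --force 1084777441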
So, I decided to tear down the entire cluster and start again. I purged all the related packages from the machine, reinstalled everything fresh, copied the `corosync.conf` over and fixed it, and recreated the cluster. I get the exact same bloody error.
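The teardown itself was just a purge and reinstall of the packages, roughly (package names as on Ubuntu 18.04; I may be forgetting one):

    apt-get purge -y pacemaker corosync pcs
    apt-get install -y pacemaker corosync pcs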
So this node with ID `1084777441` is not a machine I created; it's one the cluster created for me. Earlier in the day I realized that I was using IP addresses in `corosync.conf` instead of names, so I fixed the `/etc/hosts` on the machines and removed the IP addresses from the corosync config. That's how I inadvertently deleted my whole cluster in the first place (I removed the nodes that were identified by IP addresses).
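The `/etc/hosts` fix was just adding name entries for the three nodes, something like this (the addresses are placeholders here; only the names matter to the nodelist below):

    192.168.99.x    region-ctrl-1
    192.168.99.y    region-ctrl-2
    192.168.99.z    postgres-sb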
The following is my corosync.conf:
totem {
    version: 2
    cluster_name: maas-cluster
    token: 3000
    token_retransmits_before_loss_const: 10
    clear_node_high_bit: yes
    crypto_cipher: none
    crypto_hash: none

    interface {
        ringnumber: 0
        bindnetaddr: 192.168.99.225
        mcastport: 5405
        ttl: 1
    }
}

logging {
    fileline: off
    to_stderr: no
    to_logfile: no
    to_syslog: yes
    syslog_facility: daemon
    debug: off
    timestamp: on

    logger_subsys {
        subsys: QUORUM
        debug: off
    }
}

quorum {
    provider: corosync_votequorum
    expected_votes: 3
    two_node: 1
}

nodelist {
    node {
        ring0_addr: postgres-sb
        nodeid: 3
    }

    node {
        ring0_addr: region-ctrl-2
        nodeid: 2
    }

    node {
        ring0_addr: region-ctrl-1
        nodeid: 1
    }
}
The only thing different about this conf between the nodes is the `bindnetaddr`.
There seems to be a chicken-and-egg issue here, unless there's some way I'm not aware of to remove a node from a flat-file or SQLite database somewhere, or some other more authoritative way to remove a node from the cluster.
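In case it helps with an answer, the closest I've found to an "authoritative" view of the membership is querying corosync and the CIB from a healthy node; I'm assuming these are the right places to look (corrections welcome):

    corosync-cmapctl | grep members      # runtime membership as corosync sees it
    cibadmin --query --scope nodes       # node entries stored in the CIB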