[ClusterLabs] Rebuild of failed node
alexey at pavlyuts.ru
Mon May 12 20:36:11 UTC 2025
Hi,
As it happens, I run Pacemaker as the base layer of a custom clustering solution, and I have a script that rebuilds the second node from the first one. I can't share the script itself as it has a lot of solution-dependent references, but I can share the sequence used to rebuild the failed node:
1. Setup the new node with the same IP and hostname
2. (optional) Set up passwordless mutual key-based SSH access. It is not necessary, but it makes a lot of things easier (steps 2-4 are sketched with example commands after this list).
3. Copy these files from the surviving host to the new one:
a. /etc/corosync/authkey
b. /etc/corosync/corosync.conf
c. /etc/drbd.d/*.res
d. /etc/pacemaker/authkey
4. Set the hacluster user password to the same value it had on the surviving node.
5. Re-authenticate the pcs nodes with:
pcs host auth <host1_name> <host2_name> -u hacluster -p <ha_cluster_pass>
6. Reboot the restored server
7. PROFIT!!!
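For reference, steps 2-4 might look like the following. This is a minimal sketch; the hostname "node2" and direct root SSH access are assumptions, so adjust to your environment:

# on the surviving node; node2 = hostname of the rebuilt node
ssh-keygen -t ed25519          # only if no key exists yet
ssh-copy-id root@node2
scp -p /etc/corosync/authkey /etc/corosync/corosync.conf root@node2:/etc/corosync/
scp -p /etc/drbd.d/*.res root@node2:/etc/drbd.d/
scp -p /etc/pacemaker/authkey root@node2:/etc/pacemaker/
# on the new node: give hacluster the same password as on the surviving node
passwd hacluster

After the reboot, "pcs status" on the surviving node should show the rebuilt node rejoining the cluster.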
If you use no arbiter (corosync-qnetd), this should be enough to get your new cluster node up and running. If you do use corosync-qnetd, you also need to restore the corosync-qdevice NSS DB keys so the second host can connect to the arbiter node:
1. On the surviving host, extract the arbiter CA certificate from the NSS DB:
certutil -L -d /etc/corosync/qdevice/net/nssdb -n 'QNet CA' -r > /root/qnetd-cert.crt
2. Copy the certificate to the new host (a sample scp is shown after this list); assume the path on the new host is the same.
3. On the new host, initialize a new NSS DB with the certificate:
corosync-qdevice-net-certutil -i -c /root/qnetd-cert.crt
4. Copy the certificate and key at /etc/corosync/qdevice/net/nssdb/qdevice-net-node.p12 from the old node to the new one.
5. On the new node, import the certificate and key:
corosync-qdevice-net-certutil -m -c /etc/corosync/qdevice/net/nssdb/qdevice-net-node.p12
6. Enable or restart corosync-qdevice:
systemctl enable --now corosync-qdevice.service
or
systemctl restart corosync-qdevice.service
7. Enjoy!
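The copies in steps 2 and 4 are again plain scp, assuming root SSH access and the same paths as above:

scp -p /root/qnetd-cert.crt root@node2:/root/
scp -p /etc/corosync/qdevice/net/nssdb/qdevice-net-node.p12 root@node2:/etc/corosync/qdevice/net/nssdb/

To check that the new node actually reaches the arbiter, "corosync-qdevice-tool -s" on the new node and "pcs quorum status" on either node should both report the qdevice as connected.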
That's what works for me in practice; it is included in the service scripts of our product, which is built on Pacemaker.
Hope this helps!
Sincerely,
Alex
From: Users <users-bounces at clusterlabs.org> On Behalf Of Fabrizio Ermini
Sent: Friday, May 9, 2025 5:26 PM
To: users at clusterlabs.org
Subject: [ClusterLabs] Rebuild of failed node
Hi all! Freshmen here, just joined.
I currently need to rebuild a failed node on a Pacemaker 2.1 / Corosync 3.1 two-node cluster with DRBD storage.
I've searched the Pacemaker docs and the list archives, but I haven't found a clear guide on how to proceed with this task. So far, I've reinstalled a new server, configured the same IP and hostname as the failed one, and installed all the software. I've also fixed the DRBD layer and started the resync of the volumes. But it's not clear to me how to proceed - I've found some hints online pointing to the need to manually copy the corosync config, but they were quite old and probably obsolete. I'm using pcs as a shell and I haven't found a command designed to replace a node, only ones to add or remove nodes.
It seems really strange to me that there isn't a guide, since this should be a very basic operation and it's quite important to know how to do it - HW breaks, as a matter of fact :D
So I'll be very grateful if anyone can point me in the right direction.
Thanks in advance, and best regards
Fabrizio