[ClusterLabs] pacemaker and cluster hostname reconfiguration

Thu Oct 1 04:51:36 EDT 2020

Ciao,

[TXT version]

I'm among the people that have to deal with with the in-famous two nodes
problem (http://www.beekhof.net/blog/2018/two-node-problems).

I am not sure if to open a bug for this.. so I'm first off reporting on
the list.. in the hope to get fast feedback.

Problem statement

I have a cluster made by two nodes with a DRBD shared partition which
some resources (systemd services) have to stick to.

Software versions

     corosync -v
     Corosync Cluster Engine, version '2.4.5'
     Copyright (c) 2006-2009 Red Hat, Inc.

     pacemakerd --version
     Pacemaker 1.1.21-4.el7

     drbdadm --version
     DRBDADM_BUILDTAG=GIT-hash:\
fb98589a8e76783d2c56155c645dbaf02ac7ece7\ build\ by\ mockbuild@\,\
2020-04-05\ 03:21:05
     DRBDADM_API_VERSION=2
     DRBD_KERNEL_VERSION_CODE=0x090010
     DRBD_KERNEL_VERSION=9.0.16
     DRBDADM_VERSION_CODE=0x090c02
     DRBDADM_VERSION=9.12.2

corosync.conf nodes:

nodelist {
     node {
         ring0_addr: 10.1.3.1
         nodeid: 1
     }
     node {
         ring0_addr: 10.1.3.2
         nodeid: 2
     }
}
quorum {
     provider: corosync_votequorum
     two_node: 1
}

drbd nodes config:

resource myresource {

   volume 0 {
     device    /dev/drbd0;
     disk      /dev/mapper/vg0-res--etc;
     meta-disk internal;
   }

   on 123z555666y0 {
     node-id 0;
     address 10.1.3.1:7789;
   }

   on 123z555666y1 {
     node-id 1;
     address 10.1.3.2:7789;
   }

   connection {
     host 123z555666y0;
     host 123z555666y1;
   }

   handlers {
     before-resync-target "/usr/lib/drbd/snapshot-resync-target-lvm.sh";
     after-resync-target "/usr/lib/drbd/unsnapshot-resync-target-lvm.sh";
   }

}

I need to reconfigure the hostname of both the nodes of the cluster.
I've gathered some literature around

     https://pacemaker.oss.clusterlabs.narkive.com/csHZkR5R/change-hostname

https://clusterlabs.org/pacemaker/doc/en-US/Pacemaker/1.1/html/Pacemaker_Explained/s-node-name.html
     https://www.suse.com/support/kb/doc/?id=000018878 <- DIDN'T  WORK
     https://bugs.clusterlabs.org/show_bug.cgi?id=5265 <- DIDN'T  WORK

but have not yet found a way to address this (unless with simultaneous
reboot of both nodes).

The procedure:

     Update the hostname on both Master and Slave nodes
         update /etc/hostname
         update /etc/hosts
         update system with hostname -F /etc/hostname
     Reconfigure drbd on Master and Slave nodes
         modify drbd.01.conf (attached) to reflect new hostname
         invoke drbdadm adjust all
     Update pacemaker config on Master node only
         crm configure property maintenance-mode=true
         crm configure delete --force 1
         crm configure delete --force 2
         crm configure xml '<node id="1" uname="newhostname0">
                 <instance_attributes id="node-1">
                   <nvpair id="node-1-standby" name="standby" value="off"/>
                 </instance_attributes>
               </node>'
         crm configure xml '<node id="2" uname="newhostname1">
                 <instance_attributes id="node-2">
                   <nvpair id="node-2-standby" name="standby" value="off"/>
                 </instance_attributes>
               </node>'
         crm resource reprobe
         crm configure refresh
         crm configure property maintenance-mode=false

Let's say for example that I migrate the hostnames like this

hostname10 -> hostname20
hostname11 -> hostname21

After the above procedure is concluded the cluster is correctly
reconfigured and when I check with crm_mon or crm status or crm
configure show xml or even by inspecting the cib.xml I find the proper
new hostnames fetched by pacemaker/corosync (hostname20 and hostname21).

The documentation reports that pacemaker node name is taken from

     corosync.conf nodelist->ring0_addr if not an ip address: NOT MY
CASE => skip
     corosync.conf nodelist->name if available: NOT MY CASE => skip
     uname -n [SHOULD BE IN HERE]

Apparently case number 3 does not apply:

[root at hostname20 ~]# crm_node -n
hostname10
[root at hostname20 ~]# uname -n
hostname20

This becomes evident as soon as I reboot/poweroff one of  the two nodes:
crm_mon which after the reconfiguration was correctly showing

Online: [ hostname21 hostname20 ]

"rolls back" the configuration without any notice and starts showing the
old one

Online: [ hostname10 ]
OFFLINE: [ hostname11 ]

Do you have any idea of where on heath pacemaker is recovering the old
hostnames ?

I've even checked  the code and see that there are cmaps involved so I
suspect there's some caching issues involved in this.

It looks like it is retaining the old hostnames in memory and when
something .. "fails" it restores them.

Besides don't blame me for this use case (reconfigure hostnames in a
two-nodes cluster), as I didn't make it up. I just carry the pain.

R
________________________________

Riccardo Manfrin
R&D DEPARTMENT
Web<https://www.athonet.com/> | LinkedIn<https://www.linkedin.com/company/athonet/>     t +39 (0)444 750045
e riccardo.manfrin at athonet.com<mailto:riccardo.manfrin at athonet.com>
[https://www.athonet.com/signature/logo_athonet.png]<https://www.athonet.com/>
ATHONET | Via Cà del Luogo, 6/8 - 36050 Bolzano Vicentino (VI) Italy
This email and any attachments are confidential and intended solely for the use of the intended recipient. If you are not the named addressee, please be aware that you shall not distribute, copy, use or disclose this email. If you have received this email by error, please notify us immediately and delete this email from your system. Email transmission cannot be guaranteed to be secured or error-free or not to contain viruses. Athonet S.r.l. processes any personal data exchanged in email correspondence in accordance with EU Reg. 679/2016 (GDPR) - you may find here the privacy policy with information on such processing and your rights. Any views or opinions presented in this email are solely those of the sender and do not necessarily represent those of Athonet S.r.l.