[ClusterLabs] node utilization attributes are lost during upgrade

Ken Gaillot kgaillot at redhat.com
Tue Aug 18 16:02:07 EDT 2020


On Tue, 2020-08-18 at 22:45 +0300, Strahil Nikolov wrote:
> Won't it be easier to:
> - put the node in standby
> - stop the node
> - remove the node
> - add it again with the new hostname

The hostname stays the same, but corosync is changing the numeric node
ID as part of the upgrade. If they remove the node, they'll lose its
utilization attributes, which is what they want to keep.

Looking at it again, I'm guessing there are no explicit node IDs in
corosync.conf, and corosync is choosing the IDs. In that case the
easiest approach would be to explicitly set the original node IDs in
corosync.conf before the upgrade, so they don't change.
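
For example, a nodelist along these lines (a sketch using the
hostnames and old IDs from this thread; whether ring0_addr carries a
name or an address depends on your setup) would pin the IDs:

    nodelist {
            node {
                    ring0_addr: atlas0
                    nodeid: 1084762113
            }
            node {
                    ring0_addr: atlas6
                    nodeid: 1084762119
            }
    }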

> 
> Best Regards,
> Strahil Nikolov
> 
> On 18 August 2020 17:15:49 GMT+03:00, Ken Gaillot <
> kgaillot at redhat.com> wrote:
> > On Tue, 2020-08-18 at 14:35 +0200, Kadlecsik József wrote:
> > > Hi,
> > > 
> > > On Mon, 17 Aug 2020, Ken Gaillot wrote:
> > > 
> > > > On Mon, 2020-08-17 at 12:12 +0200, Kadlecsik József wrote:
> > > > > 
> > > > > While upgrading a corosync/pacemaker/libvirt/KVM cluster from
> > > > > Debian stretch to buster, all the node utilization attributes
> > > > > were erased from the configuration. However, the same
> > > > > attributes were kept on the VirtualDomain resources. As a
> > > > > result, all resources with utilization attributes were
> > > > > stopped.
> > > > 
> > > > Ouch :(
> > > > 
> > > > There are two types of node attributes, transient and
> > > > permanent. Transient attributes last only until pacemaker is
> > > > next stopped on the node, while permanent attributes persist
> > > > across reboots/restarts.
> > > > 
> > > > If you configured the utilization attributes with crm_attribute
> > > > -z/--utilization, they default to permanent, but it's possible
> > > > to override that with -l/--lifetime reboot (or equivalently,
> > > > -t/--type status).
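> > > > 
> > > > For example (node and attribute names as in this thread, values
> > > > illustrative):
> > > > 
> > > >     # permanent utilization attribute (the default)
> > > >     crm_attribute --node atlas0 -z --name cpu --update 32
> > > > 
> > > >     # transient: lost when pacemaker next stops on the node
> > > >     crm_attribute --node atlas0 -z --name cpu --update 32 -l reboot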
> > > 
> > > The attributes were defined with "crm configure edit", simply
> > > stating:
> > > 
> > > node 1084762113: atlas0 \
> > >         utilization hv_memory=192 cpu=32 \
> > >         attributes standby=off
> > > ...
> > > node 1084762119: atlas6 \
> > >         utilization hv_memory=192 cpu=32
> > > 
> > > But I believe now that corosync caused the problem, because the
> > > nodes had been renumbered:
> > 
> > Ah yes, that would do it. Pacemaker would consider them different
> > nodes with the same names. The "other" node's attributes would not
> > apply to the "new" node.
> > 
> > The upgrade procedure would be similar, except that you would
> > start corosync by itself after each upgrade. After all nodes were
> > upgraded, you would modify the CIB on one node (while pacemaker is
> > not running) with:
> > 
> > CIB_file=/var/lib/pacemaker/cib/cib.xml cibadmin --modify \
> >     --scope=nodes -X '...'
> > 
> > where '...' is a <node> XML entry from the CIB with the "id" value
> > changed to the new ID, and repeat that for each node. Then start
> > pacemaker on that node and wait for it to come up, then start
> > pacemaker on the other nodes.
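> > 
> > For example, using the new atlas0 ID shown just below, that might
> > look like (a sketch; in practice take the node's full <node>
> > element from the CIB, not this abbreviated one):
> > 
> >     CIB_file=/var/lib/pacemaker/cib/cib.xml cibadmin --modify \
> >         --scope=nodes -X '<node id="3232245761" uname="atlas0"/>'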
> > 
> > > 
> > > node 3232245761: atlas0
> > > ...
> > > node 3232245767: atlas6
> > > 
> > > The upgrade process was:
> > > 
> > > for each node do
> > >     set the "hold" mark on the corosync package
> > >     put the node standby
> > >     wait for the resources to be migrated off
> > >     upgrade from stretch to buster
> > >     reboot
> > >     put the node online
> > >     wait for the resources to be migrated (back)
> > > done
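> > > 
> > > In shell terms, each pass was roughly (a sketch assuming crmsh
> > > and apt; the stretch-to-buster upgrade itself involves more than
> > > one command):
> > > 
> > >     apt-mark hold corosync
> > >     crm node standby atlas0
> > >     # ... wait for resources to migrate off ...
> > >     apt full-upgrade               # stretch -> buster
> > >     reboot
> > >     crm node online atlas0
> > >     # ... wait for resources to migrate back ...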
> > > 
> > > Up to this point all resources were running fine.
> > > 
> > > In order to upgrade corosync, we followed these steps:
> > > 
> > > enable maintenance mode
> > > stop pacemaker and corosync on all nodes
> > > for each node do
> > >     delete the hold mark and upgrade corosync
> > >     install new config file (nodeid not specified)
> > >     restart corosync, start pacemaker
> > > done
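> > > 
> > > In command form, roughly (a sketch assuming crmsh and systemd
> > > units):
> > > 
> > >     crm configure property maintenance-mode=true
> > >     systemctl stop pacemaker corosync   # on all nodes
> > >     # then, per node:
> > >     apt-mark unhold corosync
> > >     apt install corosync                # plus the new corosync.conf
> > >     systemctl restart corosync
> > >     systemctl start pacemaker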
> > > 
> > > We could see that all resources were running unmanaged. When
> > > maintenance mode was disabled, they were stopped.
> > > 
> > > So I think corosync renumbered the nodes, and I suspect the
> > > reason was that "clear_node_high_bit: yes" was not specified in
> > > the new config file. That means it was an admin error, then.
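> > > 
> > > For reference, that option lives in the totem section of
> > > corosync.conf, something like:
> > > 
> > >     totem {
> > >             clear_node_high_bit: yes
> > >     }
> > > 
> > > (And the numbers fit: the new ID 3232245761 is 0xC0A82801, and
> > > clearing the high bit gives 0x40A82801 = 1084762113, the old
> > > atlas0 ID.)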
> > > 
> > > Best regards,
> > > Jozsef
> > > --
> > > E-mail : kadlecsik.jozsef at wigner.hu
> > > PGP key: https://wigner.hu/~kadlec/pgp_public_key.txt
> > > Address: Wigner Research Centre for Physics
> > >          H-1525 Budapest 114, POB. 49, Hungary
> > 
> > -- 
> > Ken Gaillot <kgaillot at redhat.com>
> > 
> 
> 
-- 
Ken Gaillot <kgaillot at redhat.com>


