[ClusterLabs] node utilization attributes are lost during upgrade

Ken Gaillot kgaillot at redhat.com
Tue Aug 18 10:15:49 EDT 2020


On Tue, 2020-08-18 at 14:35 +0200, Kadlecsik József wrote:
> Hi,
> 
> On Mon, 17 Aug 2020, Ken Gaillot wrote:
> 
> > On Mon, 2020-08-17 at 12:12 +0200, Kadlecsik József wrote:
> > > 
> > > While upgrading a corosync/pacemaker/libvirt/KVM cluster from
> > > Debian stretch to buster, all the node utilization attributes
> > > were erased from the configuration. However, the same attributes
> > > were kept at the VirtualDomain resources. This resulted in all
> > > resources with utilization attributes being stopped.
> > 
> > Ouch :(
> > 
> > There are two types of node attributes, transient and permanent.
> > Transient attributes last only until pacemaker is next stopped on
> > the node, while permanent attributes persist between
> > reboots/restarts.
> > 
> > If you configured the utilization attributes with crm_attribute
> > -z/--utilization, they default to permanent, but it's possible to
> > override that with -l/--lifetime reboot (or equivalently,
> > -t/--type status).
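
As an illustration (node name and values taken from this thread), a
permanent utilization attribute versus a transient one would look
something like:

  # permanent (the default for -z/--utilization)
  crm_attribute --node atlas0 -z --name hv_memory --update 192

  # transient: lost when pacemaker next stops on the node
  crm_attribute --node atlas0 -z --name hv_memory --update 192 -l reboot
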
> 
> The attributes were defined by "crm configure edit", simply stating:
> 
> node 1084762113: atlas0 \
>         utilization hv_memory=192 cpu=32 \
>         attributes standby=off
> ...
> node 1084762119: atlas6 \
>         utilization hv_memory=192 cpu=32 \
> 
> But I believe now that corosync caused the problem, because the
> nodes had been renumbered:

Ah yes, that would do it. Pacemaker would consider them different nodes
with the same names. The "other" node's attributes would not apply to
the "new" node.

The upgrade procedure would be similar except that you would start
corosync by itself after each upgrade. After all nodes were upgraded,
you would modify the CIB on one node (while pacemaker is not running)
with:

  CIB_file=/var/lib/pacemaker/cib/cib.xml cibadmin --modify --scope=nodes -X '...'

where '...' is a <node> XML entry from the CIB with the "id" value
changed to the new ID, and repeat that for each node. Then, start
pacemaker on that node and wait for it to come up, then start pacemaker
on the other nodes.
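
For example, with the IDs from this thread (copy the actual <node>
element from your own CIB rather than typing it from scratch), the
command for atlas0 might look like:

  CIB_file=/var/lib/pacemaker/cib/cib.xml cibadmin --modify \
    --scope=nodes -X '<node id="3232245761" uname="atlas0"/>'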

> 
> node 3232245761: atlas0
> ...
> node 3232245767: atlas6
> 
> The upgrade process was:
> 
> for each node do
>     set the "hold" mark on the corosync package
>     put the node in standby
>     wait for the resources to be migrated off
>     upgrade from stretch to buster
>     reboot
>     put the node online
>     wait for the resources to be migrated (back)
> done
> 
> Up to this point all resources were running fine.
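
For reference, those per-node steps map roughly to the following, with
atlas0 as the example node:

  apt-mark hold corosync     # keep the old corosync through the upgrade
  crm node standby atlas0    # resources migrate off
  # upgrade stretch -> buster, then reboot
  crm node online atlas0     # resources migrate back
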
> 
> In order to upgrade corosync, we followed the next steps:
> 
> enable maintenance mode
> stop pacemaker and corosync on all nodes
> for each node do
>     delete the hold mark and upgrade corosync
>     install new config file (nodeid not specified)
>     restart corosync, start pacemaker
> done
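
Roughly, in command terms (package and unit names as on Debian):

  crm configure property maintenance-mode=true   # once, on any node
  systemctl stop pacemaker corosync              # on every node
  # then, per node:
  apt-mark unhold corosync
  apt install corosync               # plus the new corosync.conf (no nodeid)
  systemctl start corosync pacemaker
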
> 
> We could see that all resources were running unmanaged. When we
> disabled maintenance mode, those resources were stopped.
> 
> So I think corosync renumbered the nodes, and I suspect the reason
> was that "clear_node_high_bit: yes" was not specified in the new
> config file. In other words, it was an admin error.
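
That matches the numbers: without a configured nodeid, corosync
derives the ID from the node's ring address, and 3232245761 is
0xC0A82801, i.e. 192.168.40.1 (presumably atlas0's address), while
clearing the high bit gives 0x40A82801 = 1084762113, the old ID. The
old behavior can be requested in the totem section, though pinning
explicit nodeid: values in the nodelist is the safer fix:

  totem {
          clear_node_high_bit: yes
  }
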
> 
> Best regards,
> Jozsef
> --
> E-mail : kadlecsik.jozsef at wigner.hu
> PGP key: https://wigner.hu/~kadlec/pgp_public_key.txt
> Address: Wigner Research Centre for Physics
>          H-1525 Budapest 114, POB. 49, Hungary
-- 
Ken Gaillot <kgaillot at redhat.com>


