[ClusterLabs] node utilization attributes are lost during upgrade

Tue Aug 18 08:35:27 EDT 2020

Hi,

On Mon, 17 Aug 2020, Ken Gaillot wrote:

> On Mon, 2020-08-17 at 12:12 +0200, Kadlecsik József wrote:
> > 
> > When upgrading a corosync/pacemaker/libvirt/KVM cluster from Debian 
> > stretch to buster, all the node utilization attributes were erased 
> > from the configuration. However, the same attributes were kept on the 
> > VirtualDomain resources. As a result, all resources with 
> > utilization attributes were stopped.
> 
> Ouch :(
> 
> There are two types of node attributes, transient and permanent. 
> Transient attributes last only until pacemaker is next stopped on the 
> node, while permanent attributes persist between reboots/restarts.
> 
> If you configured the utilization attributes with crm_attribute -z/ 
> --utilization, it will default to permanent, but it's possible to 
> override that with -l/--lifetime reboot (or equivalently, -t/--type 
> status).

The attributes were defined with "crm configure edit", simply stating:

node 1084762113: atlas0 \
        utilization hv_memory=192 cpu=32 \
        attributes standby=off
...
node 1084762119: atlas6 \
        utilization hv_memory=192 cpu=32 \

But I now believe that corosync caused the problem, because the nodes have 
been renumbered:

node 3232245761: atlas0
...
node 3232245767: atlas6

The upgrade process was:

for each node do
    set the "hold" mark on the corosync package
    put the node standby
    wait for the resources to be migrated off
    upgrade from stretch to buster
    reboot
    put the node online
    wait for the resources to be migrated (back)
done

Up to this point all resources were running fine.

To upgrade corosync, we then took the following steps:

enable maintenance mode
stop pacemaker and corosync on all nodes
for each node do
    delete the hold mark and upgrade corosync
    install new config file (nodeid not specified)
    restart corosync, start pacemaker
done

We could see that all resources were running unmanaged. When we disabled 
maintenance mode, they were stopped.
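
In hindsight, a check like the one below, run before leaving maintenance 
mode, might have flagged the problem. It is only a sketch in Python: it 
assumes the output of "crm configure show" was saved before and after the 
corosync upgrade, and that node lines look like the ones quoted in this 
mail. The function names are made up for illustration.

```python
import re

# Matches crmsh node lines such as "node 1084762113: atlas0 \"
NODE_LINE = re.compile(r"^node\s+(\d+):\s+(\S+)")


def node_ids(crm_config):
    """Return {node name: node ID} parsed from 'crm configure show' text."""
    ids = {}
    for line in crm_config.splitlines():
        match = NODE_LINE.match(line)
        if match:
            ids[match.group(2)] = int(match.group(1))
    return ids


def renumbered(before, after):
    """Return {node name: (old ID, new ID)} for nodes whose ID changed."""
    old, new = node_ids(before), node_ids(after)
    return {
        name: (old[name], new[name])
        for name in old
        if name in new and old[name] != new[name]
    }


if __name__ == "__main__":
    # Sample data taken from the node entries quoted in this mail.
    before = (
        "node 1084762113: atlas0 \\\n"
        "        utilization hv_memory=192 cpu=32 \\\n"
        "        attributes standby=off\n"
    )
    after = "node 3232245761: atlas0\n"
    for name, (old_id, new_id) in renumbered(before, after).items():
        print(f"{name}: node ID changed from {old_id} to {new_id}")
```

Any output at all would be a reason to stay in maintenance mode and 
compare the node entries by hand.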

So I think corosync renumbered the nodes, and I suspect the reason is 
that "clear_node_high_bit: yes" was not specified in the new config 
file. In other words, it was an admin error.
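
The arithmetic on the node IDs quoted above is consistent with that 
suspicion. A minimal Python sketch, assuming corosync derives an 
unspecified nodeid from the IPv4 ring address taken as a 32-bit integer, 
and that clear_node_high_bit simply masks the top bit. The 192.168.40.x 
addresses are decoded from the IDs themselves, not copied from the real 
config, and nodeid_from_ipv4 is a hypothetical helper name.

```python
import ipaddress

HIGH_BIT_MASK = 0x7FFFFFFF  # all 32 bits except the highest one


def nodeid_from_ipv4(addr, clear_high_bit):
    """Assumed derivation: IPv4 address as a 32-bit integer, optionally
    with the highest bit cleared."""
    value = int(ipaddress.IPv4Address(addr))
    return value & HIGH_BIT_MASK if clear_high_bit else value


if __name__ == "__main__":
    # New node IDs as shown by the cluster after the corosync upgrade.
    for new_id in (3232245761, 3232245767):
        addr = str(ipaddress.IPv4Address(new_id))
        old_id = nodeid_from_ipv4(addr, clear_high_bit=True)
        print(addr, new_id, old_id)
    # prints:
    # 192.168.40.1 3232245761 1084762113
    # 192.168.40.7 3232245767 1084762119
```

The old and the new IDs differ exactly by the high bit, which matches 
the IDs of atlas0 and atlas6 before and after the upgrade. Setting an 
explicit nodeid for each node in the nodelist would avoid depending on 
this derivation at all.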

Best regards,
Jozsef
--
E-mail : kadlecsik.jozsef at wigner.hu
PGP key: https://wigner.hu/~kadlec/pgp_public_key.txt
Address: Wigner Research Centre for Physics
         H-1525 Budapest 114, POB. 49, Hungary