[ClusterLabs] node utilization attributes are lost during upgrade
Kadlecsik József
kadlecsik.jozsef at wigner.hu
Tue Aug 18 08:35:27 EDT 2020
Hi,
On Mon, 17 Aug 2020, Ken Gaillot wrote:
> On Mon, 2020-08-17 at 12:12 +0200, Kadlecsik József wrote:
> >
> > While upgrading a corosync/pacemaker/libvirt/KVM cluster from Debian
> > stretch to buster, all the node utilization attributes were erased
> > from the configuration. However, the same attributes were kept on the
> > VirtualDomain resources. As a result, all resources with
> > utilization attributes were stopped.
>
> Ouch :(
>
> There are two types of node attributes, transient and permanent.
> Transient attributes last only until pacemaker is next stopped on the
> node, while permanent attributes persist between reboots/restarts.
>
> If you configured the utilization attributes with crm_attribute -z/
> --utilization, it will default to permanent, but it's possible to
> override that with -l/--lifetime reboot (or equivalently, -t/--type
> status).
The attributes were defined with "crm configure edit", simply as:
node 1084762113: atlas0 \
        utilization hv_memory=192 cpu=32 \
        attributes standby=off
...
node 1084762119: atlas6 \
        utilization hv_memory=192 cpu=32 \
But I now believe that corosync caused the problem, because the nodes
have been renumbered:
node 3232245761: atlas0
...
node 3232245767: atlas6
The upgrade process was:
for each node do
    set the "hold" mark on the corosync package
    put the node standby
    wait for the resources to be migrated off
    upgrade from stretch to buster
    reboot
    put the node online
    wait for the resources to be migrated (back)
done
Up to this point all resources were running fine.
To upgrade corosync, we then followed these steps:
enable maintenance mode
stop pacemaker and corosync on all nodes
for each node do
    delete the hold mark and upgrade corosync
    install new config file (nodeid not specified)
    restart corosync, start pacemaker
done
We could see that all resources were running unmanaged. When we disabled
maintenance mode, they were stopped.
So I think corosync renumbered the nodes, and I suspect the reason is
that "clear_node_high_bit: yes" was not specified in the new config
file. In that case it was an admin error.
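The node IDs quoted above are consistent with that suspicion. A minimal Python sketch of the arithmetic follows. It assumes that corosync derived the IDs from the 32-bit ring0 IPv4 address, which is what corosync.conf(5) describes when nodeid is omitted; the function names are hypothetical and only for illustration:

```python
import ipaddress

# Node IDs quoted in this message: (before, after) the corosync upgrade.
NODE_IDS = {
    "atlas0": (1084762113, 3232245761),
    "atlas6": (1084762119, 3232245767),
}


def clear_high_bit(nodeid: int) -> int:
    """Force bit 31 to zero, the documented effect of clear_node_high_bit: yes."""
    return nodeid & 0x7FFFFFFF


def as_ipv4(nodeid: int) -> str:
    """Read a 32-bit node ID as a dotted-quad IPv4 address."""
    return str(ipaddress.IPv4Address(nodeid))


def check_ids(node_ids):
    """Return (name, new ID as IPv4, whether clearing bit 31 gives the old ID)."""
    return [
        (name, as_ipv4(new), clear_high_bit(new) == old)
        for name, (old, new) in node_ids.items()
    ]


if __name__ == "__main__":
    for name, addr, matches in check_ids(NODE_IDS):
        print(f"{name}: new ID reads as {addr}; "
              f"high bit cleared equals old ID: {matches}")
```

For both nodes, the new ID with bit 31 cleared equals the old ID. The dotted-quad reading of the new IDs is only meaningful if the IDs were in fact generated from the ring0 addresses.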
Best regards,
Jozsef
--
E-mail : kadlecsik.jozsef at wigner.hu
PGP key: https://wigner.hu/~kadlec/pgp_public_key.txt
Address: Wigner Research Centre for Physics
H-1525 Budapest 114, POB. 49, Hungary