[ClusterLabs] node utilization attributes are lost during upgrade
Strahil Nikolov
hunter86_bg at yahoo.com
Tue Aug 18 15:45:29 EDT 2020
Won't it be easier to:
- set the node in standby
- stop the node
- remove the node
- add it again with the new hostname (rough sketch below)
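
For example, with crmsh (a rough, untested sketch; "atlas0" stands in
for whichever node is being replaced, and the exact re-add steps depend
on the distribution):

    crm node standby atlas0             # move resources off the node
    systemctl stop pacemaker corosync   # stop the stack on that node
    crm node delete atlas0              # drop the stale node entry
    # reinstall/rename, then start corosync and pacemaker so the node
    # rejoins under its new identity
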
Best Regards,
Strahil Nikolov
On 18 August 2020 at 17:15:49 GMT+03:00, Ken Gaillot <kgaillot at redhat.com> wrote:
>On Tue, 2020-08-18 at 14:35 +0200, Kadlecsik József wrote:
>> Hi,
>>
>> On Mon, 17 Aug 2020, Ken Gaillot wrote:
>>
>> > On Mon, 2020-08-17 at 12:12 +0200, Kadlecsik József wrote:
>> > >
>> > > While upgrading a corosync/pacemaker/libvirt/KVM cluster from
>> > > Debian stretch to buster, all the node utilization attributes were
>> > > erased from the configuration. However, the same attributes were
>> > > kept at the VirtualDomain resources. As a result, all resources
>> > > with utilization attributes were stopped.
>> >
>> > Ouch :(
>> >
>> > There are two types of node attributes, transient and permanent.
>> > Transient attributes last only until pacemaker is next stopped on
>> > the
>> > node, while permanent attributes persist between reboots/restarts.
>> >
>> > If you configured the utilization attributes with crm_attribute
>> > -z/--utilization, they will default to permanent, but it's possible
>> > to override that with -l/--lifetime reboot (or equivalently,
>> > -t/--type status).
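>> >
>> > Concretely, something like this (sketch; node name and values taken
>> > from this thread):
>> >
>> >   # permanent utilization attribute (the default with -z):
>> >   crm_attribute -N atlas0 -z -n hv_memory -v 192
>> >
>> >   # transient variant, discarded when pacemaker stops on the node:
>> >   crm_attribute -N atlas0 -z -n cpu -v 32 -l reboot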
>>
>> The attributes were defined by "crm configure edit", simply stating:
>>
>> node 1084762113: atlas0 \
>>         utilization hv_memory=192 cpu=32 \
>>         attributes standby=off
>> ...
>> node 1084762119: atlas6 \
>>         utilization hv_memory=192 cpu=32
>>
>> But I believe now that corosync caused the problem, because the
>> nodes had been renumbered:
>
>Ah yes, that would do it. Pacemaker would consider them different nodes
>with the same names. The "other" node's attributes would not apply to
>the "new" node.
>
>The upgrade procedure would be similar except that you would start
>corosync by itself after each upgrade. After all nodes were upgraded,
>you would modify the CIB on one node (while pacemaker is not running)
>with:
>
>CIB_file=/var/lib/pacemaker/cib/cib.xml cibadmin --modify \
>    --scope=nodes -X '...'
>
>where '...' is a <node> XML entry from the CIB with the "id" value
>changed to the new ID, and repeat that for each node. Then, start
>pacemaker on that node and wait for it to come up, then start pacemaker
>on the other nodes.
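>
>For atlas0, with the IDs from this thread, that would be something
>like (illustrative, untested):
>
>CIB_file=/var/lib/pacemaker/cib/cib.xml cibadmin --modify \
>    --scope=nodes -X '<node id="3232245761" uname="atlas0"/>'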
>
>>
>> node 3232245761: atlas0
>> ...
>> node 3232245767: atlas6
>>
>> The upgrade process was:
>>
>> for each node do
>>     set the "hold" mark on the corosync package
>>     put the node in standby
>>     wait for the resources to be migrated off
>>     upgrade from stretch to buster
>>     reboot
>>     put the node online
>>     wait for the resources to migrate back
>> done
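>>
>> In Debian terms, each iteration was roughly (sketch; sources.list
>> changes and the waiting steps omitted; atlas0 stands for the current
>> node):
>>
>>     apt-mark hold corosync      # keep corosync at the stretch version
>>     crm node standby atlas0
>>     apt update && apt full-upgrade
>>     reboot
>>     crm node online atlas0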
>>
>> Up to this point all resources were running fine.
>>
>> In order to upgrade corosync, we followed these steps:
>>
>> enable maintenance mode
>> stop pacemaker and corosync on all nodes
>> for each node do
>>     remove the hold mark and upgrade corosync
>>     install the new config file (nodeid not specified)
>>     restart corosync, start pacemaker
>> done
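>>
>> Roughly, in commands (sketch):
>>
>>     crm configure property maintenance-mode=true   # once, on any node
>>     systemctl stop pacemaker corosync              # on every node
>>     # then, per node:
>>     apt-mark unhold corosync && apt install corosync
>>     # install the new corosync.conf, then:
>>     systemctl start corosync
>>     systemctl start pacemaker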
>>
>> We could see that all resources were running unmanaged. When we
>> disabled maintenance mode, they were stopped.
>>
>> So I think corosync renumbered the nodes, and I suspect the reason is
>> that "clear_node_high_bit: yes" was not specified in the new config
>> file (indeed, 3232245761 - 1084762113 = 2^31, exactly the high bit).
>> So it was an admin error after all.
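>>
>> The old config presumably carried something like this in its totem
>> section (sketch):
>>
>>     totem {
>>             version: 2
>>             # keep auto-generated nodeids (derived from the ring0
>>             # address) below 2^31:
>>             clear_node_high_bit: yes
>>     }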
>>
>> Best regards,
>> Jozsef
>> --
>> E-mail : kadlecsik.jozsef at wigner.hu
>> PGP key: https://wigner.hu/~kadlec/pgp_public_key.txt
>> Address: Wigner Research Centre for Physics
>> H-1525 Budapest 114, POB. 49, Hungary
>--
>Ken Gaillot <kgaillot at redhat.com>
>