[ClusterLabs] node utilization attributes are lost during upgrade
Strahil Nikolov
hunter86_bg at yahoo.com
Tue Aug 18 15:45:29 EDT 2020
Won't it be easier to:
- set the node in standby
- stop the node
- remove the node
- add it again with the new hostname (rough sketch below)
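
For example, with crmsh (a rough, untested sketch; "atlas0" stands in
for whichever node is being replaced, and the exact re-add steps depend
on the distribution):

    crm node standby atlas0             # move resources off the node
    systemctl stop pacemaker corosync   # stop the stack on that node
    crm node delete atlas0              # drop the stale node entry
    # reinstall/rename, then start corosync and pacemaker so the node
    # rejoins under its new identity
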
Best Regards,
Strahil Nikolov
On 18 August 2020 at 17:15:49 GMT+03:00, Ken Gaillot <kgaillot at redhat.com> wrote:
>On Tue, 2020-08-18 at 14:35 +0200, Kadlecsik József wrote:
>> Hi,
>>
>> On Mon, 17 Aug 2020, Ken Gaillot wrote:
>>
>> > On Mon, 2020-08-17 at 12:12 +0200, Kadlecsik József wrote:
>> > >
>> > > While upgrading a corosync/pacemaker/libvirt/KVM cluster from
>> > > Debian stretch to buster, all the node utilization attributes were
>> > > erased from the configuration. However, the same attributes were
>> > > kept at the VirtualDomain resources. As a result, all resources
>> > > with utilization attributes were stopped.
>> >
>> > Ouch :(
>> >
>> > There are two types of node attributes, transient and permanent.
>> > Transient attributes last only until pacemaker is next stopped on
>> > the
>> > node, while permanent attributes persist between reboots/restarts.
>> >
>> > If you configured the utilization attributes with crm_attribute
>> > -z/--utilization, they will default to permanent, but it's possible
>> > to override that with -l/--lifetime reboot (or equivalently,
>> > -t/--type status).
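>> >
>> > Concretely, something like this (sketch; node name and values taken
>> > from this thread):
>> >
>> >   # permanent utilization attribute (the default with -z):
>> >   crm_attribute -N atlas0 -z -n hv_memory -v 192
>> >
>> >   # transient variant, discarded when pacemaker stops on the node:
>> >   crm_attribute -N atlas0 -z -n cpu -v 32 -l reboot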
>>
>> The attributes were defined by "crm configure edit", simply stating:
>>
>> node 1084762113: atlas0 \
>>         utilization hv_memory=192 cpu=32 \
>>         attributes standby=off
>> ...
>> node 1084762119: atlas6 \
>>         utilization hv_memory=192 cpu=32
>>
>> But I believe now that corosync caused the problem, because the
>> nodes had been renumbered:
>
>Ah yes, that would do it. Pacemaker would consider them different nodes
>with the same names. The "other" node's attributes would not apply to
>the "new" node.
>
>The upgrade procedure would be similar except that you would start
>corosync by itself after each upgrade. After all nodes were upgraded,
>you would modify the CIB on one node (while pacemaker is not running)
>with:
>
>CIB_file=/var/lib/pacemaker/cib/cib.xml cibadmin --modify \
>    --scope=nodes -X '...'
>
>where '...' is a <node> XML entry from the CIB with the "id" value
>changed to the new ID, and repeat that for each node. Then, start
>pacemaker on that node and wait for it to come up, then start pacemaker
>on the other nodes.
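>
>For atlas0, with the IDs from this thread, that would be something
>like (illustrative, untested):
>
>CIB_file=/var/lib/pacemaker/cib/cib.xml cibadmin --modify \
>    --scope=nodes -X '<node id="3232245761" uname="atlas0"/>'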
>
>>
>> node 3232245761: atlas0
>> ...
>> node 3232245767: atlas6
>>
>> The upgrade process was:
>>
>> for each node do
>>     set the "hold" mark on the corosync package
>>     put the node in standby
>>     wait for the resources to be migrated off
>>     upgrade from stretch to buster
>>     reboot
>>     put the node online
>>     wait for the resources to migrate back
>> done
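>>
>> In Debian terms, each iteration was roughly (sketch; sources.list
>> changes and the waiting steps omitted; atlas0 stands for the current
>> node):
>>
>>     apt-mark hold corosync      # keep corosync at the stretch version
>>     crm node standby atlas0
>>     apt update && apt full-upgrade
>>     reboot
>>     crm node online atlas0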
>>
>> Up to this point all resources were running fine.
>>
>> In order to upgrade corosync, we followed these steps:
>>
>> enable maintenance mode
>> stop pacemaker and corosync on all nodes
>> for each node do
>>     remove the hold mark and upgrade corosync
>>     install the new config file (nodeid not specified)
>>     restart corosync, start pacemaker
>> done
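>>
>> Roughly, in commands (sketch):
>>
>>     crm configure property maintenance-mode=true   # once, on any node
>>     systemctl stop pacemaker corosync              # on every node
>>     # then, per node:
>>     apt-mark unhold corosync && apt install corosync
>>     # install the new corosync.conf, then:
>>     systemctl start corosync
>>     systemctl start pacemaker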
>>
>> We could see that all resources were running unmanaged. When we
>> disabled maintenance mode, they were stopped.
>>
>> So I think corosync renumbered the nodes, and I suspect the reason is
>> that "clear_node_high_bit: yes" was not specified in the new config
>> file (indeed, 3232245761 - 1084762113 = 2^31, exactly the high bit).
>> So it was an admin error after all.
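>>
>> The old config presumably carried something like this in its totem
>> section (sketch):
>>
>>     totem {
>>             version: 2
>>             # keep auto-generated nodeids (derived from the ring0
>>             # address) below 2^31:
>>             clear_node_high_bit: yes
>>     }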
>>
>> Best regards,
>> Jozsef
>> --
>> E-mail : kadlecsik.jozsef at wigner.hu
>> PGP key: https://wigner.hu/~kadlec/pgp_public_key.txt
>> Address: Wigner Research Centre for Physics
>> H-1525 Budapest 114, POB. 49, Hungary
>--
>Ken Gaillot <kgaillot at redhat.com>
>