[ClusterLabs] Disabled resources after parallel removing of group

Miroslav Lisik mlisik at redhat.com
Wed May 22 04:21:49 EDT 2024


Hi,
see comments inline.

On 5/17/24 17:46, Александр Руденко wrote:
> Miroslav, thank you!
> 
> It helps me understand that it's not a configuration issue.
> 
> BTW, is it okay to create new resources in parallel?

As with parallel 'remove' operations, it is not safe to run 'create'
operations in parallel, although it may work in some cases.

The 'pcs resource create' command updates the CIB using CIB diffs and
cibadmin's '--patch' option, which is different from 'pcs resource remove',
where a combination of '--replace' and '--delete' is used.

There is still a risk that a CIB patch will not apply or that something will
break due to parallel actions.

Do not use pcs commands in parallel on a live cluster. Instead, modify a CIB
file using the pcs '-f' option and then push the CIB configuration to the
cluster:
pcs cluster cib-push <filename>
OR
pcs cluster cib-push <filename> diff-against=<filename_original>

The difference between these two commands is in the method used to apply the
CIB update. The first form uses cibadmin's '--replace' option and the second
uses the '--patch' option.
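
For example, a batched removal of two hypothetical resources res-A and res-B
(the names are only placeholders) could look like this:

  # dump the current CIB and keep an untouched copy for diffing
  pcs cluster cib > original.xml
  cp original.xml new.xml

  # stage all changes against the file instead of the live cluster
  pcs -f new.xml resource disable res-A res-B
  pcs -f new.xml resource remove res-A
  pcs -f new.xml resource remove res-B

  # apply everything to the cluster as a single CIB update
  pcs cluster cib-push new.xml diff-against=original.xml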

> On timeline it looks like:
> 
> pcs resource create resA1 .... --group groupA
> pcs resource create resB1 .... --group groupB
> resA1 Started
> pcs resource create resA2 .... --group groupA
> resB1 Started
> pcs resource create resB2 .... --group groupB
> resA2 Started
> resB2 Started
> 
> For now, it works okay)
> 
> In our case, cluster events like 'create' and 'remove' are generated by 
> users, and for now we don't have any queue for operations. But now, I 
> realized that we need a queue for 'remove' operations. Maybe we need a 
> queue for 'create' operations too?

Yes, it is better to prevent users from running modifying operations at the
same time.
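
If a proper queue is not available yet, even a coarse lock around every
modifying pcs call helps; a rough sketch (the wrapper name and lock file path
are only examples):

  #!/bin/sh
  # pcs-serialized: run one pcs command at a time under an exclusive lock,
  # so concurrent create/remove requests cannot interleave.
  exec flock /run/pcs-modify.lock pcs "$@"

Then call e.g. 'pcs-serialized resource remove group-1' instead of pcs
directly. A real queue with ordering guarantees is of course better.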

> 
> пт, 17 мая 2024 г. в 17:49, Miroslav Lisik <mlisik at redhat.com 
> <mailto:mlisik at redhat.com>>:
> 
>     Hi Aleksandr!
> 
>     It is not safe to use the `pcs resource remove` command in parallel because
>     you run into the same issues you already described. Processes run by the
>     remove command are not synchronized.
> 
>     Unfortunately, the remove command does not support more than one resource
>     yet.
> 
>     If you really need to remove several resources at once, you can use this
>     method:
>     1. get the current cib configuration:
>     pcs cluster cib > original.xml
> 
>     2. create a new copy of the file:
>     cp original.xml new.xml
> 
>     3. disable all resources to be removed, using the -f option and the new
>     configuration file:
>     pcs -f new.xml resource disable <resource id>...
> 
>     4. remove the resources, using the -f option and the new configuration file:
>     pcs -f new.xml resource remove <resource id>
>     ...
> 
>     5. push new cib configuration to the cluster
>     pcs cluster cib-push new.xml diff-against=original.xml
> 
> 
>     On 5/17/24 13:47, Александр Руденко wrote:
>      > Hi!
>      >
>      > I am new to the pacemaker world and, unfortunately, I have problems
>      > with simple actions like group removal. Please help me understand
>      > where I'm going wrong.
>      >
>      > For simplicity I will use standard resources like IPaddr2 (but we have
>      > this problem on any type of our custom resources).
>      >
>      > I have 5 groups like this:
>      >
>      > Full List of Resources:
>      >    * Resource Group: group-1:
>      >      * ip-11 (ocf::heartbeat:IPaddr2): Started vdc16
>      >      * ip-12 (ocf::heartbeat:IPaddr2): Started vdc16
>      >    * Resource Group: group-2:
>      >      * ip-21 (ocf::heartbeat:IPaddr2): Started vdc17
>      >      * ip-22 (ocf::heartbeat:IPaddr2): Started vdc17
>      >    * Resource Group: group-3:
>      >      * ip-31 (ocf::heartbeat:IPaddr2): Started vdc18
>      >      * ip-32 (ocf::heartbeat:IPaddr2): Started vdc18
>      >    * Resource Group: group-4:
>      >      * ip-41 (ocf::heartbeat:IPaddr2): Started vdc16
>      >      * ip-42 (ocf::heartbeat:IPaddr2): Started vdc16
>      >
>      > Groups were created by the following simple script:
>      > cat groups.sh
>      > pcs resource create ip-11 ocf:heartbeat:IPaddr2 ip=10.7.1.11
>      > cidr_netmask=24 nic=lo op monitor interval=10s --group group-1
>      > pcs resource create ip-12 ocf:heartbeat:IPaddr2 ip=10.7.1.12
>      > cidr_netmask=24 nic=lo op monitor interval=10s --group group-1
>      >
>      > pcs resource create ip-21 ocf:heartbeat:IPaddr2 ip=10.7.1.21
>      > cidr_netmask=24 nic=lo op monitor interval=10s --group group-2
>      > pcs resource create ip-22 ocf:heartbeat:IPaddr2 ip=10.7.1.22
>      > cidr_netmask=24 nic=lo op monitor interval=10s --group group-2
>      >
>      > pcs resource create ip-31 ocf:heartbeat:IPaddr2 ip=10.7.1.31
>      > cidr_netmask=24 nic=lo op monitor interval=10s --group group-3
>      > pcs resource create ip-32 ocf:heartbeat:IPaddr2 ip=10.7.1.32
>      > cidr_netmask=24 nic=lo op monitor interval=10s --group group-3
>      >
>      > pcs resource create ip-41 ocf:heartbeat:IPaddr2 ip=10.7.1.41
>      > cidr_netmask=24 nic=lo op monitor interval=10s --group group-4
>      > pcs resource create ip-42 ocf:heartbeat:IPaddr2 ip=10.7.1.42
>      > cidr_netmask=24 nic=lo op monitor interval=10s --group group-4
>      >
>      > Next, I try to remove all of these groups in 'parallel':
>      > cat remove.sh
>      > pcs resource remove group-1 &
>      > sleep 0.2
>      > pcs resource remove group-2 &
>      > sleep 0.2
>      > pcs resource remove group-3 &
>      > sleep 0.2
>      > pcs resource remove group-4 &
>      >
>      > After this, every time I have a few resources in some groups which
>      > were not removed. It looks like:
>      >
>      > Full List of Resources:
>      >    * Resource Group: group-2 (disabled):
>      >      * ip-21 (ocf::heartbeat:IPaddr2): Stopped (disabled)
>      >    * Resource Group: group-4 (disabled):
>      >      * ip-41 (ocf::heartbeat:IPaddr2): Stopped (disabled)
>      >
>      > In the logs, I can see all resources stopping successfully, but after
>      > stopping some resources it looks like pacemaker just 'forgot' about the
>      > deletion and didn't finish it.
>      >
>      > Cluster name: pacemaker1
>      > Cluster Summary:
>      >    * Stack: corosync
>      >    * Current DC: vdc16 (version 2.1.0-8.el8-7c3f660707) - partition with quorum
>      >    * Last updated: Fri May 17 14:30:14 2024
>      >    * Last change:  Fri May 17 14:30:05 2024 by root via cibadmin on vdc16
>      >    * 3 nodes configured
>      >    * 2 resource instances configured (2 DISABLED)
>      >
>      > Node List:
>      >    * Online: [ vdc16 vdc17 vdc18 ]
>      >
>      > Host OS is CentOS 8.4. Cluster with default settings. vdc16, vdc17 and
>      > vdc18 are VMs with 4 vCPUs.
>      >
> _______________________________________________
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users
> 
> ClusterLabs home: https://www.clusterlabs.org/


