[ClusterLabs] Antw: [EXT] Re: Disable all resources in a group if one or more of them fail and are unable to reactivate

Igor Tverdovskiy igor.tverdovskiy.pe at gmail.com
Fri Jan 29 00:11:51 EST 2021


Hi Damiano,

As a workaround I can suggest the following solution (in short, add an
infinite-score colocation rule between the first resource in the group and
the last resource in the group; a generic sketch follows the list below).

*How it works:*
With this constraint in place the group behaves as you expect:
1) If at least one resource in the group fails, all resources located
below the affected resource in the group are stopped as well
2) If the last resource of the group is stopped, the first resource of the
group is stopped as well
3) If the first resource of the group is stopped, all the remaining
running resources are stopped as well
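
A generic sketch of the rule in crm shell syntax (the placeholder names in
angle brackets are mine; the concrete configuration for this cluster is
shown under "Details" below):

colocation <any-id> inf: <first-resource-in-group> <last-resource-in-group>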

Details:

*Here is the entire configuration:*

> sudo crm configure show
> node 2: tt738741-ip1 \
>         attributes cloud-mode=0 site=Main webdispatcher-192.168.3.211=1
> node 3: tt738741-ip2 \
>         attributes cloud-mode=0 site=Main webdispatcher-192.168.3.211=1
> primitive haproxy-192.168.3.211 ocf:my:haproxy \
>         params vipaddress=192.168.3.211 \
>         op monitor interval=10s timeout=30s \
>         op start interval=0 timeout=30s \
>         op stop interval=0 timeout=60s \
>         meta failure-timeout=120s migration-threshold=3 target-role=Started
> primitive vip-192.168.3.211 ocf:my:IPaddr2 \
>         params ip=192.168.3.211 iflabel=wd cidr_netmask=28 \
>         op monitor interval=10s timeout=20s \
>         op start interval=0 timeout=30s \
>         op stop interval=0 timeout=30s \
>         meta failure-timeout=120s migration-threshold=3 target-role=Started
> primitive vip-alias-192.168.3.216 ocf:my:IPaddr2 \
>         params ip=192.168.3.216 iflabel=wd cidr_netmask=28 \
>         op monitor interval=10s timeout=20s \
>         op start interval=0 timeout=30s \
>         op stop interval=0 timeout=30s \
>         meta failure-timeout=120s migration-threshold=3
> group grp_webdispatcher-192.168.3.211 vip-192.168.3.211
> vip-alias-192.168.3.216 haproxy-192.168.3.211 \
>         meta
> location allow-grp_webdispatcher-192.168.3.211
> grp_webdispatcher-192.168.3.211 \
>         rule 100: webdispatcher-192.168.3.211 eq 1
> colocation colocate-group-resources inf: vip-192.168.3.211
> haproxy-192.168.3.211
> location deny-grp_webdispatcher-192.168.3.211
> grp_webdispatcher-192.168.3.211 \
>         rule -inf: webdispatcher-192.168.3.211 ne 1
> property cib-bootstrap-options: \
>         have-watchdog=false \
>         dc-version=1.1.15-11.el7-e174ec8 \
>         cluster-infrastructure=corosync \
>         stonith-enabled=false \
>         no-quorum-policy=ignore \
>         symmetric-cluster=false \
>         default-resource-stickiness=1 \
>         cluster-recheck-interval=60s \
>         dc-deadtime=30 \
>         pe-input-series-max=5000 \
>         pe-error-series-max=5000 \
>         pe-warn-series-max=5000 \
>         shutdown-escalation=2min \
>         maintenance-mode=false \
>         last-lrm-refresh=1563900766
> rsc_defaults rsc_defaults-options:



*Here is how the colocation rule looks in XML format:*

> <constraints>
>   <rsc_colocation id="colocate-group-resources" rsc="vip-192.168.3.211"
> with-rsc="haproxy-192.168.3.211" score="INFINITY"/>
> </constraints>


*Where*:
"colocate-group-resources" - any id string you like for the constraint
"vip-192.168.3.211" - id of the first resource in the group
"haproxy-192.168.3.211" - id of the last resource in the group
score="INFINITY" - makes the constraint mandatory: vip-192.168.3.211 is
stopped if haproxy-192.168.3.211 is not running

*see*
https://clusterlabs.org/pacemaker/doc/en-US/Pacemaker/2.0/html-single/Pacemaker_Explained/index.html#_colocation_properties


*How to add:*
1) Via "sudo crm configure edit"
- insert "colocation colocate-group-resources inf: vip-192.168.3.211
haproxy-192.168.3.211" after groups declaration
- save file and exit
- you'll see warnings, but it still works:
WARNING: colocate-group-resources: resource vip-192.168.3.211 is grouped,
constraints should apply to the group
WARNING: colocate-group-resources: resource haproxy-192.168.3.211 is
grouped, constraints should apply to the group
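
If you prefer not to open an editor, crmsh can also take the constraint as
a one-shot command (a sketch, assuming your crmsh version accepts
non-interactive "configure" commands; expect the same warnings as above):

sudo crm configure colocation colocate-group-resources inf: vip-192.168.3.211 haproxy-192.168.3.211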

2) Via cibadmin, by replacing cib.xml ("crm configure edit xml" didn't work
for me)
sudo cibadmin -Q > ~/orig.cib.xml
cp ~/orig.cib.xml ~/new.cib.xml
head -1 ~/new.cib.xml | grep admin_epoch
# in your editor:
# - increase admin_epoch by 1
# - insert the above XML into the <constraints>..</constraints> section if
#   it exists, or create a new one after the </resources> tag
nano ~/new.cib.xml

# replace current CIB (cluster information base) with a new one:
sudo cibadmin --replace --xml-file ~/new.cib.xml
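
Alternatively, instead of replacing the whole CIB, a single constraint can
usually be created in place (a sketch, assuming cibadmin's -C/-o/-X short
options behave the same on your version; no admin_epoch bump is needed in
that case):

# create the constraint directly in the constraints section
sudo cibadmin -C -o constraints -X '<rsc_colocation id="colocate-group-resources" rsc="vip-192.168.3.211" with-rsc="haproxy-192.168.3.211" score="INFINITY"/>'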


*Test results:*

> sudo crm status
> ...
>  Resource Group: grp_webdispatcher-192.168.3.211
>      vip-192.168.3.211  (ocf::my:IPaddr2):        *Started* tt738741-ip1
>      vip-alias-192.168.3.216    (ocf::my:IPaddr2):        *Started*
> tt738741-ip1
>      haproxy-192.168.3.211      (ocf::my:haproxy):        *Started*
> tt738741-ip1



> sudo crm resource *stop* haproxy-192.168.3.211

> sudo crm status
> ...
>  Resource Group: grp_webdispatcher-192.168.3.211
>      vip-192.168.3.211  (ocf::my:IPaddr2):        *Stopped*
>      vip-alias-192.168.3.216    (ocf::my:IPaddr2):        *Stopped*
>      haproxy-192.168.3.211      (ocf::my:haproxy):        *Stopped*
> (disabled)




> sudo crm resource *start* haproxy-192.168.3.211


> sudo crm status
> ...
>  Resource Group: grp_webdispatcher-192.168.3.211
>      vip-192.168.3.211  (ocf::my:IPaddr2):        *Started* tt738741-ip1
>      vip-alias-192.168.3.216    (ocf::my:IPaddr2):        *Started*
> tt738741-ip1
>      haproxy-192.168.3.211      (ocf::my:haproxy):        *Started*
> tt738741-ip1


Regards,
Igor


On Thu, Jan 28, 2021 at 8:31 PM Ken Gaillot <kgaillot at redhat.com> wrote:

> I've opened a feature request for this:
>
> https://bugs.clusterlabs.org/show_bug.cgi?id=5465
>
> Realistically, developer time is tight for the foreseeable future, so
> it's more a wish-list item unless someone volunteers to work on it.
>
> On Thu, 2021-01-28 at 17:42 +0100, damiano giuliani wrote:
> > Hi Ulrich, thanks for the answer,
> > as Ken explained to me, there isn't any way to prevent earlier members
> > from running if a later member has no available node.
> > If no node is available for the failed member, it will just remain
> > stopped, and the earlier members will stay active where they are.
> > I really hoped there was a solution or workaround for this, but as Ken
> > clarified, Pacemaker can't handle this exception.
> >
> > Many thanks for your quick and effective support.
> >
> > Have a good evening!
> >
> > Damiano
> >
> >
> > On Thu, Jan 28, 2021 at 11:15 Ulrich Windl <
> > Ulrich.Windl at rz.uni-regensburg.de> wrote:
> > > >>> damiano giuliani <damianogiuliani87 at gmail.com> wrote on
> > > 27.01.2021 at 19:25 in message
> > > <CAG=zYNOx-R=wKbhtm=4N7qaoYKE=ofORVQ7jA0jr17oYjgqOhQ at mail.gmail.com
> > > >:
> > > > Hi Andrei, thanks for your help.
> > > > If one of my resources in the group fails or the primary node goes
> > > > down (in my case acspcmk-02), the probe notices it and Pacemaker
> > > > tries to restart the whole resource group on the second node.
> > > > If the second node can't run one of my grouped resources, it tries
> > > > to stop them.
> > >
> > > And what exactly is it that you want? The behavior described is how
> > > the cluster handles it normally.
> > >
> > > >
> > > >
> > > > I attached my cluster status; my primary node (acspcmk-02) fails
> > > > and the resource group tries to restart on acspcmk-01. I keep the
> > > > resource "lta-subscription-backend-ope-s3" broken on purpose, and
> > > > as you can see some grouped resources are still started.
> > > > I would like to know how to achieve a condition where the resource
> > > > group must start properly with every resource, or otherwise stop
> > > > the whole group, without leaving some services still up and running.
> > > >
> > > >
> > > > 2 nodes configured
> > > > 28 resources configured
> > > >
> > > > Online: [ acspcmk-01 ]
> > > > OFFLINE: [ acspcmk-02 ]
> > > >
> > > > Full list of resources:
> > > >
> > > >  Clone Set: lta-odata-frontend-ope-s1-clone [lta-odata-frontend-
> > > ope-s1]
> > > >      Started: [ acspcmk-01 ]
> > > >      Stopped: [ acspcmk-02 ]
> > > >  Clone Set: lta-odata-frontend-ope-s2-clone [lta-odata-frontend-
> > > ope-s2]
> > > >      Started: [ acspcmk-01 ]
> > > >      Stopped: [ acspcmk-02 ]
> > > >  Clone Set: lta-odata-frontend-ope-s3-clone [lta-odata-frontend-
> > > ope-s3]
> > > >      Started: [ acspcmk-01 ]
> > > >      Stopped: [ acspcmk-02 ]
> > > >  Clone Set: s1ltaestimationtime-clone [s1ltaestimationtime]
> > > >      Started: [ acspcmk-01 ]
> > > >      Stopped: [ acspcmk-02 ]
> > > >  Clone Set: s2ltaestimationtime-clone [s2ltaestimationtime]
> > > >      Started: [ acspcmk-01 ]
> > > >      Stopped: [ acspcmk-02 ]
> > > >  Clone Set: s3ltaestimationtime-clone [s3ltaestimationtime]
> > > >      Started: [ acspcmk-01 ]
> > > >      Stopped: [ acspcmk-02 ]
> > > >  Clone Set: openresty-clone [openresty]
> > > >      Started: [ acspcmk-01 ]
> > > >      Stopped: [ acspcmk-02 ]
> > > >  Resource Group: LTA_SINGLE_RESOURCES
> > > >      VIP        (ocf::heartbeat:IPaddr2):       Started acspcmk-
> > > 01
> > > >      lta-subscription-backend-ope-s1
> > > >  (systemd:lta-subscription-backend-ope-s1):      Started acspcmk-
> > > 01
> > > >      lta-subscription-backend-ope-s2
> > > >  (systemd:lta-subscription-backend-ope-s2):      Started acspcmk-
> > > 01
> > > >      lta-subscription-backend-ope-s3
> > > >  (systemd:lta-subscription-backend-ope-s3):      Stopped
> > > >      s1ltaquotaservice  (systemd:s1ltaquotaservice):    Stopped
> > > >      s2ltaquotaservice  (systemd:s2ltaquotaservice):    Stopped
> > > >      s3ltaquotaservice  (systemd:s3ltaquotaservice):    Stopped
> > > >      s1ltarolling       (systemd:s1ltarolling): Stopped
> > > >      s2ltarolling       (systemd:s2ltarolling): Stopped
> > > >      s3ltarolling       (systemd:s3ltarolling): Stopped
> > > >      s1srvnotificationdispatcher
> > > >  (systemd:s1srvnotificationdispatcher):  Stopped
> > > >      s2srvnotificationdispatcher
> > > >  (systemd:s2srvnotificationdispatcher):  Stopped
> > > >      s3srvnotificationdispatcher
> > > >  (systemd:s3srvnotificationdispatcher):  Stopped
> > > >
> > > > Failed Resource Actions:
> > > > * lta-subscription-backend-ope-s3_start_0 on acspcmk-01 'unknown
> > > error'
> > > > (1): call=466, status=complete, exitreason='',
> > > >     last-rc-change='Wed Jan 27 13:00:21 2021', queued=0ms,
> > > exec=2128ms
> > > >
> > > > Daemon Status:
> > > >   corosync: active/disabled
> > > >   pacemaker: active/disabled
> > > >   pcsd: active/enabled
> > > >   sbd: active/enabled
> > > >
> > > >
> > > >   I hope I explained my problem as well as I could,
> > > >
> > > > Thanks for your time and help.
> > > >
> > > > Good Evening
> > > >
> > > > Damiano
> > > >
> > > > On Wed, Jan 27, 2021 at 19:03 Andrei Borzenkov <
> > > > arvidjaar at gmail.com> wrote:
> > > >
> > > >> On 27.01.2021 19:06, damiano giuliani wrote:
> > > >> > Hi all, I'm pretty new to clusters. I'm struggling to configure
> > > >> > a bunch of resources and test how they fail over. My need is to
> > > >> > start and manage a group of resources as one (to achieve this a
> > > >> > resource group has been created), and if one of them can't run
> > > >> > and keeps failing, the cluster should try to restart the
> > > >> > resource group on the secondary node; if it can't run all the
> > > >> > resources together, it should disable the whole resource group.
> > > >> > I would like to know if there is a way to set the cluster to
> > > >> > disable all the resources of the group (or the group itself) if
> > > >> > it can't run all the resources somewhere.
> > > >> >
> > > >>
> > > >> That's what a pacemaker group does. I am not sure what you mean
> > > >> by "disable all resources". If a resource's fail count on a node
> > > >> exceeds the threshold, this node is banned from running the
> > > >> resource. If the resource failed on every node, no node can run it
> > > >> until you clear the fail count.
> > > >>
> > > >> "Disabling a resource" in pacemaker would mean setting its
> > > >> target-role to Stopped. That does not happen automatically (at
> > > >> least I am not aware of it).
> --
> Ken Gaillot <kgaillot at redhat.com>
>
>

