[ClusterLabs] Sub‑clusters / super‑clusters - working :)

Fri Aug 6 08:14:09 EDT 2021

On Thu, Aug 5, 2021 at 3:44 PM Antony Stone
<Antony.Stone at ha.open.source.it> wrote:
>
> On Thursday 05 August 2021 at 10:51:37, Antony Stone wrote:
>
> > On Thursday 05 August 2021 at 07:48:37, Ulrich Windl wrote:
> > >
> > > Have you ever tried to find out why this happens? (Talking about logs)
> >
> > Not in detail, no, but just in case there's a chance of getting this
> > working as suggested simply using location constraints, I shall look
> > further.
>
> I now have a working solution - thank you to everyone who has helped.
>
> The answer to the problem above was simple - with a 6-node cluster, 3 votes is
> not quorum.
>
> I added a 7th node (in "city C") and adjusted the location constraints to
> ensure that cluster A resources run in city A, cluster B resources run in city
> B, and the "anywhere" resource runs in either city A or city B.
>
> I've even added a colocation constraint to ensure that the "anywhere" resource
> runs on the same machine in either city A or city B as is running the local
> resources there (which wasn't a strict requirement, but is very useful).
>
> For anyone interested in the detail of how to do this (without needing booth),
> here is my cluster.conf file, as in "crm configure load replace cluster.conf":
>
> --------
> node tom attribute site=cityA
> node dick attribute site=cityA
> node harry attribute site=cityA
>
> node fred attribute site=cityB
> node george attribute site=cityB
> node ron attribute site=cityB
>
> primitive A-float IPaddr2 params ip=192.168.32.250 cidr_netmask=24 meta
> migration-threshold=3 failure-timeout=60 op monitor interval=5 timeout=20
> on-fail=restart
> primitive B-float IPaddr2 params ip=192.168.42.250 cidr_netmask=24 meta
> migration-threshold=3 failure-timeout=60 op monitor interval=5 timeout=20
> on-fail=restart
> primitive Asterisk asterisk meta migration-threshold=3 failure-timeout=60 op
> monitor interval=5 timeout=20 on-fail=restart
>
> group GroupA A-float4  resource-stickiness=100
> group GroupB B-float4  resource-stickiness=100
> group Anywhere Asterisk resource-stickiness=100
>
> location pref_A GroupA rule -inf: site ne cityA
> location pref_B GroupB rule -inf: site ne cityB
> location no_pref Anywhere rule -inf: site ne cityA and site ne cityB
>
> colocation Ast 100: Anywhere [ cityA cityB ]
>

You define a resource set, but there are no resources named cityA or
cityB, or at least you do not show them, so it is not clear what this
colocation does.

> property cib-bootstrap-options: stonith-enabled=no no-quorum-policy=stop

If connectivity between (any two) sites is lost, you may end up with
either A or B going out of quorum. While this will stop the active
resources and restart them on another site, there is no coordination
between stopping and starting, so for some time resources will be
active on both sites. It is up to you to evaluate whether this
matters.

If this matters, your solution does not protect against it.

If this does not matter, the usual response is: why do you need a
cluster in the first place? Why not simply run Asterisk on both
sites all the time?

> start-failure-is-fatal=false cluster-recheck-interval=60s
> --------
>
> Of course, the group definitions are not needed for single resources, but I
> shall in practice be using multiple resources which do need groups, so I
> wanted to ensure I was creating something which would work with that.
>
> I have tested it by:
>
...
>  - causing a network failure at one city (so it simply disappears without
> stopping corosync neatly): the other city continues its resources (plus the
> "anywhere" resource), the isolated city stops
>

If the site is completely isolated it probably does not matter whether
anything is active there. It is partial connectivity loss where it
becomes interesting.
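
For reference, the vote arithmetic behind the 6-node versus 7-node
observation can be sketched as below. This is only an illustration,
assuming one vote per node and a plain majority rule (more than half
of all votes); the function names and the per-site vote table are made
up for the example and are not part of any cluster tool.

```python
def quorum_threshold(total_votes: int) -> int:
    """Smallest number of votes that is a strict majority."""
    return total_votes // 2 + 1


def has_quorum(partition_votes: int, total_votes: int) -> bool:
    """True if a partition holding partition_votes is quorate."""
    return partition_votes >= quorum_threshold(total_votes)


# Assumption: one vote per node; three nodes each in city A and city B,
# plus the single extra node in city C.
SITE_VOTES = {"cityA": 3, "cityB": 3, "cityC": 1}
TOTAL = sum(SITE_VOTES.values())

if __name__ == "__main__":
    # Six nodes: one city on its own holds 3 of 6 votes, below the
    # threshold of 4.
    print("6 nodes, 3 votes:", has_quorum(3, 6))

    # Seven nodes: a completely isolated city A still has only 3 votes,
    # while city B together with city C has 4.
    isolated = SITE_VOTES["cityA"]
    remainder = TOTAL - isolated
    print("7 nodes, isolated city:", has_quorum(isolated, TOTAL))
    print("7 nodes, remainder:", has_quorum(remainder, TOTAL))
```

This only covers a clean split into two groups. It says nothing about
partial connectivity loss, where membership is not a simple sum of
site votes.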