[ClusterLabs] Sub‑clusters / super‑clusters - working :)

Fri Aug 6 08:41:59 EDT 2021

On Friday 06 August 2021 at 14:14:09, Andrei Borzenkov wrote:

> On Thu, Aug 5, 2021 at 3:44 PM Antony Stone wrote:
> > 
> > For anyone interested in the detail of how to do this (without needing
> > booth), here is my cluster.conf file, as in "crm configure load replace
> > cluster.conf":
> > 
> > --------
> > node tom attribute site=cityA
> > node dick attribute site=cityA
> > node harry attribute site=cityA
> > 
> > node fred attribute site=cityB
> > node george attribute site=cityB
> > node ron attribute site=cityB
> > 
> > primitive A-float IPaddr2 params ip=192.168.32.250 cidr_netmask=24 meta
> > migration-threshold=3 failure-timeout=60 op monitor interval=5 timeout=20
> > on-fail=restart
> > primitive B-float IPaddr2 params ip=192.168.42.250 cidr_netmask=24 meta
> > migration-threshold=3 failure-timeout=60 op monitor interval=5 timeout=20
> > on-fail=restart
> > primitive Asterisk asterisk meta migration-threshold=3 failure-timeout=60
> > op monitor interval=5 timeout=20 on-fail=restart
> > 
> > group GroupA A-float resource-stickiness=100
> > group GroupB B-float resource-stickiness=100
> > group Anywhere Asterisk resource-stickiness=100
> > 
> > location pref_A GroupA rule -inf: site ne cityA
> > location pref_B GroupB rule -inf: site ne cityB
> > location no_pref Anywhere rule -inf: site ne cityA and site ne cityB
> > 
> > colocation Ast 100: Anywhere [ cityA cityB ]
> 
> You define a resource set, but there are no resources cityA or cityB,
> at least you do not show them. So it is not quite clear what this
> colocation does.

Apologies - I had used different names in my test setup, and converted them to 
cityA etc for the sake of continuity in this discussion.

That should be:

	colocation Ast 100: Anywhere [ GroupA GroupB ]

> > property cib-bootstrap-options: stonith-enabled=no no-quorum-policy=stop
> 
> If connectivity between (any two) sites is lost you may end up with
> one of A or B going out of quorum.

Agreed.
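
To make the quorum point concrete, here is a minimal Python sketch of plain
majority voting. It assumes one vote per node and no special votequorum
options; the function names and the vote counts in the usage note are
hypothetical and only there for illustration.

```python
def has_quorum(partition_votes, expected_votes):
    """Return True if a partition holds a strict majority of the expected votes."""
    return partition_votes >= expected_votes // 2 + 1


def surviving_partitions(partitions):
    """Return the names of the partitions that keep quorum after a split.

    `partitions` maps a partition name to the number of votes it holds.
    With no-quorum-policy=stop, every partition NOT in the returned list
    stops all of its resources.
    """
    expected = sum(partitions.values())
    return [name for name, votes in partitions.items()
            if has_quorum(votes, expected)]
```

For example, surviving_partitions({"cityA": 3, "cityB": 3}) returns an empty
list (an even split leaves nobody quorate), whereas a hypothetical seven-vote
cluster split as {"isolated": 3, "rest": 4} keeps only "rest" running.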

> While this will stop active resources and restart them on another site,

No.  Resources do not start on the "wrong" site because of:

	location pref_A GroupA rule -inf: site ne cityA
	location pref_B GroupB rule -inf: site ne cityB

The resources in GroupA either run in cityA or they do not run at all.
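
A rough Python model of what those two rules do, in case it helps: this is
only a sketch of the -inf scoring for nodes that have the "site" attribute
set, not how Pacemaker itself is implemented. The helper names are made up
for the example.

```python
NEG_INF = float("-inf")

# Node name -> value of its "site" attribute, as in the node definitions.
NODE_SITE = {
    "tom": "cityA", "dick": "cityA", "harry": "cityA",
    "fred": "cityB", "george": "cityB", "ron": "cityB",
}


def score(node_site, home_site):
    """Score given by 'rule -inf: site ne <home_site>' for a single node."""
    return NEG_INF if node_site != home_site else 0


def eligible_nodes(home_site, online):
    """Return the online nodes on which a site-bound group is allowed to run."""
    return sorted(node for node in online
                  if score(NODE_SITE[node], home_site) != NEG_INF)
```

With every node online, eligible_nodes("cityA", NODE_SITE) gives
['dick', 'harry', 'tom']. If only fred, george and ron are online it gives an
empty list, which is the "do not run at all" case.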

> there is no coordination between stopping and starting so for some time
> resources will be active on both sites. It is up to you to evaluate whether
> this matters.

Any resource which tried to start at the wrong site would simply fail, because 
the IP addresses involved do not work at the "other" site.

> If this matters your solution does not protect against it.
> 
> If this does not matter, the usual response is - why do you need a
> cluster in the first place? Why not simply always run asterisk on both
> sites all the time?

Because Asterisk at cityA is bound to a floating IP address, which is held on 
one of the three machines in cityA.  I can't run Asterisk on all three 
machines there because only one of them has the IP address.

Asterisk _does_ normally run on both sites all the time, but only on one 
machine at each site.

> > start-failure-is-fatal=false cluster-recheck-interval=60s
> > --------
> > 
> > Of course, the group definitions are not needed for single resources, but
> > I shall in practice be using multiple resources which do need groups, so
> > I wanted to ensure I was creating something which would work with that.
> 
> > I have tested it by:
> ...
> >  - causing a network failure at one city (so it simply disappears without
> > stopping corosync neatly): the other city continues its resources (plus
> > the "anywhere" resource), the isolated city stops
> 
> If the site is completely isolated it probably does not matter whether
> anything is active there. It is partial connectivity loss where it
> becomes interesting.

Agreed. However, my testing shows that the resources I want running in cityA
are either running there or not running at all (they never move to cityB or
cityC), and similarly for cityB. The resources I want just a single instance
of behave as intended too: one instance, on the same machine at cityA or
cityB as that site's local resources.
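
For anyone wanting to repeat this kind of test, the expectations can be
written down as a small check. This is a sketch only: it assumes you have
already collected, by whatever means, a mapping of resource name to the node
it is running on (None if stopped). The function name is hypothetical. Note
that the colocation score of 100 is a preference rather than a hard
requirement, so the second check describes what I observed, not something
the cluster guarantees.

```python
# Node name -> value of its "site" attribute, as in the node definitions.
NODE_SITE = {
    "tom": "cityA", "dick": "cityA", "harry": "cityA",
    "fred": "cityB", "george": "cityB", "ron": "cityB",
}


def check_placement(placement):
    """Return a list of violated expectations for one observed placement.

    `placement` maps a resource name to a node name, or to None if stopped.
    """
    problems = []

    # Site-bound groups must be on their own site or stopped.
    for group, site in (("GroupA", "cityA"), ("GroupB", "cityB")):
        node = placement.get(group)
        if node is not None and NODE_SITE[node] != site:
            problems.append(f"{group} is running outside {site}")

    # The "Anywhere" group should share a node with GroupA or GroupB.
    anywhere = placement.get("Anywhere")
    local = {placement.get("GroupA"), placement.get("GroupB")} - {None}
    if anywhere is not None and local and anywhere not in local:
        problems.append("Anywhere is not on the same node as GroupA or GroupB")

    return problems
```

An empty list means the placement matches what I described above.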

Thanks for the feedback,

Antony.

-- 
"Measuring average network latency is about as useful as measuring the mean 
temperature of patients in a hospital."

 - Stéphane Bortzmeyer

                                                   Please reply to the list;
                                                         please *don't* CC me.