[ClusterLabs] multi-site clusters vs disaster recovery clusters

Wed Feb 5 13:10:35 EST 2020

05.02.2020 18:16, Олег Самойлов wrote:
> Hi all.
> 
> I am reading the documentation about new (for me) pacemaker, which came with RedHat 8.
> 
> And I see two different chapters that both try to solve exactly the same problem.
> 
> One is CONFIGURING DISASTER RECOVERY CLUSTERS (pcs dr):
> 
> This is about infrastructure to create two different clusters on different sites with manual switching between them.
> 
> And CONFIGURING MULTI-SITE CLUSTERS WITH PACEMAKER (pcs booth):
> 
> This is also about the same thing: creating two different clusters on different sites, with automatic switching but without some of the features of dr.
> 
> IMHO, because both features are about the same thing, it would be worth uniting them in the documentation and as a single feature. Or maybe there is a reason to keep them separate? Maybe I don't understand something?

A (multi-site) high availability cluster and disaster recovery solve
entirely different problems.

The task of an HA cluster is to automatically restore access to a service
with minimal downtime. The fundamental prerequisite is that every node
that can take over a service (or resource) has access to up-to-date data
for that resource. Otherwise resource fail-over would result in silent
data loss or other inconsistencies. This severely limits the maximum
distance between sites: once you go beyond several dozen kilometers, the
latency of synchronously replicating data becomes too high. There may be
special workloads that tolerate it, but in general the limit is more or
less a metro area.
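
To put rough numbers on the distance limit, here is a back-of-the-envelope
sketch. It assumes a signal speed in optical fiber of about 200,000 km/s
(roughly two thirds of the speed of light in vacuum) and counts propagation
delay only; the function name and the sample distances are purely
illustrative. Real links add switching, protocol and storage overhead on
top, so treat the result as a lower bound per synchronous write.

```python
# Propagation-only lower bound on the round trip a synchronous write must wait for.
# Assumption: signal speed in optical fiber is about 200,000 km/s (~2/3 of c).
FIBER_KM_PER_S = 200_000.0


def min_round_trip_ms(distance_km: float) -> float:
    """Return the minimum round-trip time in milliseconds over distance_km of fiber."""
    # The write travels to the remote site and the acknowledgement travels back.
    return 2.0 * distance_km / FIBER_KM_PER_S * 1000.0


if __name__ == "__main__":
    for km in (10, 50, 100, 1000):
        print(f"{km:>5} km: at least {min_round_trip_ms(km):.2f} ms per synchronous write")
```

Every write that has to be acknowledged by the remote site pays at least
this delay, which is why the penalty grows linearly with the distance
between sites.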

The goal of disaster recovery is to protect against the catastrophic loss
of a whole area, together with the data and the infrastructure used to
access it. To minimize the impact of a disaster, the secondary site is
located at a far greater distance, which inevitably means it cannot have
fully up-to-date data. So the decision to accept data loss and continue
operation is always manual. It may be more acceptable to wait until the
primary site returns to operation (or to try to rescue the latest data
from it).

It is true that a "disaster recovery site" is often located in the same
metropolitan area, so a stretched cluster can cover it.