[ClusterLabs] Data Centers and Pacemaker

Thu Nov 14 21:37:11 UTC 2024

Hello,

This is a long message. I am not sure it is really appropriate for the Pacemaker list, but you guys have a lot of experience... Happy to be told it is not appropriate.

We have two data centers about 13 km apart, connected by multiple dark fibres. Latency is about 0.2-0.3 ms; I forget whether that is round trip or one way. I am not convinced that an extra fibre via an independent provider, to make a redundant connection between the two sites, would be financially possible. Plus there is a view that we should not rely on such things. We would like to move to more "modern" tech that has clustering built in, commodity hardware and, at some point, a 3rd data center.

Initially I was thinking of just moving floating IP addresses between the two sites and running synchronous DB replication (it's a SAP installation, but that is not that relevant, I think). But this approach does not work anymore. Then I realised that, as a company, we should not rely on such things, as said above.

So instead we would do synchronous DB replication within a site and asynchronous replication between sites.

This is a bit more complex, as I need to have an odd number of nodes on each side and ensure there is no single point of failure on either side.
As an optimisation, it would be OK for one site to be a mini site without a single point of failure; after a failover/takeover to the mini site we would just switch back to the main site, if you know what I mean.
(The main site would then be 5 nodes in a Pacemaker cluster and the mini site 3 nodes in its cluster.)
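
For reference, the node counts above work out as follows under simple majority quorum. This is only a minimal sketch, assuming each site runs its own cluster with one vote per node and no quorum device; the function names are hypothetical and are not part of Pacemaker or corosync.

```python
def quorum_votes(total_nodes: int) -> int:
    """Votes needed for a strict majority, assuming one vote per node."""
    return total_nodes // 2 + 1


def tolerated_failures(total_nodes: int) -> int:
    """Nodes that can be lost while the remainder still holds a majority."""
    return total_nodes - quorum_votes(total_nodes)


# Site sizes taken from the message: 5 nodes (main) and 3 nodes (mini).
for name, nodes in (("main site", 5), ("mini site", 3)):
    print(
        f"{name}: {nodes} nodes, quorum at {quorum_votes(nodes)} votes, "
        f"survives {tolerated_failures(nodes)} node failure(s)"
    )
```

Under the same assumptions an even node count adds no extra tolerance (4 nodes still survive only 1 failure), which is the usual reason for keeping the count odd.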

I am not worried about scaling out; we can just add nodes 2 at a time at both sites.

Failover within a site would eventually be automated using Pacemaker, and planned takeovers between sites we could do by manually telling Pacemaker on both sides what to do. Obviously we would test this all out, get it certified, etc.

I hope I can use the terms failover for unplanned and takeover for planned. 🙂 Our initial goal is to reduce planned downtime to zero (we do not have that now for upgrades, patching, etc.) and to move to RPO 0 and minimal RTO.

As we do not have really redundant networks, being dependent on quorum devices is not so good, because if the quorum device is lost the whole cluster goes down. And as I understand it, you can only have one quorum device. So that's a SPOF. So instead I have odd numbers of nodes in the Pacemaker cluster in each data center. For me that's OK, and somehow I think it is better than quorum devices.

We use VMware (sigh...) and NetApp.

In terms of fencing, we are trying to fence using industry standards, e.g. not going through the VMware management console but using more standard protocols, e.g. via shared storage. I think I can make a good case for self-fencing using a watchdog, as I understand this is the minimum that SBD needs. I found that statement on a page on the old ClusterLabs website; I have not looked at the new one.

So what are my questions?

  1. Am I right that the quorum device is a single point of failure? Just out of interest.
  2. If we ever want to somehow automate or semi-automate failover between data centers using Booth, is this a good idea? I looked a bit for documentation on Booth; I should look harder. But from gut feel, is Booth feasible here? Is there any alternative?
  3. Is watchdog-only fencing using SBD the absolute minimum?
  4. Would you recommend, in addition to the watchdog, doing resource fencing, i.e. taking the storage away or virtually pulling the Ethernet cable (not sure how that works, though)? Or just node fencing in addition to the watchdog via some defined way?
  5. Does using shared storage in SBD for poison pills really give me anything? I can't justify to myself that it does. Does it give anything else besides poison pills?
  6. Have I forgotten a topic? 😉

Sorry for typos and grammar mistakes, it is late over here.

regards
Angelo
