[Pacemaker] Split Site 2-way clusters

Miki Shapiro Miki.Shapiro at coles.com.au
Wed Jan 13 19:40:45 EST 2010


When you suggest:
>>> What about setting no-quorum-policy to freeze and making the third node a full cluster member (that just doesn't run any resources)?
>>> That way, if you get a 1-1-1 split, the nodes will leave all services running where they were while they wait for quorum.
>>> And if it heals into a 1-2 split, the majority will terminate the rogue node and acquire all the services.

A no-quorum-policy of 'freeze' rather than 'stop' pretty much ASSURES me of a split brain for my fileserver cluster, which sounds like the last thing I want. Any data local clients write to the cut-off node (and its DRBD split-brain volume) cannot later be reconciled and will have to be discarded. I'd rather not give the local clients that can still reach that node a false sense of security that their data has been written to disk, when it went to a DRBD volume that will be blown away and resynced from the quorum side once connectivity is re-established. The 'stop' policy sounds safer.
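
For reference, a minimal sketch of how each policy would be set with the crm shell (this is just the standard cluster property; nothing else changes):

  # Stop all resources when quorum is lost - the behaviour I'm leaning towards:
  crm configure property no-quorum-policy=stop

  # Andrew's suggestion: keep already-running resources where they are and
  # start nothing new until quorum returns:
  crm configure property no-quorum-policy=freeze
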
Question 1: Am I missing anything re stop/ignore no-quorum policy?

Further, I'm having trouble working out a tabulated list of failure modes for this 3-way scenario once three outage-prone WAN links are introduced.
Question 2: If one WAN link is broken - (A) can speak to (B), (B) can speak to (C), but (A) CANNOT speak to (C) - what drives the quorum decision, and what would happen? In particular, what would happen if the node that can see both is the DC?

Question 3: Just to verify I got this right - what drives Pacemaker's STONITH events:
[a] RESOURCE monitoring failure, or
[b] the CRM's crosstalk that establishes quorum state / DC election?


From: Andrew Beekhof [mailto:andrew at beekhof.net]
Sent: Wednesday, 13 January 2010 7:24 PM
To: pacemaker at oss.clusterlabs.org
Subject: Re: [Pacemaker] Split Site 2-way clusters


On Wed, Jan 13, 2010 at 8:19 AM, Miki Shapiro <Miki.Shapiro at coles.com.au> wrote:
Separately from my earlier post about CRM DC election in a 2-way cluster, I'm chasing up the issue of making the cluster a CROSS-SITE one.

As stated in the other thread, I'm running a 2-way quorum-agnostic cluster with SLES11, openais, pacemaker and drbd (... clvm, ocfs2, ctdb, nfs, etc.) on HP Blades.

A few old threads (with a rather elaborate answer from Lars) indicate that as of March 2009 split-site wasn't yet properly supported, as WAN connectivity issues hadn't been fully addressed, and that quorumd was not yet sufficiently robust/tested/PROD-ready.

What we decided we want to do is rely on an extremely simple (and hopefully, by inference, predictable and reliable) arbitrator - a THIRD Linux server that lives at a SEPARATE THIRD site altogether, with no special HA-related daemons running on it.

I'll build a STONITH OCF script, configure it as a cloned STONITH resource running on both nodes, and it will do roughly this when pinging the other node (via one or two redundant links) fails:

ssh arbitrator mkdir /tmp/$clustername && shoot-other-node || hard-suicide-NOW

Thus, when split, the nodes race to the arbitrator.
The first to run the mkdir command on the arbitrator (and get rc=0) wins, gets the long straw and lives. The loser gets shot - either by its peer, if the WAN allows the peer to reach the soon-to-be-dead node's iLO, or by said node sealing its own fate.
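
Roughly, the guts of that script might look like the following (host names, the iLO command and the timeout are placeholders, not tested code):

  #!/bin/sh
  # Race to the arbitrator: the first node to create the lock directory wins.
  CLUSTER="ourcluster"                  # placeholder cluster name
  ARBITRATOR="arbitrator.third-site"    # placeholder third-site host
  PEER_ILO="peer-ilo.other-site"        # placeholder iLO address of the peer

  if ssh -o ConnectTimeout=5 "$ARBITRATOR" "mkdir /tmp/$CLUSTER"; then
      # Won the race: fence the peer through its iLO (command illustrative).
      ssh "$PEER_ILO" "power off hard"
  else
      # Lost the race, or can't reach the arbitrator at all: die immediately,
      # without syncing, so the survivor can safely take over
      # (requires kernel.sysrq to be enabled).
      echo o > /proc/sysrq-trigger
  fi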

Primary failure modes not accounted for by a run-of-the-mill non-split-site cluster are thus:


1. One node cut off - the cut-off node fails the race and suicides; the good node succeeds and proceeds to provide service.

2. Nodes cut off from each other, but both can reach the arbitrator - the slower node suicides; the faster node succeeds and proceeds to provide service.

3. Both nodes are cut off, or the comms issue affects both the node1<->node2 link AND all links to the arbitrator (a double failure) - both nodes suicide (and potentially leave me with two inconsistent, potentially corrupt filesystems). Can't see an easy way around this one (can anyone?)

Basically that's the part that the stuff we haven't written yet is supposed to address.

You want to avoid the "|| hard-suicide-NOW" part of your logic, but you can't safely do that unless there is some way to stop the services on the non-connected node(s) - preferably _really_ quickly.

What about setting no-quorum-policy to freeze and making the third node a full cluster member (that just doesn't run any resources)?
That way, if you get a 1-1-1 split, the nodes will leave all services running where they were while they wait for quorum.
And if it heals into a 1-2 split, the majority will terminate the rogue node and acquire all the services.
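
A rough sketch of one way to keep that third node resource-free while it still contributes a quorum vote (node and constraint names are placeholders):

  # Option 1: leave the quorum-only node permanently in standby:
  crm node standby arbitrator-node

  # Option 2: per resource (or group), an explicit -INFINITY location
  # constraint keeps it off the arbitrator:
  crm configure location fs-not-on-arbitrator fs-group -inf: arbitrator-node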

The biggest problem is the reliability of your links and stonith devices - give particular thought to how you'd fence _node_ A if comms to _site_ A are down....



Looks to me like this can easily be implemented without any fancy quorum servers (beyond the required little OCF script and the existence of the arbitrator).

Does anyone have thoughts on this? Am I ignoring any major issues, or reinventing the wheel, or should this potentially work as I think it will?

Thanks! :)

And a little addendum which just occurred to me re transient WAN network issues:



1. Transient big (>2 min) network issues will land me with a cluster that needs a human to turn one node back on every time they happen. Bad.



My proposed solution: classify a peer failure as a WAN problem by pinging the peer node's core router when the peer appears dead. If the router is dead too, touch a WAN-problem flag file. So long as the flag file sits there, the survivor keeps pinging the other side's router (via the ocf ping resource) until it comes back online, then sends a "check-power-status && O-GOD-IT-STILL-LIVES-KILL-IT-NOW || power-it-on" command to the peer's iLO and promptly deletes the flag (rough sketch after this list).



Implementation cost: a wee bit of scripting and a wee bit of pacemaker configuration.



2. Transient small network issues will require stretching Pacemaker's default timeouts far enough to ride them out (or they end up in the item 1 bucket above).
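
A rough sketch of the item-1 logic above, to make the idea concrete (host names, iLO commands and the flag path are placeholders; the real thing would hang off the ocf ping resource plus a little glue):

  #!/bin/sh
  # Runs on the surviving node once the peer looks dead.
  PEER_ROUTER="core-router.other-site"   # placeholder
  PEER_ILO="peer-ilo.other-site"         # placeholder
  FLAG=/var/run/wan-problem

  # If the peer's core router is unreachable too, call it a WAN problem.
  ping -c 3 -w 10 "$PEER_ROUTER" >/dev/null 2>&1 || touch "$FLAG"

  # While the flag exists, wait for the remote site to come back, then check
  # the peer's power state via iLO: still on => shoot it; off => power it on.
  while [ -e "$FLAG" ]; do
      if ping -c 1 -w 5 "$PEER_ROUTER" >/dev/null 2>&1; then
          if ssh "$PEER_ILO" power 2>/dev/null | grep -qi ": on"; then
              ssh "$PEER_ILO" "power off hard"   # it still lives - kill it now
          else
              ssh "$PEER_ILO" "power on"         # it suicided earlier - turn it back on
          fi
          rm -f "$FLAG"
      fi
      sleep 10
  done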

Am very keen to know what the gurus think :) :)

Miki Shapiro
Linux Systems Engineer
Infrastructure Services & Operations

745 Springvale Road
Mulgrave 3170 Australia
Email: miki.shapiro at coles.com.au
Phone: 61 3 854 10520
Fax: 61 3 854 10558



_______________________________________________
Pacemaker mailing list
Pacemaker at oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

