[Pacemaker] Split Site 2-way clusters

Andrew Beekhof andrew at beekhof.net
Wed Jan 13 03:23:52 EST 2010


On Wed, Jan 13, 2010 at 8:19 AM, Miki Shapiro <Miki.Shapiro at coles.com.au> wrote:

>  Separate to my earlier post re CRM DC election in a 2-way cluster, I’m
> chasing up the (separate) issue of making the cluster a CROSS-SITE one.
>
> As stated in my other thread, I’m running a 2-way, quorum-agnostic cluster
> on SLES11 with openais, pacemaker, drbd (… clvm, ocfs2, ctdb, nfs, etc.) on
> HP Blades.
>
> A few old threads (with a rather elaborate answer from Lars) indicate that,
> as of March 2009, split-site wasn’t yet thoroughly supported, as WAN
> connectivity issues were not fully addressed, and that quorumd was not yet
> sufficiently robust/tested/PROD-ready.
>
> What we decided we want to do is rely on an extremely simple (and,
> hopefully by inference, predictable and reliable) arbitrator - a THIRD
> linux server that lives at a SEPARATE THIRD site altogether, with no
> special HA-related daemons running on it.
>
> I’ll build a STONITH ocf script and configure it as a cloned STONITH
> resource running on both nodes. When pinging the other node (via either
> one or two redundant links) fails, it will do roughly this:
>
> ssh arbitrator mkdir /tmp/$clustername && shoot-other-node ||
> hard-suicide-NOW
>
> Thus, when split, the nodes race to the arbitrator.
>
> The first to run the mkdir command on the arbitrator (and get rc=0) wins,
> gets the long straw, and lives. The loser gets shot (either by its peer, if
> the WAN allows the peer to reach the soon-to-be-dead node’s iLO, or by said
> node sealing its own fate).
>
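> Fleshed out a little, as an untested sketch (the arbitrator hostname and
> the fencing helpers are placeholders):
>
>     #!/bin/bash
>     # crude arbiter race: the first node to create the lock dir wins
>     ARBITRATOR=arbitrator.example.com   # placeholder: third-site box
>     CLUSTERNAME=mycluster               # placeholder
>
>     if ssh "$ARBITRATOR" mkdir "/tmp/$CLUSTERNAME"; then
>         # we won the race: fence the peer via its iLO
>         shoot-other-node                # placeholder fencing helper
>     else
>         # we lost, or can't reach the arbitrator: self-fence immediately,
>         # e.g. echo b > /proc/sysrq-trigger
>         hard-suicide-NOW                # placeholder
>     fi
>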
> Primary failure modes not accounted for by a run-of-the-mill non-split-site
> cluster are thus:
>
> 1. One node cut off – the cut-off node will fail the race and suicide.
> The good node will succeed and proceed to provide service.
>
> 2. Nodes cut off from each other, but both can access the arbitrator – the
> slower node will suicide; the faster node will succeed and proceed to
> provide service.
>
> 3. Both nodes are cut off, or the comms issue affects both node1<->node2
> comms AND all ->arbitrator comms (double failure) – both nodes suicide
> (and potentially leave me with two inconsistent and potentially corrupt
> filesystems). I can’t see an easy way around this one (can anyone?)
>

Basically, that's the part that the stuff we haven't written yet is supposed
to address.

You want to avoid the "|| hard-suicide-NOW" part of your logic, but you
can't safely do that unless there is some way to stop the services on the
non-connected node(s) - preferably _really_ quickly.

What about setting no-quorum-policy to freeze and making the third node a
full cluster member (that just doesn't run any resources)?
That way, if you get a 1-1-1 split, the nodes will leave all services running
where they were while they wait for quorum.
And if it heals into a 1-2 split, the majority will terminate the rogue
node and acquire all the services.
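
Something like this, as a rough sketch in the crm shell (the node and
resource names are placeholders):

    # freeze, rather than stop, everything when quorum is lost
    crm configure property no-quorum-policy=freeze

    # the third node joins the cluster but never runs resources: keep it
    # in standby (it still counts towards quorum), or add a -inf location
    # constraint per resource, e.g.:
    crm node standby arbitrator
    crm configure location drbd-not-on-arbitrator ms-drbd -inf: arbitrator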

The biggest problem is the reliability of your links and stonith devices -
give particular thought to how you'd fence _node_ A if comms to _site_ A are
down....

>
> Looks to me like this can easily be implemented without any fancy quorum
> servers (on top of the required little ocf script and the existence of the
> arbitrator).
>
> Does anyone have thoughts on this? Am I ignoring any major issues, or
> reinventing the wheel, or should this potentially work as I think it will?
>
> Thanks! :)
>
> And a little addendum which just occurred to me re transient WAN network
> issues:
>
> 1. Transient big (>2min) network issues will land me with a cluster that
> needs a human to turn one node back on every time they happen. Bad.
>
> My proposed solution: classify a peer failure as a WAN problem by pinging
> the peer node’s core router when the peer appears dead; if the router is
> dead too, touch a WAN-problem flag-file. As long as the flag-file sits
> there, the survivor pings the other side’s router (done via the ocf ping
> resource) until it comes back online, then shoots a “check-power-status &&
> O-GOD-IT-STILL-LIVES-KILL-IT-NOW || power-it-on” command to the peer’s iLO
> (and promptly deletes the flag). A rough sketch follows below.
>
> Implementation cost: a wee bit of scripting and a wee bit of pacemaker
> configuration.
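>
> As an untested sketch (the router address, flag path, and the iLO helper
> commands above are placeholders):
>
>     #!/bin/bash
>     # run on the survivor while the WAN-problem flag-file exists
>     FLAG=/var/run/wan-problem        # touched when peer AND its router die
>     PEER_ROUTER=192.0.2.1            # placeholder: peer site's core router
>
>     while [ -e "$FLAG" ]; do
>         if ping -c 3 -W 2 "$PEER_ROUTER" >/dev/null 2>&1; then
>             # WAN is back: if the peer still has power, kill it now;
>             # otherwise power it on - then clear the flag
>             check-power-status && O-GOD-IT-STILL-LIVES-KILL-IT-NOW \
>                 || power-it-on
>             rm -f "$FLAG"
>         fi
>         sleep 10
>     done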
>
> 2. Transient small network issues will require stretching pacemaker’s
> default timeouts sufficiently to avoid this (or they end up in the item-1
> bucket above). See the sketch below.
>
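> Roughly, in /etc/ais/openais.conf (a fragment; the values are illustrative
> guesses, not tuned numbers):
>
>     totem {
>         # how long (ms) without a token before a node is declared dead;
>         # raise it so short WAN blips don't trigger membership changes
>         token: 10000
>         token_retransmits_before_loss_const: 10
>     }
>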
> Am very keen to know what the gurus think :) :)
>
>
> Miki Shapiro
> Linux Systems Engineer
> Infrastructure Services & Operations
>
> 745 Springvale Road
> Mulgrave 3170 Australia
> Email: miki.shapiro at coles.com.au
> Phone: 61 3 854 10520
> Fax: 61 3 854 10558
>