No subject

Wed Mar 13 22:36:09 EDT 2019


I'd like to get a clear idea of what the roadblocks --actually are-- (not
on a "The WAN link" level but what the WAN link -actually breaks-) to
doing what I suggested.

Assuming I can get it to work, are there any other specific reasons it wou=

To recap, in my proposed solution, an outage will result in four things:
1. A "race" by both nodes to a 3rd site, to perform an atomic operation (a
   mkdir, for instance). Following it, it will be abundantly clear to both
   nodes "who is right, and who is dead".
2. A hard-iLO-poweroff STONITH (NOT reboot!) from the winner to the
   loser's iLO. The winner can also iptables-block all comms from the
   loser until further notice as an extra safety net.
3. A hard-own-iLO-poweroff-else-kernel-halt SMITH (NOT reboot!) suicide by
   the loser (SMITH is our pet acronym for Shoot-Myself-...).
4. A "WAN-PROBLEM=[true|false]" flag immediately raised (locally) by the
   winner based on pinging the OTHER SITE's ROUTER. A separate resource on
   the winner will, in the presence of this flag, monitor that same router
   for life. When the other site comes back up (perhaps
   -and-stays-up-for-an-hour- or some similar flap-avoiding logic), it
   issues a POWERON to the other node's iLO. That node will then come back
   up as a drbd slave, resync and get re-promoted to master.

As an attractive side-benefit, this is a deathmatch-proof design.
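
The "stays up for an hour" condition in step 4 could be sketched as
follows. This is only an illustration: StableUpDetector and hold_seconds
are hypothetical names, and the ping of the other site's router and the
iLO power-on call are left to the caller.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StableUpDetector:
    """Report True only after the router has been reachable continuously
    for hold_seconds. Any failed probe resets the clock."""

    hold_seconds: float = 3600.0
    up_since: Optional[float] = None

    def observe(self, now: float, reachable: bool) -> bool:
        if not reachable:
            self.up_since = None  # a flap: start over
            return False
        if self.up_since is None:
            self.up_since = now  # first successful probe of this run
        return now - self.up_since >= self.hold_seconds


if __name__ == "__main__":
    det = StableUpDetector(hold_seconds=3600)
    # (timestamp in seconds, ping result) pairs, supplied by the caller
    probes = [(0, True), (1800, True), (2000, False), (2100, True), (5700, True)]
    for now, ok in probes:
        if det.observe(now, ok):
            print("other site stable at t=%d, issue POWERON" % now)
```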


NOTE: There's a departure from common wisdom here, and I am not sure
whether this is one of the issues you're pointing at.
Common wisdom states: SMITH BAD, not reliable (obvious reasons - no
success/failure etc)

In this solution I claim: SMITH BAD, not reliable, except in one specific
failure mode (WAN outage) where SMITH GOOD, is reliable, shortcomings can
be worked around.

Both steps [2] and [3] are issued on EVERY TYPE of outage, regardless of
whether it's WAN-related or not.
In non-WAN issues the loser is considered compromised, thus making [3]
unreliable, but [2] is reliable.
In WAN issues, the WAN is considered compromised, thus making [2]
unreliable, but the node itself is sound, so [3] still is reliable.
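
The argument above can be written down as a small table. This is just a
restatement of the claim under its own assumptions (exactly one of "WAN"
or "node" has failed), not a proof; the names Failure and reliable_steps
are hypothetical.

```python
from enum import Enum


class Failure(Enum):
    WAN = "wan"    # link between sites is down, both nodes are healthy
    NODE = "node"  # the loser itself is compromised, the WAN is fine


def reliable_steps(failure: Failure) -> set:
    """Which of steps [2] (STONITH) and [3] (SMITH) is assumed to work.

    Both steps are always issued; this only says which one can be trusted.
    """
    if failure is Failure.WAN:
        return {"SMITH"}    # winner cannot reach the loser's iLO
    return {"STONITH"}      # loser cannot be trusted to shoot itself


if __name__ == "__main__":
    for f in Failure:
        print(f.value, "->", sorted(reliable_steps(f)))
```

A double failure (WAN down and loser compromised) leaves neither step
reliable, which is the case the design does not claim to cover.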

To sum up, it looks to me like the "data safety" is provided by the layer
underneath DRBD, not DRBD itself, and if it works as advertised, DRBD
should have no problem, thus we have a system sufficiently reliable to
withstand any scenario short of a double failure.

... thoughts?

-----Original Message-----
From: Florian Haas [mailto:florian.haas at]
Sent: Monday, 18 January 2010 9:36 PM
To: pacemaker at
Subject: Re: [Pacemaker] Split Site 2-way clusters

On 2010-01-18 11:14, Andrew Beekhof wrote:
> On Thu, Jan 14, 2010 at 11:44 PM, Miki Shapiro
> <Miki.Shapiro at> wrote:
>> Confused.
>> I *am* running DRBD in dual-master mode
> /me cringes... this sounds to me like an impossibly dangerous idea.
> Can someone from linbit comment on this please?  Am I imagining this?

Dual-Primary DRBD in a split site cluster? Really really bad idea.
Anyone attempting this, please search the drbd-user archives for multiple
discussions about this in the past. Then reconsider.

Hope that makes it clear enough.

