[Pacemaker] DRBD Split Brain after each reboot

Lars Ellenberg lars.ellenberg at linbit.com
Tue Dec 22 13:02:30 EST 2009


On Tue, Dec 22, 2009 at 06:29:09PM +0100, andschais at gmail.com wrote:
> Thanks for your reply Lars, I was pretty sure that start-delay was just a
> workaround and not a fix, but it looks like a "not waiting long enough"
> problem.
> In fact, after searching for a while I found this bug in the Debian init script:
> 
> http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=556533
> 
> After applying the patch, my 2-node cluster works perfectly: no more split
> brain after reboot, and no errors in the log files.

That patch is simply wrong.
please don't do that.

what it did:
 send a QUIT, wait 5 seconds, then send a KILL.

no wonder that didn't work.

what it does now:
 send a QUIT, wait 70 seconds, then send a KILL.

obviously, that does fix it.
NOT.

If someone has resources with long stop timeouts (Oracle shutdown, SAP
shutdown, similar things that take a while to shut down cleanly), which
*in sum* end up taking more than 70 seconds, that is still broken.

note the "in sum" part. just having 15 resources stopping sequentially,
each taking 5 seconds to shut down cleanly (75 seconds in total),
triggers the KILL again!
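
To make the "in sum" arithmetic concrete, here is a small sketch in Python.
The function name is made up for illustration; the only numbers taken from
this mail are the 70 second limit and the 15 resources at 5 seconds each.

```python
def exceeds_retry_limit(stop_times, limit_seconds):
    """Return True if stopping everything sequentially takes longer than the limit."""
    return sum(stop_times) > limit_seconds


# 15 resources stopping one after another, 5 seconds each
stop_times = [5] * 15
total = sum(stop_times)

# With a 70 second limit, the KILL would still be sent before the last
# resource has finished stopping.
print(total, exceeds_retry_limit(stop_times, 70))
```

No single resource is slow here; it is only the sequential total that
crosses the limit.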

As Beekhof said in the relevant bugzilla there,
there SHOULD NOT be an upper limit.

If there has to be one because of some "policy",
then at least make it default to an hour,
and make it configurable via some defaults file,
and escalate to TERM, not to KILL.
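
A rough sketch of that policy in Python, only to illustrate the idea. The
variable name STOP_TIMEOUT and the KEY=VALUE format of the defaults file are
assumptions made up for this example, not something the init script is known
to support.

```python
DEFAULT_STOP_TIMEOUT = 3600  # one hour, the default suggested above


def read_stop_timeout(defaults_text):
    """Return STOP_TIMEOUT (hypothetical name) in seconds from KEY=VALUE text."""
    timeout = DEFAULT_STOP_TIMEOUT
    for raw_line in defaults_text.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop comments and whitespace
        if "=" not in line:
            continue
        key, value = line.split("=", 1)
        if key.strip() == "STOP_TIMEOUT":
            timeout = int(value.strip().strip("\"'"))  # last assignment wins
    return timeout


def stop_schedule(defaults_text=""):
    """Build the signal schedule: QUIT first, escalate to TERM, never to KILL."""
    timeout = read_stop_timeout(defaults_text)
    # None means: after TERM, keep waiting without a further deadline.
    return [("QUIT", timeout), ("TERM", None)]


print(stop_schedule())
print(stop_schedule('STOP_TIMEOUT="7200"  # two hours\n'))
```

With an empty defaults file the schedule falls back to one hour; a site with
slow shutdowns can raise the value without patching the script.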

but Debian seems to consider the bug closed
 :(

so, a better patch is probably:
remove the --retry $whatever_seconds,
so it simply sends the QUIT once,
and then waits for it to take effect.
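
The behaviour described here (one QUIT, then wait without an upper limit)
can be sketched in Python as follows. This is not the init script itself:
the child process is a stand-in for the cluster daemon, written only for
this example, and quit_and_wait is a made-up name.

```python
import signal
import subprocess
import sys

# Stand-in daemon: on SIGQUIT it takes a moment to "stop its resources"
# and then exits cleanly. Purely illustrative.
CHILD = r"""
import signal, sys, time

def on_quit(signum, frame):
    time.sleep(0.5)  # pretend to stop resources cleanly
    sys.exit(0)

signal.signal(signal.SIGQUIT, on_quit)
print("ready", flush=True)
while True:
    signal.pause()
"""


def quit_and_wait(proc):
    """Send QUIT exactly once, then wait for the process with no deadline."""
    proc.send_signal(signal.SIGQUIT)
    return proc.wait()  # no timeout, so no escalation to KILL


proc = subprocess.Popen(
    [sys.executable, "-c", CHILD], stdout=subprocess.PIPE, text=True
)
proc.stdout.readline()  # block until the child has installed its handler
exit_code = quit_and_wait(proc)
proc.stdout.close()
print("daemon exited with", exit_code)
```

However long the clean shutdown takes, the caller just keeps waiting, so
the daemon is never killed in the middle of stopping resources.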

any debian maintainer around?
is this fixed now?
if not, would reopening the bug help?


-- 
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.



