[ClusterLabs] 2-node DRBD Pacemaker not performing as expected: Where to next?

Nickle, Richard rnickle at holycross.edu
Thu Aug 15 11:25:34 EDT 2019


My objective is a two-node active/passive DRBD cluster that fails over
automatically.  A secondary objective is to use standard, stock, supported
software distributions and repositories with as little customization as
possible.

I'm using Ubuntu 18.04.3, plus the DRBD, Corosync and Pacemaker packages
from the (LTS) repositories.  drbdadm reports version 8.9.10, Corosync is
2.4.3, and pcs is 0.9.164 (Pacemaker itself is the 1.1 series shipped with
18.04).
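
For reference, the exact versions can be confirmed on each node as follows
(note that pcs and Pacemaker report separate version numbers):

    drbdadm --version       # drbd-utils and kernel module versions
    corosync -v             # Corosync version
    pacemakerd --version    # Pacemaker daemon version
    pcs --version           # pcs CLI version (distinct from Pacemaker's)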

My test scenario: with both nodes up and running, I reboot, disconnect or
shut down one node, and after a delay the other node takes over.  That is
the case I want to cover: unexpected loss of a node.  The application is
supplementary rather than life-safety or mission-critical, but uptime will
be measured, and the goal is to stay above four nines annually.

All of this is working in the sense that I can fail over manually by
telling pcs to move my resource from one node to the other.  But if I
reboot the primary node, the failover does not complete until the primary
is back online.  Occasionally these hard kills also produce split-brain,
which requires manual recovery.
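
For reference, the manual move and the split-brain recovery I've been doing
look roughly like this (resource and node names are just examples):

    # manual failover of the DRBD master/slave resource
    pcs resource move drbd-master node2

    # split-brain recovery on the node whose changes I discard
    drbdadm disconnect r0
    drbdadm secondary r0
    drbdadm connect --discard-my-data r0

    # on the surviving node, reconnect if it dropped to StandAlone
    drbdadm connect r0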

I added STONITH and watchdog using SBD with an iSCSI block device and
softdog.
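
Roughly, the SBD piece looks like this (the device path is an example, not
my exact LUN):

    # /etc/default/sbd -- shared iSCSI disk plus software watchdog
    SBD_DEVICE="/dev/disk/by-id/scsi-<iscsi-lun>"
    SBD_WATCHDOG_DEV="/dev/watchdog"
    SBD_STARTMODE="always"

    # load softdog at boot
    echo softdog > /etc/modules-load.d/softdog.conf

    # register the fence agent and turn fencing on
    pcs stonith create fence-sbd fence_sbd devices="/dev/disk/by-id/scsi-<iscsi-lun>"
    pcs property set stonith-enabled=true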

I added a qdevice to get an odd number of quorum votes.
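
The qdevice setup was along these lines (the qnetd host name is a
placeholder):

    # on the third machine acting as the quorum arbitrator
    apt install corosync-qnetd

    # on both cluster nodes
    apt install corosync-qdevice
    pcs quorum device add model net host=qnetd-host algorithm=ffsplit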

When I run crm_simulate against this configuration, the simulation says
that if the primary node goes down, the resource will be promoted on the
secondary.
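
For reference, the simulation I ran was along these lines (the node name is
an example):

    # replay the live CIB and inject a failure of the primary node
    crm_simulate --simulate --live-check --node-fail node1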

And yet I still see the same behavior: when I crash the primary, there is
no promotion until the primary comes back online; only then is the
secondary smoothly promoted and the primary demoted.

Getting each component of this stack configured and running has involved
substantial challenges with compatibility, documentation, integration bugs,
and so on.

I see other people reporting problems similar to mine.  Is there a general
approach to this, or do I just need a nudge in a new direction?
Specifically:

* Should I continue to focus on the existing Pacemaker configuration?
Perhaps there is a hidden or missing order/colocation constraint or score
that is causing this behavior?  (A sketch of my resource and constraint
layout is below the list.)
* Should I dig harder at the DRBD configuration?  Is it something about the
fencing scripts?  (My fencing handler config is also sketched below.)
* Should I try stripping this back down to something more basic?  Can I get
reliable failover without STONITH, SBD and an odd-numbered quorum?
* It seems possible that moving to DRBD 9.x would take some of the problem
off of Pacemaker altogether, since it apparently has built-in failover.  Is
that an easier win?
* Should I go to another stack?  I'm trying to stay within LTS releases for
stability, but perhaps I would get better integration with RHEL 7, CentOS
7, a newer (non-LTS) Ubuntu release, or some other distribution?
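
For the first question, here is roughly the resource and constraint shape I
have, with names as examples (an ms DRBD resource with a filesystem on top):

    # DRBD resource and its master/slave wrapper
    pcs resource create drbd-data ocf:linbit:drbd drbd_resource=r0 \
        op monitor interval=29s role=Master \
        op monitor interval=31s role=Slave
    pcs resource master drbd-master drbd-data \
        master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 notify=true

    # filesystem that follows the master
    pcs resource create fs-data ocf:heartbeat:Filesystem \
        device=/dev/drbd0 directory=/srv/data fstype=ext4
    pcs constraint colocation add fs-data with master drbd-master INFINITY
    pcs constraint order promote drbd-master then start fs-data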
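
For the second question, the fencing scripts I'm referring to are the
standard crm-fence-peer pair in the DRBD resource config (again just a
sketch; r0 is a placeholder):

    # /etc/drbd.d/r0.res (excerpt)
    resource r0 {
        disk {
            fencing resource-only;
        }
        handlers {
            fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
            after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
        }
    }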

Thank you for your consideration!