<div dir="ltr"><div dir="ltr" class="m_17571116495335811m_6941607664370540376gmail_signature" data-smartmail="gmail_signature"><div dir="ltr"><div dir="ltr"></div></div></div><div><br></div><div>My primary objective is a two-node active/passive DRBD device that fails over automatically. A secondary objective is to use standard, stock, supported software distributions and repositories with as little customization as possible.</div><div><br></div><div>I'm using Ubuntu 18.04.3, plus the DRBD, Corosync and Pacemaker packages that are in the (LTS) repositories. DRBD's drbdadm reports version 8.9.10. Corosync is 2.4.3, and Pacemaker is 0.9.164.</div><div><br></div><div>For my test scenario, I have two nodes up and running; I reboot, disconnect or shut down one node, and the other node should take over after a delay. That's the scenario I want to cover: unexpected loss of a node. The application is supplementary and isn't life-safety or mission-critical, but its availability will be measured, and the goal is to stay above four nines (99.99%) of uptime annually.</div><div><br></div><div>The basic setup works: I can manually fail over by telling pcs to move my resource from one node to the other. However, if I reboot the primary node, the failover does not complete until the primary is back online. Occasionally these hard kills cause split-brain, which requires manual recovery.</div><div><br></div><div>I added STONITH and a watchdog using SBD with an iSCSI block device and softdog.
</div><div><br></div><div>I added a qdevice to get an odd number of quorum votes.</div><div><br></div><div>When I run crm_simulate on this configuration, the simulation says that if I take down the primary node, the resource will be promoted on the secondary.<br></div><div><br></div><div>And yet I still see the same behavior: when I crash the primary, there is no promotion until the primary comes back online, after which the secondary is smoothly promoted and the primary demoted.</div><div><br></div><div>Getting each component of this stack configured and running has been a substantial challenge with regard to compatibility, documentation, integration bugs, etc.</div><div><br></div><div>I see other people reporting problems similar to mine. Is there a general approach, or do I perhaps need a nudge in a new direction to tackle this issue?</div><div><br></div><div>* Should I continue to focus on the existing Pacemaker configuration? Perhaps there's some hidden or absent order/constraint/weighting that is causing this behavior?</div><div>* Should I dig harder at the DRBD configuration? Is it something about the fencing scripts?</div><div>* Should I try stripping this back down to something more basic? Can I have a reliable failover without STONITH, SBD and an odd-numbered quorum?<br></div><div>* It seems possible that moving to DRBD 9.x would take some of the problem off Pacemaker altogether, since it apparently has built-in failover. Is that an easier win?</div><div>* Should I go to another stack? I'm trying to work within LTS releases for stability, but perhaps I would get better integration with RHEL 7, CentOS 7, an edge release of Ubuntu, or some other distribution?</div><div><br></div><div>Thank you for your consideration!</div><div><br></div></div>