<html>

  <head>


    <meta http-equiv="content-type" content="text/html; charset=utf-8">

  </head>

  <body text="#000000" bgcolor="#FFFFFF">

    Hello,<br>

    <br>

    It's my first post on this mailing list, so please excuse any rookie

    mistakes I may make in this thread.<br>

    <br>

    We currently have clusters deployed using corosync/pacemaker that

    manage DRBD + a couple of systemd services.<br>

    <br>

    My colleague Derek previously emailed the list about a split-brain

    problem with this setup, but he has since left the company:<br>

<a class="moz-txt-link-freetext" href="http://lists.clusterlabs.org/pipermail/users/2017-November/006796.html">http://lists.clusterlabs.org/pipermail/users/2017-November/006796.html</a><br>

    <br>

    I'm hoping to continue his work in order to fix it once and for all.<br>

    <br>

    I went through the questions and answers in that thread and have

    managed to narrow the problem down to the following:<br>

    - If I reboot the server that is running as the primary (DRBD +

    systemd resources started), then once the reboot completes, there is

    a split-brain<br>

    - If I stop pacemaker (systemctl stop pacemaker), then reboot that

    primary server, then it comes back online without any issues and no

    split-brain<br>

    - If I reboot the server that doesn't have the running resources,

    all goes well<br>

    <br>

    Following those observations, my guess is that the way the pacemaker

    services are being stopped during a systemd shutdown is causing

    issues.<br>

    It seems that pacemaker isn't stopping the systemd resources in that

    case, and thus isn't unmounting the DRBD partition and demoting it to

    secondary before DRBD is stopped, which results in the split-brain.<br>

    <br>

    Here is the interesting bit I found in the logs:<br>

    <font size="-1" face="Courier New, Courier, monospace">Dec 13

      14:09:40 act-pass-2 lrmd[1133]:    error: Could not connect to

      System DBus: Did not receive a reply. Possible causes include: the

      remote application did not send a reply, the message bus security

      policy blocked the reply, the reply timeout expired, or the

      network connection was broken.<br>

      Dec 13 14:09:40 act-pass-2 lrmd[1133]:    error:

      systemd_unit_exec: Triggered fatal assert at systemd.c:730 :

      systemd_init()<br>

      Dec 13 14:09:40 act-pass-2 pacemakerd[1083]:    error: Managed

      process 1133 (lrmd) dumped core<br>

      Dec 13 14:09:40 act-pass-2 pacemakerd[1083]:    error: The lrmd

      process (1133) terminated with signal 6 (core=1)</font><br>

    <br>

    And here is a pastebin of the full journald output during the shutdown:<br>

    <a class="moz-txt-link-freetext" href="https://pastebin.com/CB38BiwC">https://pastebin.com/CB38BiwC</a><br>
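    <br>

    In case it helps anyone reading through that output, below is a rough

    Python sketch for pulling out only the error-level lines logged by the

    pacemaker daemons. It assumes the lines look like the excerpt above;

    the function name, the default daemon list and the file name in the

    usage comment are my own placeholders, not anything shipped with

    pacemaker.<br>

    <pre>
```python
import re

# Assumed line shape (taken from the excerpt above):
#   "Dec 13 14:09:40 act-pass-2 lrmd[1133]:    error: some message"
# Groups: 1=timestamp, 2=host, 3=process, 4=pid, 5=level, 6=message
LINE_RE = re.compile(
    r"^(\w{3}\s+\d+\s+[\d:]{8})\s+(\S+)\s+([\w.-]+)\[(\d+)\]:\s+(\w+):\s+(.*)$"
)


def pacemaker_errors(lines, daemons=("lrmd", "pacemakerd", "crmd")):
    """Return (timestamp, daemon, message) for error-level lines.

    Hypothetical helper: only lines from the listed daemons whose level
    is exactly "error" are kept; everything else is skipped.
    """
    found = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # not in the assumed "daemon[pid]: level: msg" shape
        stamp, _host, proc, _pid, level, msg = match.groups()
        if proc in daemons and level == "error":
            found.append((stamp, proc, msg))
    return found


# Usage (the file name is a placeholder for a saved copy of the journal):
# with open("shutdown-journal.txt") as fh:
#     for stamp, daemon, msg in pacemaker_errors(fh):
#         print(stamp, daemon, msg)
```
</pre>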

    <br>

    I'm not sure where to go from here. It may be a missing dependency on

    another systemd unit, but it looks more like a problem connecting to

    systemd itself (over DBus) to stop the systemd resources of the

    cluster, presumably because systemd no longer accepts commands while

    it is shutting down (that's a wild guess). At this point, this goes

    beyond my knowledge of systemd, so I'd like some guidance on any

    required adjustments or further troubleshooting.<br>

    <br>

    Best Regards,<br>

    <br>

    <pre class="moz-signature" cols="72">-- 

Julien Semaan

<a class="moz-txt-link-abbreviated" href="mailto:jsemaan@inverse.ca">jsemaan@inverse.ca</a>  ::  +1 (866) 353-6153 *155  ::  <a class="moz-txt-link-abbreviated" href="http://www.inverse.ca">www.inverse.ca</a>

Inverse inc. :: Leaders behind SOGo (<a class="moz-txt-link-abbreviated" href="http://www.sogo.nu">www.sogo.nu</a>) and PacketFence (<a class="moz-txt-link-abbreviated" href="http://www.packetfence.org">www.packetfence.org</a>) </pre>

  </body>

</html>