<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
</head>
<body text="#000000" bgcolor="#FFFFFF">
<div class="moz-cite-prefix">On 1/6/20 8:40 AM, Jerry Kross wrote:<br>
</div>
<blockquote type="cite"
cite="mid:CAJ4ao1VrhtcT1-tYUAgf_6zthw+0UdKx5Afk1f8z_cVTssmnjg@mail.gmail.com">
<div dir="ltr">Hi Klaus,
<div>Wishing you a great 2020!</div>
</div>
</blockquote>
Same to you!<br>
<blockquote type="cite"
cite="mid:CAJ4ao1VrhtcT1-tYUAgf_6zthw+0UdKx5Afk1f8z_cVTssmnjg@mail.gmail.com">
<div dir="ltr">
<div>We're using 3 SBD disks with pacemaker integration. It just
happened once, and I am able to reproduce the latency error
messages in the system log by inducing a network delay in the
VM that hosts the SBD disks. These are the only messages that
were logged before the VM restarted.</div>
</div>
</blockquote>
You mean you can reproduce the latency messages, but they don't<br>
trigger a reboot - right?<br>
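If you want to cross-check the timeouts behind that 36 s threshold<br>
(it should correspond to the watchdog-timeout sbd is running with),<br>
you can dump what is recorded in the disk headers - the device path<br>
below is just a placeholder:<br>
<pre>
# print the header metadata sbd wrote to the device,
# including the watchdog- and msgwait-timeouts
sbd -d /dev/disk/by-id/my-sbd-disk dump
</pre>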
<blockquote type="cite"
cite="mid:CAJ4ao1VrhtcT1-tYUAgf_6zthw+0UdKx5Afk1f8z_cVTssmnjg@mail.gmail.com">
<div dir="ltr">
<div>The SBD documentation, <a
href="https://www.mankier.com/8/sbd" moz-do-not-send="true">https://www.mankier.com/8/sbd</a>,
says that having 1 SBD disk does not introduce a single
point of failure. I also tested this configuration by
offlining a disk, and pacemaker worked just fine. In your
experience, is it safe to run the cluster with one SBD disk?
This is a 2-node HANA database cluster, where one node is
primary. The data is replicated using the native database
tools, so there's no shared DB storage, and a split-brain
scenario is less likely to occur because the secondary
database does not accept any writes.</div>
</div>
</blockquote>
When set up properly, so that a node reboots if it loses<br>
its pacemaker partner and the disk at the same time, a 2-node<br>
cluster with SBD and a single disk should be safe to operate.<br>
As you already pointed out, the disk isn't a SPOF, as a node will<br>
still provide service as long as it sees the partner.<br>
Stating the obvious: using just a single disk with pacemaker<br>
integration doesn't raise the risk of split-brain, but rather<br>
the risk of an unneeded node reboot. So if your setup is<br>
likely to, e.g., lose the connection between the partner nodes<br>
and to the disk simultaneously, it may be interesting to have<br>
something like 3 disks at 3 sites, or to step away from the<br>
2-node config in corosync in favor of real quorum using<br>
qdevice.<br>
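Roughly, the qdevice variant would look like the sketch below in<br>
corosync.conf (the qnetd host is a placeholder, and the<br>
"two_node: 1" flag would be dropped from the quorum section):<br>
<pre>
quorum {
    provider: corosync_votequorum
    # no "two_node: 1" here - the qdevice supplies the third vote
    device {
        model: net
        net {
            host: qnetd.example.com   # placeholder: host running corosync-qnetd
            algorithm: ffsplit        # deterministic tie-breaking for the 2 nodes
        }
    }
}
</pre>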
I'm not very familiar with HANA-specific issues, though.<br>
<br>
Klaus<br>
<blockquote type="cite"
cite="mid:CAJ4ao1VrhtcT1-tYUAgf_6zthw+0UdKx5Afk1f8z_cVTssmnjg@mail.gmail.com">
<div dir="ltr">
<div>Regards,</div>
<div>JK</div>
<div><br>
</div>
</div>
<br>
<div class="gmail_quote">
<div dir="ltr" class="gmail_attr">On Thu, Jan 2, 2020 at 6:35 PM
Klaus Wenninger <<a href="mailto:kwenning@redhat.com"
moz-do-not-send="true">kwenning@redhat.com</a>> wrote:<br>
</div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px
0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">On
12/26/19 9:27 AM, Roger Zhou wrote:<br>
> On 12/24/19 11:48 AM, Jerry Kross wrote:<br>
>> Hi,<br>
>> The pacemaker cluster manages a 2-node database cluster configured<br>
>> to use 3 iSCSI disk targets in its stonith configuration. The<br>
>> pacemaker cluster was put in maintenance mode, but we see SBD<br>
>> writing to the system logs, and just after these logs the<br>
>> production node was restarted.<br>
>> Log:<br>
>> sbd[5955]: warning: inquisitor_child: Latency: No liveness for 37 s<br>
>> exceeds threshold of 36 s (healthy servants: 1)<br>
>> I see these messages logged, and then the node was restarted. I<br>
>> suspect it was the softdog module that restarted the node, but I<br>
>> don't see it in the logs.<br>
Just to understand your config ...<br>
You are using 3 block-devices with quorum amongst each other
without<br>
pacemaker-integration - right?<br>
It might be that the disk watchers are hanging on some I/O, so<br>
we don't see any logs from them.<br>
Did that happen just once or can you reproduce the issue?<br>
If you are not using pacemaker-integration so far, that might be a<br>
way to increase reliability. (If it sees the other node, sbd would<br>
be content without getting a response from the disks.) Of course it<br>
depends on your distribution and sbd version whether that would be<br>
supported with a 2-node cluster (or at all). sbd, e.g., would have<br>
to have at least<br>
<a
href="https://github.com/ClusterLabs/sbd/commit/4bd0a66da3ac9c9afaeb8a2468cdd3ed51ad3377"
rel="noreferrer" target="_blank" moz-do-not-send="true">https://github.com/ClusterLabs/sbd/commit/4bd0a66da3ac9c9afaeb8a2468cdd3ed51ad3377</a><br>
<br>
Klaus <br>
> sbd is too critical to share the I/O path with others.<br>
><br>
> Very likely the workload is too heavy, the iSCSI connections break,<br>
> and sbd loses access to the disks; sbd then uses sysrq 'b' to reboot<br>
> the node brutally and immediately.<br>
><br>
> Regarding a watchdog reboot: it kicks in when sbd is not able to<br>
> tickle the watchdog in time, e.g. when sbd is starved of CPU or has<br>
> crashed. It is crucial too, but not likely the case here.<br>
><br>
> Merry X'mas and Happy New Year!<br>
> Roger<br>
<br>
_______________________________________________<br>
Manage your subscription:<br>
<a href="https://lists.clusterlabs.org/mailman/listinfo/users"
rel="noreferrer" target="_blank" moz-do-not-send="true">https://lists.clusterlabs.org/mailman/listinfo/users</a><br>
<br>
ClusterLabs home: <a href="https://www.clusterlabs.org/"
rel="noreferrer" target="_blank" moz-do-not-send="true">https://www.clusterlabs.org/</a></blockquote>
</div>
</blockquote>
<br>
</body>
</html>