<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
</head>
<body text="#000000" bgcolor="#FFFFFF">
<div class="moz-cite-prefix">On 1/6/20 8:40 AM, Jerry Kross wrote:<br>
</div>
<blockquote type="cite"
cite="mid:CAJ4ao1VrhtcT1-tYUAgf_6zthw+0UdKx5Afk1f8z_cVTssmnjg@mail.gmail.com">
<div dir="ltr">Hi Klaus,
<div>Wishing you a great 2020!</div>
</div>
</blockquote>
Same to you!<br>
<blockquote type="cite"
cite="mid:CAJ4ao1VrhtcT1-tYUAgf_6zthw+0UdKx5Afk1f8z_cVTssmnjg@mail.gmail.com">
<div dir="ltr">
<div>We're using 3 SBD disks with pacemaker integration. It just
happened once, and I am able to reproduce the latency error
messages in the system log by inducing a network delay in the
VM that hosts the SBD disks. These are the only messages that
were logged before the VM restarted.</div>
</div>
</blockquote>
You mean you can reproduce the latency messages, but they don't<br>
trigger a reboot - right?<br>
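If you want to cross-check the timeouts behind that 36 s threshold<br>
(it should correspond to the watchdog-timeout sbd is running with),<br>
you can dump what is recorded in the disk headers - the device path<br>
below is just a placeholder:<br>
<pre>
# print the header metadata sbd wrote to the device,
# including the watchdog- and msgwait-timeouts
sbd -d /dev/disk/by-id/my-sbd-disk dump
</pre>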
<blockquote type="cite"
cite="mid:CAJ4ao1VrhtcT1-tYUAgf_6zthw+0UdKx5Afk1f8z_cVTssmnjg@mail.gmail.com">
<div dir="ltr">
<div>The SBD documentation, <a
href="https://www.mankier.com/8/sbd" moz-do-not-send="true">https://www.mankier.com/8/sbd</a>,
says that having 1 SBD disk does not introduce a single
point of failure. I also tested this configuration by
offlining a disk, and pacemaker worked just fine. In your
experience, is it safe to run the cluster with one SBD disk?
This is a 2-node HANA database cluster, where one node is
primary. The data is replicated using the native database
tools, so there's no shared DB storage, and a split-brain
scenario is less likely to occur because the secondary
database does not accept any writes.</div>
</div>
</blockquote>
When set up properly, so that a node reboots if it loses<br>
its pacemaker partner and the disk at the same time, a 2-node<br>
cluster with SBD and a single disk should be safe to operate.<br>
As you already pointed out, the disk isn't a SPOF, as a node will<br>
still provide service as long as it sees the partner.<br>
Stating the obvious: using just a single disk with pacemaker<br>
integration doesn't raise the risk of split-brain, but rather<br>
the risk of an unneeded node reboot. So if your setup is<br>
likely to, e.g., lose the connection between the partner nodes<br>
and to the disk simultaneously, it may be interesting to have<br>
something like 3 disks at 3 sites, or to step away from the<br>
2-node config in corosync in favor of real quorum using<br>
qdevice.<br>
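Roughly, the qdevice variant would look like the sketch below in<br>
corosync.conf (the qnetd host is a placeholder, and the<br>
"two_node: 1" flag would be dropped from the quorum section):<br>
<pre>
quorum {
    provider: corosync_votequorum
    # no "two_node: 1" here - the qdevice supplies the third vote
    device {
        model: net
        net {
            host: qnetd.example.com   # placeholder: host running corosync-qnetd
            algorithm: ffsplit        # deterministic tie-breaking for the 2 nodes
        }
    }
}
</pre>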
I'm not very familiar with HANA-specific issues, though.<br>
<br>
Klaus<br>
<blockquote type="cite"
cite="mid:CAJ4ao1VrhtcT1-tYUAgf_6zthw+0UdKx5Afk1f8z_cVTssmnjg@mail.gmail.com">
<div dir="ltr">
<div>Regards,</div>
<div>JK</div>
<div><br>
</div>
</div>
<br>
<div class="gmail_quote">
<div dir="ltr" class="gmail_attr">On Thu, Jan 2, 2020 at 6:35 PM
Klaus Wenninger <<a href="mailto:kwenning@redhat.com"
moz-do-not-send="true">kwenning@redhat.com</a>> wrote:<br>
</div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px
0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">On
12/26/19 9:27 AM, Roger Zhou wrote:<br>
> On 12/24/19 11:48 AM, Jerry Kross wrote:<br>
>> Hi,<br>
>> The pacemaker cluster manages a 2-node database cluster configured<br>
>> to use 3 iSCSI disk targets in its stonith configuration. The<br>
>> pacemaker cluster was put in maintenance mode, but we see SBD<br>
>> writing to the system logs, and just after these logs the<br>
>> production node was restarted.<br>
>> Log:<br>
>> sbd[5955]: warning: inquisitor_child: Latency: No liveness for 37 s<br>
>> exceeds threshold of 36 s (healthy servants: 1)<br>
>> I see these messages logged, and then the node was restarted. I<br>
>> suspect it was the softdog module that restarted the node, but I<br>
>> don't see it in the logs.<br>
Just to understand your config ...<br>
You are using 3 block-devices with quorum amongst each other
without<br>
pacemaker-integration - right?<br>
It might be that the disk watchers are hanging on some I/O, so<br>
we don't see any logs from them.<br>
Did that happen just once or can you reproduce the issue?<br>
If you are not using pacemaker-integration so far, that might be a<br>
way to increase reliability. (If it sees the other node, sbd would<br>
be content without getting a response from the disks.) Of course it<br>
depends on your distribution and sbd version whether that would be<br>
supported with a 2-node cluster (or at all). sbd, e.g., would have<br>
to have at least<br>
<a
href="https://github.com/ClusterLabs/sbd/commit/4bd0a66da3ac9c9afaeb8a2468cdd3ed51ad3377"
rel="noreferrer" target="_blank" moz-do-not-send="true">https://github.com/ClusterLabs/sbd/commit/4bd0a66da3ac9c9afaeb8a2468cdd3ed51ad3377</a><br>
<br>
Klaus <br>
> sbd is too critical to share the I/O path with others.<br>
><br>
> Very likely the workload is too heavy, the iSCSI connections break,<br>
> and sbd loses access to the disks; sbd then uses sysrq 'b' to reboot<br>
> the node brutally and immediately.<br>
><br>
> Regarding a watchdog reboot: it kicks in when sbd is not able to<br>
> tickle the watchdog in time, e.g. when sbd is starved of CPU or has<br>
> crashed. It is crucial too, but not likely the case here.<br>
><br>
> Merry X'mas and Happy New Year!<br>
> Roger<br>
<br>
_______________________________________________<br>
Manage your subscription:<br>
<a href="https://lists.clusterlabs.org/mailman/listinfo/users"
rel="noreferrer" target="_blank" moz-do-not-send="true">https://lists.clusterlabs.org/mailman/listinfo/users</a><br>
<br>
ClusterLabs home: <a href="https://www.clusterlabs.org/"
rel="noreferrer" target="_blank" moz-do-not-send="true">https://www.clusterlabs.org/</a></blockquote>
</div>
</blockquote>
<br>
</body>
</html>