[ClusterLabs] SBD restarted the node while pacemaker in maintenance mode

Jerry Kross jerrykross1982 at gmail.com
Mon Jan 6 02:40:47 EST 2020


Hi Klaus,
Wishing you a great 2020!
We're using 3 SBD disks with pacemaker integration. It has happened just
once, but I am able to reproduce the latency error messages in the system
log by inducing a network delay in the VM that hosts the SBD disks. These
were the only messages logged before the VM restarted.
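For reference, this is roughly how I reproduce it on the VM that serves the
iSCSI SBD targets (the interface name and delay value below are just what I
used, adjust as needed):

    # add artificial latency on the iSCSI-facing interface
    tc qdisc add dev eth0 root netem delay 1000ms
    # ...wait for the sbd latency warnings in the system log, then remove it
    tc qdisc del dev eth0 root netem
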
From the SBD documentation (https://www.mankier.com/8/sbd), it says that
having 1 SBD disk does not introduce a single point of failure. I also
tested this configuration by taking a disk offline, and pacemaker worked
just fine. From your experience, is it safe to run the cluster with one
SBD disk? This is a 2-node HANA database cluster, where one node is
primary. The data is replicated using the native database tools, so there
is no shared DB storage, and a split-brain scenario is less likely to
occur because the secondary database does not accept any writes.
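
For completeness, the single-disk variant I tested is just our normal sbd
sysconfig with one device left in SBD_DEVICE, roughly like this (the file
is /etc/sysconfig/sbd on our distribution, and the device path is only a
placeholder):

    # /etc/sysconfig/sbd (excerpt) - single-device sketch
    SBD_DEVICE="/dev/disk/by-id/scsi-sbd-lun-1"
    SBD_PACEMAKER=yes
    SBD_WATCHDOG_DEV=/dev/watchdog
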
Regards,
JK


On Thu, Jan 2, 2020 at 6:35 PM Klaus Wenninger <kwenning at redhat.com> wrote:

> On 12/26/19 9:27 AM, Roger Zhou wrote:
> > On 12/24/19 11:48 AM, Jerry Kross wrote:
> >> Hi,
> >> The pacemaker cluster manages a 2-node database cluster configured to
> >> use 3 iscsi disk targets in its stonith configuration. The pacemaker
> >> cluster was put in maintenance mode but we see SBD writing to the
> >> system logs. And just after these logs, the production node was
> >> restarted.
> >> Log:
> >> sbd[5955]:  warning: inquisitor_child: Latency: No liveness for 37 s
> >> exceeds threshold of 36 s (healthy servants: 1)
> >> I see these messages logged and then the node was restarted. I suspect
> >> it was the softdog module that restarted the node, but I don't see it
> >> in the logs.
> Just to understand your config ...
> You are using 3 block-devices with quorum amongst each other without
> pacemaker-integration - right?
> Might be that the disk-watchers are hanging on some io so that
> we don't see any logs from them.
> Did that happen just once or can you reproduce the issue?
> If you are not using pacemaker-integration so far, that might be a way
> to increase reliability. (If it sees the other node, sbd would be content
> without getting a response from the disks.) Of course it depends on your
> distribution and sbd-version whether that would be supported with a
> 2-node-cluster (or at all). sbd would, e.g., have to include at least
>
> https://github.com/ClusterLabs/sbd/commit/4bd0a66da3ac9c9afaeb8a2468cdd3ed51ad3377
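> (By 2-node-cluster I mean corosync running with something like
>
>     quorum {
>         provider: corosync_votequorum
>         two_node: 1
>     }
>
> in corosync.conf - just so we are talking about the same setup.)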
>
> Klaus
> > sbd is too critical to share the io path with others.
> >
> > Very likely, the workload is too heavy, the iscsi connections are
> > broken and sbd loses access to the disks; sbd then uses sysrq 'b' to
> > reboot the node brutally and immediately.
> >
> > Regarding the watchdog reboot: it kicks in when sbd is not able to
> > tickle the watchdog in time, e.g. when sbd is starved of CPU or has
> > crashed. It is crucial too, but not likely the case here.
> >
> > Merry X'mas and Happy New Year!
> > Roger
> >