[ClusterLabs] SBD restarted the node while pacemaker in maintenance mode
Klaus Wenninger
kwenning at redhat.com
Tue Jan 7 05:20:40 EST 2020
On 1/6/20 8:40 AM, Jerry Kross wrote:
> Hi Klaus,
> Wishing you a great 2020!
Same to you!
> We're using 3 SBD disks with pacemaker integration. It happened just
> once, and I am able to reproduce the latency error messages in the
> system log by inducing a network delay in the VM that hosts the SBD
> disks. These are the only messages that were logged before the VM
> restarted.
You mean you can reproduce the latency messages but they don't
trigger a reboot - right?
> From the SBD documentation, https://www.mankier.com/8/sbd, it says
> that having 1 SBD disk does not introduce a single point of failure.
> I also tested this configuration by offlining a disk, and pacemaker
> worked just fine. From your experience, is it safe to run the cluster
> with one SBD disk? This is a 2-node Hana database cluster, where one
> node is primary. The data is replicated using the native database
> tools. So there's no shared DB storage, and a split-brain scenario is
> less likely to occur, because the secondary database does not accept
> any writes.
When set up properly, so that a node reboots if it loses
its pacemaker partner and the disk at the same time, a 2-node
cluster with SBD and a single disk should be safe to operate.
As you already pointed out, the disk isn't a SPOF, as a node will
still provide service as long as it sees its partner.
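As a rough sketch of what such a single-disk setup typically looks
like (the device path and timeout below are just placeholders, adjust
to your distribution):

  # /etc/sysconfig/sbd (or /etc/default/sbd, depending on distribution)
  # single shared disk - placeholder device name
  SBD_DEVICE="/dev/disk/by-id/<your-sbd-disk>"
  # pacemaker integration: node stays up on disk loss as long as it
  # still sees its partner
  SBD_PACEMAKER=yes
  SBD_WATCHDOG_DEV=/dev/watchdog
  SBD_WATCHDOG_TIMEOUT=5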
Stating the obvious: using just a single disk with pacemaker
integration doesn't raise the risk of split-brain but rather
raises the risk of an unneeded node reboot. So if your setup
is likely to e.g. lose the connection between the
partner nodes and to the disk simultaneously, it may
be interesting to have something like 3 disks at 3 sites, or to
step away from the 2-node config in corosync in favor of real
quorum using qdevice.
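For the qdevice variant, the quorum section of corosync.conf would
look roughly like the sketch below (host name and algorithm are
placeholders for illustration only):

  quorum {
      provider: corosync_votequorum
      device {
          model: net
          net {
              # third site running corosync-qnetd
              host: qnetd-server.example.com
              algorithm: ffsplit
          }
      }
  }
  # note: drop "two_node: 1" once the qdevice provides the extra vote

With ffsplit, only one side of an even split gets the qdevice's vote,
so only one node keeps quorum.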
I'm not very familiar with Hana-specific issues though.
Klaus
> Regards,
> JK
>
>
> On Thu, Jan 2, 2020 at 6:35 PM Klaus Wenninger
> <kwenning at redhat.com> wrote:
>
> On 12/26/19 9:27 AM, Roger Zhou wrote:
> > On 12/24/19 11:48 AM, Jerry Kross wrote:
> >> Hi,
> >> The pacemaker cluster manages a 2-node database cluster configured
> >> to use 3 iscsi disk targets in its stonith configuration. The
> >> pacemaker cluster was put in maintenance mode, but we see SBD
> >> writing to the system logs. And just after these logs, the
> >> production node was restarted.
> >> Log:
> >> sbd[5955]: warning: inquisitor_child: Latency: No liveness for
> >> 37 s exceeds threshold of 36 s (healthy servants: 1)
> >> I see these messages logged and then the node was restarted. I
> >> suspect it was the softdog module that restarted the node, but I
> >> don't see it in the logs.
> Just to understand your config ...
> You are using 3 block-devices with quorum amongst each other, without
> pacemaker-integration - right?
> Might be that the disk-watchers are hanging on some io, so that
> we don't see any logs from them.
> Did that happen just once, or can you reproduce the issue?
> If you are not using pacemaker-integration so far, that might be a
> way to increase reliability. (If it sees the other node, sbd would be
> content without getting a response from the disks.) Of course it
> depends on your distribution and sbd version whether that would be
> supported with a 2-node cluster (or at all). sbd e.g. would have to
> have at least
> https://github.com/ClusterLabs/sbd/commit/4bd0a66da3ac9c9afaeb8a2468cdd3ed51ad3377
>
> Klaus
> > sbd is too critical to share the io path with others.
> >
> > Very likely the workload is too heavy, the iscsi connections are
> > broken and sbd loses access to the disks; then sbd uses sysrq 'b' to
> > reboot the node brutally and immediately.
> >
> > Regarding the watchdog reboot: it kicks in when sbd is not able to
> > tickle the watchdog in time, e.g. when sbd starves for cpu or has
> > crashed. It is crucial too, but not likely the case here.
> >
> > Merry X'mas and Happy New Year!
> > Roger