[ClusterLabs] [External] : Re: Fence Agent tests
Jehan-Guillaume de Rorthais
jgdr at dalibo.com
Sat Nov 5 16:44:47 EDT 2022
On Sat, 5 Nov 2022 20:53:09 +0100
Valentin Vidić via Users <users at clusterlabs.org> wrote:
> On Sat, Nov 05, 2022 at 06:47:59PM +0000, Robert Hayden wrote:
> > That was my impression as well...so I may have something wrong. My
> > expectation was that SBD daemon should be writing to the /dev/watchdog
> > within 20 seconds and the kernel watchdog would self fence.
>
> I don't see anything unusual in the config except that pacemaker mode is
> also enabled. This means that the cluster is providing signal for sbd even
> when the storage device is down, for example:
>
> 883 ? SL 0:00 sbd: inquisitor
> 892 ? SL 0:00 \_ sbd: watcher: /dev/vdb1 - slot: 0 - uuid: ...
> 893 ? SL 0:00 \_ sbd: watcher: Pacemaker
> 894 ? SL 0:00 \_ sbd: watcher: Cluster
>
> You can strace different sbd processes to see what they are doing at any
> point.
I suspect both watchers should detect the loss of network/communication with
the other node.
BUT, when sbd is in Pacemaker mode, it doesn't reset the node if the
local **Pacemaker** is still quorate (via corosync). See the full chapter:
«If Pacemaker integration is activated, SBD will not self-fence if **device**
majority is lost [...]»
https://documentation.suse.com/sle-ha/15-SP4/html/SLE-HA-all/cha-ha-storage-protect.html
Could it be that no node is shutting down because the cluster is in
two-node mode? In this mode, both nodes would keep quorum, each expecting
fencing to kill the other one... Except there's no active fencing here, only
"self-fencing".
To verify this guess, check the corosync conf for the "two_node" parameter, and
check whether both nodes still report as quorate during the network outage using:
corosync-quorumtool -s
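For the "two_node" part of the check, here is a minimal sketch. It assumes the configuration lives in the usual /etc/corosync/corosync.conf and that the option is written as a plain "two_node: 1" line in the quorum section. The helper name two_node_enabled is made up for this example; it is not part of corosync.

```shell
# Hypothetical helper (not a corosync tool): succeeds if "two_node: 1"
# is set in the given corosync config file.
# The default path is an assumption; adjust it for your distribution.
#
# The stanza being looked for typically resembles:
#   quorum {
#       provider: corosync_votequorum
#       two_node: 1
#   }
two_node_enabled() {
    conf="${1:-/etc/corosync/corosync.conf}"
    # Match the option on its own line, ignoring indentation.
    # Commented-out lines ("# two_node: 1") do not match.
    grep -Eq '^[[:space:]]*two_node:[[:space:]]*1[[:space:]]*$' "$conf"
}
```

Run `two_node_enabled && echo "two_node is set"` on each node, then compare with the quorum state reported by corosync-quorumtool -s while the network is cut.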
If this turns out to be a good guess, then without **active** fencing, I suppose a
cluster cannot rely on the two-node mode. I'm not sure what would be the best setup
though.
Regards,