[ClusterLabs] SBD & Failed Peer

Jorge Fábregas jorge.fabregas at gmail.com
Sun Sep 6 21:28:21 EDT 2015


On 09/06/2015 04:23 PM, Jorge Fábregas wrote:
> Assume an active/active cluster using OCFS2 and SBD with shared storage.
> Then one node explodes (the hardware watchdog is gone as well
> obviously).  

OK, I did two tests with this setup in my KVM lab (one with SBD and
shared storage, the other with hypervisor-based STONITH via
external/libvirt), while actively writing to an OCFS2 filesystem.


## SBD with shared-storage

I shut off one node abruptly (VM power-off). Result: DLM/OCFS2 blocked
for about 30 to 40 seconds and then resumed.  That's nice!  I think at
that moment (when resuming) the assumptions were:

- if the peer were alive, it would have swallowed the poison pill we
  just placed
- if the peer is frozen, the watchdog will have taken care of it
- we just wait a little extra before continuing...

(I don't actually know whether checking when your partner last updated
its slot on the SBD disk is part of the SBD daemon's job.)
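
For what it's worth, the stall should roughly track the msgwait timeout
written on the SBD device (plus DLM recovery time).  A minimal sketch of
how to check it; the device path below is just a placeholder for your
actual SBD partition:

  # Placeholder device path; substitute your real SBD partition.
  SBD_DEV=/dev/disk/by-id/scsi-sbd-lun-part1

  # Dump the on-disk header: watchdog, allocate, loop and msgwait timeouts.
  sbd -d "$SBD_DEV" dump

  # List each node's slot and the last message written to it
  # (this is also where a poison pill like "reset" or "off" shows up).
  sbd -d "$SBD_DEV" list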


## External/Libvirt

I shut off one node but also disabled SSH on the KVM host (so that
fencing via qemu+ssh couldn't work).  Result: it blocked forever.
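
For reference, the fencing resource was a plain external/libvirt
primitive, roughly like the sketch below (crm shell syntax; node and
host names are hypothetical).  The point is that the hypervisor_uri is
the only fencing path, so once SSH to the KVM host is gone the fence
can never complete and DLM stays blocked:

  primitive stonith-libvirt stonith:external/libvirt \
      params hostlist="node1,node2" \
             hypervisor_uri="qemu+ssh://kvm-host/system" \
      op monitor interval="60"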

Am I right in thinking that SBD is the way to go when using OCFS2
filesystems (compared to hypervisor-based fencing or management boards
like iLO, DRAC, etc.)?

Now, the only thing I don't like about SBD is that when it loses contact
with the shared disk, both nodes commit suicide.  I found out there's
the "-P" option to sbd (which is supposed to prevent that as long as
there's still cluster communication), but it doesn't work in my SLES 11
SP4 setup.  Maybe it works on SLES 12.
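
In case it helps someone else poke at this: on SLES the option should
just be appended to SBD_OPTS in /etc/sysconfig/sbd, something like the
sketch below (device path is a placeholder), and sbd only rereads the
file when the cluster stack on that node is restarted:

  # /etc/sysconfig/sbd  (sketch; device path is a placeholder)
  SBD_DEVICE="/dev/disk/by-id/scsi-sbd-lun-part1"
  # -W: use the hardware watchdog
  # -P: don't self-fence on a lost SBD disk while the node is still
  #     seen as a healthy, quorate member by Pacemaker/corosync
  SBD_OPTS="-W -P"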


Thanks,
Jorge



