[ClusterLabs] digression: Corosync watchdog experience
jpokorny at redhat.com
Fri Aug 10 12:26:11 UTC 2018
On 10/08/18 10:51 +0200, Ferenc Wágner wrote:
> Failure story for amusement: the blades expose a BMC watchdog device to
> the OS, which was picked up by Corosync. It seemed like a useful second
> line of defense in case fencing (BMC IPMI power) failed for any reason;
> I let it live and forgot about it. Months later, after a firmware
> upgrade the BMC had to be restarted, and the watchdog device ioctl
> blocked Corosync for a minute or so. Of course membership fell apart.
> Actually, across the full cluster, because the BMC restarts were
> performed back-to-back (I authorized a single restart only, but anyway).
> I leave the rest to your imagination. Fencing (STONITH) worked (with
> delays) until quorum dissolved entirely... after a couple of minutes, it
> was over. We spent the rest of the day picking up the pieces, then the
> next few trying to reproduce the perceived Corosync network outage
> during BMC reboots without the cluster stack running. Of course in
> total vain. Half a year later an independent investigation of sporadic
> small Corosync delays revealed the watchdog connection, then we disabled
> the feature. Don't use (poorly implemented) BMC watchdogs.
Thanks for sharing these lessons learned; it's good to be reminded how
far SPOF risks spread, even into seemingly improbable places. With a
software-only watchdog, for instance, the risks are substantially more
blatant, so while a standalone software watchdog cannot sensibly be
recommended, using one as a backup watchdog may not be such a bad idea
after all (loop over all configured watchdogs, each opened with
O_NONBLOCK, so that a single stuck device cannot wedge the process).