[ClusterLabs] DLM recovery stuck (digression: Corosync watchdog experience)

Fri Aug 10 08:51:29 UTC 2018

FeldHost™ Admin <admin at feldhost.cz> writes:

> rule of thumb is use separate dedicated network for corosync traffic.
> For ex. we use two corosync rings, first and active one on separate
> network card and switch, second passive one on team (bond) device vlan.

Hi,

That's fine in principle, but this is a bladecenter setting; we can't
really use separate network cards, as it's a single chassis at the end
of the day.  Besides, we've not encountered Corosync glitches.  The
Corosync virtual network is shared with the DLM traffic only and has 200
Mb/s bandwidth dedicated to it in the interface (BIOS) setup.

Failure story for amusement: the blades expose a BMC watchdog device to
the OS, which was picked up by Corosync.  It seemed like a useful second
line of defense in case fencing (BMC IPMI power) failed for any reason;
I let it live and forgot about it.  Months later, after a firmware
upgrade, the BMC had to be restarted, and the watchdog device ioctl
blocked Corosync for a minute or so.  Of course membership fell apart.
Actually, across the full cluster, because the BMC restarts were
performed back-to-back (I authorized a single restart only, but anyway).
I leave the rest to your imagination.  Fencing (STONITH) worked (with
delays) until quorum dissolved entirely... after a couple of minutes, it
was over.  We spent the rest of the day picking up the pieces, then the
next few trying to reproduce the perceived Corosync network outage
during BMC reboots without the cluster stack running.  All in vain, of
course.  Half a year later an independent investigation of sporadic
small Corosync delays revealed the watchdog connection, then we disabled
the feature.  Don't use (poorly implemented) BMC watchdogs.
-- 
Feri