[ClusterLabs] Antw: Re: Antw: Re: Antw: [EXT] Re: Q: sbd: Which parameter controls "error: servant_md: slot read failed in servant."?

Ulrich Windl Ulrich.Windl at rz.uni-regensburg.de
Thu Feb 17 09:25:27 EST 2022

>>> Klaus Wenninger <kwenning at redhat.com> wrote on 17.02.2022 at 13:34 in
<CALrDAo3C9mOG1F2Pk2xtcwk_b399f9XgY=wH8+OqDc4c5Le4iw at mail.gmail.com>:
> But feedback is welcome so that we can do a little tweaking that makes them
> fit for a larger audience.
> Remember a case where devices stalled for 50s during a firmware-update
> shouldn't trigger fencing - definitely a case that can't be covered by
> defaults.

It all depends: Say your service cannot stand a delay of 50s caused by some "disturbance". Will fencing (let's say after 40 seconds) then bring the service up again before the 50 seconds have elapsed? Or even worse: When the fenced node is needed for the service, will it have completed its reboot and be running the cluster services before the 50 seconds have elapsed?
I'm afraid for most services the answer will be "no".
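
To put rough numbers on that question, here is a minimal sketch in Python. The function name and the 60 second recovery time are made up for illustration; nothing here is taken from sbd or pacemaker. It also assumes the stall duration is known (which in reality it never is) and that recovery after fencing is not itself blocked by the stall.

```python
def fencing_pays_off(stall_s: float, fence_after_s: float, recovery_s: float) -> bool:
    """Return True if fencing restores service earlier than just waiting.

    stall_s       -- how long the disturbance actually lasts (unknown in practice)
    fence_after_s -- how long we wait before fencing (e.g. 40)
    recovery_s    -- assumed time from fencing until the service runs again
    """
    if stall_s <= fence_after_s:
        # The disturbance ends before fencing would even be triggered.
        return False
    # Downtime with fencing is the wait plus the recovery; without it, the stall.
    return fence_after_s + recovery_s < stall_s


# 50 s stall, fencing after 40 s, 60 s (assumed) to recover: waiting wins.
print(fencing_pays_off(50, 40, 60))    # False
# A stall of one hour: fencing wins.
print(fencing_pays_off(3600, 40, 60))  # True
```

With a 40 second wait, fencing only pays off here if recovery takes less than 10 seconds or the stall lasts much longer than 50 seconds.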

However, after having waited 40 seconds already, it is possible that things will continue after 50 seconds altogether, or it could be that you'll have to wait for hours.
Unfortunately no software can predict what will happen in the future, so there's a likelihood that fencing now is better than continuing to wait. However (for the same reason), there's no guarantee that fencing will improve the situation (the shared storage system may still be "pausing", and no node can continue).
It's all a bit like hard-mounting NFS exported from an HA-NFS server: You hope the server will be operating soon while waiting, even though in some cases it would be better to get an I/O error to the application, allowing it to react... It all depends...

(on the dmsetup example providing a bad disk)
The example was not intended for sbd; it was for a program of mine that had to deal with read/write errors.
For sbd, dm-flakey (dm-dust) or dm-delay might be better, but those are still very deterministic (unless updated via "dmsetup message").
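
As a rough illustration of why these targets are deterministic, here is a sketch in Python that only builds the table lines for dm-delay and dm-flakey; it does not call dmsetup. The helper names, the device path /dev/sdX and the sector count are hypothetical, and the field order follows the kernel's device-mapper documentation as I understand it, so please check it against your kernel before use.

```python
def delay_table(dev: str, sectors: int, delay_ms: int, offset: int = 0) -> str:
    """Table line for dm-delay: every I/O is delayed by a fixed delay_ms."""
    return f"0 {sectors} delay {dev} {offset} {delay_ms}"


def flakey_table(dev: str, sectors: int, up_s: int, down_s: int, offset: int = 0) -> str:
    """Table line for dm-flakey: the device works for up_s seconds,
    then fails I/O for down_s seconds, repeating in a fixed cycle."""
    return f"0 {sectors} flakey {dev} {offset} {up_s} {down_s}"


# Hypothetical device of 2097152 sectors (1 GiB with 512-byte sectors).
print(delay_table("/dev/sdX", 2097152, 500))
print(flakey_table("/dev/sdX", 2097152, 55, 5))
```

Such a line would typically be passed to "dmsetup create <name> --table '<line>'". The delay and the up/down intervals are fixed in the table, which is exactly the deterministic behaviour mentioned above.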


