[ClusterLabs] Antw: Re: Antw: [EXT] Re: Q: sbd: Which parameter controls "error: servant_md: slot read failed in servant."?

Klaus Wenninger kwenning at redhat.com
Thu Feb 17 07:34:07 EST 2022


On Thu, Feb 17, 2022 at 12:38 PM Ulrich Windl <
Ulrich.Windl at rz.uni-regensburg.de> wrote:

> >>> Klaus Wenninger <kwenning at redhat.com> wrote on 17.02.2022 at 10:49 in
> message
> <CALrDAo0UngyYybnv9xwve9V4suXvjOn-y8c8vD51ZR5LT1OpKw at mail.gmail.com>:
> ...
> >> For completeness: Yes, sbd did recover:
> >> Feb 14 13:01:42 h18 sbd[6615]:  warning: cleanup_servant_by_pid: Servant
> >> for /dev/disk/by-id/dm-name-SBD_1-3P1 (pid: 6619) has terminated
> >> Feb 14 13:01:42 h18 sbd[6615]:  warning: cleanup_servant_by_pid: Servant
> >> for /dev/disk/by-id/dm-name-SBD_1-3P2 (pid: 6621) has terminated
> >> Feb 14 13:01:42 h18 sbd[31668]: /dev/disk/by-id/dm-name-SBD_1-3P1:
> >>  notice: servant_md: Monitoring slot 4 on disk
> >> /dev/disk/by-id/dm-name-SBD_1-3P1
> >> Feb 14 13:01:42 h18 sbd[31669]: /dev/disk/by-id/dm-name-SBD_1-3P2:
> >>  notice: servant_md: Monitoring slot 4 on disk
> >> /dev/disk/by-id/dm-name-SBD_1-3P2
> >> Feb 14 13:01:49 h18 sbd[6615]:   notice: inquisitor_child: Servant
> >> /dev/disk/by-id/dm-name-SBD_1-3P1 is healthy (age: 0)
> >> Feb 14 13:01:49 h18 sbd[6615]:   notice: inquisitor_child: Servant
> >> /dev/disk/by-id/dm-name-SBD_1-3P2 is healthy (age: 0)
> >>
> >
> > Good to see that!
> > Did you try several times?
>
> Well, we only have two fabrics, and the server is in production, so each
> fabric was interrupted once (to change the cabling).
> sbd survived.
>
Yup - sometimes the entities you would have to fail are just too large to
treat as part of the playground/sandbox :-(

>
> Second fabric:
> Feb 14 13:03:51 h18 kernel: qla2xxx [0000:01:00.0]-500b:2: LOOP DOWN
> detected (2 7 0 0).
> Feb 14 13:03:57 h18 multipathd[5180]: SBD_1-3P2: remaining active paths: 3
> Feb 14 13:03:57 h18 multipathd[5180]: SBD_1-3P2: remaining active paths: 2
>
> Feb 14 13:05:18 h18 kernel: qla2xxx [0000:01:00.0]-500a:2: LOOP UP
> detected (8 Gbps).
> Feb 14 13:05:22 h18 multipathd[5180]: SBD_1-3P2: sdr - tur checker reports
> path is up
> Feb 14 13:05:22 h18 multipathd[5180]: SBD_1-3P2: remaining active paths: 3
> Feb 14 13:05:23 h18 multipathd[5180]: SBD_1-3P2: sdae - tur checker
> reports path is up
> Feb 14 13:05:23 h18 multipathd[5180]: SBD_1-3P2: remaining active paths: 4
> Feb 14 13:05:25 h18 multipathd[5180]: SBD_1-3P1: sdl - tur checker reports
> path is up
> Feb 14 13:05:25 h18 multipathd[5180]: SBD_1-3P1: remaining active paths: 3
> Feb 14 13:05:26 h18 multipathd[5180]: SBD_1-3P1: sdo - tur checker reports
> path is up
> Feb 14 13:05:26 h18 multipathd[5180]: SBD_1-3P1: remaining active paths: 4
>
> So this time multipath reacted before SBD noticed anything (the way it
> should have been anyway).
>
Depends on how you would like it to behave.
You are free to configure the io-timeout so that sbd never notices such an
outage. Or, if you would rather see a notice in the sbd logs and get the
added reliability of kicking off another read attempt instead of waiting for
a first - maybe doomed - one to finish, you give sbd enough time to retry
within your msgwait-timeout.
Unfortunately it isn't possible to have one-size-fits-all defaults here.
But feedback is welcome so that we can do a little tweaking that makes them
fit a larger audience.
I remember a case where devices stalled for 50 s during a firmware update
and that wasn't supposed to trigger fencing - definitely something that can't
be covered by defaults.
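As a rough illustration only - the numbers below are made up, the device path
is simply taken from your logs, and -I (sbd's async-I/O read timeout) may
differ in your version, so please check sbd(8) before copying anything - the
idea is to keep the read timeout well below msgwait so that a second attempt
still fits in:

# per-device timeouts live in the sbd header; "create" re-initializes the device!
sbd -d /dev/disk/by-id/dm-name-SBD_1-3P1 -1 30 -4 60 create   # -1 watchdog, -4 msgwait

# daemon-side read timeout, e.g. via SBD_OPTS in /etc/sysconfig/sbd
SBD_OPTS="-I 20"    # two 20 s read attempts still finish within msgwait=60

# verify what is actually stored on the device
sbd -d /dev/disk/by-id/dm-name-SBD_1-3P1 dump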


> > I have some memory that, when testing with the kernel mentioned before, the
> > behavior changed after a couple of timeouts and sbd wasn't able to create
> > the read request anymore (without the fix mentioned) - I assume some kind
> > of resource depletion due to previously hanging attempts not being
> > destroyed properly.
>
> That can be a nasty race condition, too, however. (I have had my share of
> signal handlers, threads and race conditions.)
> Of course cruder programming errors are possible, too.
>
It is one single-threaded process, and the effect was gone once the API was
handled properly - I mean the different behavior after a couple of retries
was gone. The basic issue was still present with that kernel.

> Debugging can be very hard, but dmsetup can create bad disks for testing
> for you ;-)
> DEV=bad_disk
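> # each table line is "<start sector> <length in sectors> <target>":
> # sectors 8 and 16 return I/O errors, everything else reads back zeros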
> dmsetup create "$DEV" <<EOF
> 0 8 zero
> 8 1 error
> 9 7 zero
> 16 1 error
> 17 255 zero
> EOF
>
We need to impose the problem dynamically.
Otherwise sbd wouldn't come up in the first place - which is of course a
useful test in itself as well.
At the moment regressions.sh uses wipe_table to impose an error dynamically,
but it does so on all blocks simultaneously. Since the periodic reading only
touches a single block anyway (plus the header, to be accurate), we should be
fine with that.
I saw that device-mapper also offers a possibility to delay I/O. That looks
useful for a CI test case that simulates what we have here - even multiple
times in a row, without upsetting customers ;-)
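Something along these lines could serve as a sketch for such a test case
(device name, backing device and delay value are made up; see dmsetup(8) and
the device-mapper delay-target documentation):

REAL=/dev/sdX                      # hypothetical backing device for the test
SZ=$(blockdev --getsz "$REAL")     # size in 512-byte sectors

# plain linear mapping first, so sbd can come up normally
dmsetup create slow_sbd --table "0 $SZ linear $REAL 0"

# later swap in a table that delays every I/O by 5000 ms ...
dmsetup reload  slow_sbd --table "0 $SZ delay $REAL 0 5000"
dmsetup suspend slow_sbd
dmsetup resume  slow_sbd

# ... and back again once the simulated outage is over
dmsetup reload  slow_sbd --table "0 $SZ linear $REAL 0"
dmsetup suspend slow_sbd
dmsetup resume  slow_sbd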

Regards,
Klaus


>
> Regards,
> Ulrich
> ...
>
>