[ClusterLabs] Antw: Re: Antw: [EXT] Re: Q: sbd: Which parameter controls "error: servant_md: slot read failed in servant."?

Ulrich Windl Ulrich.Windl at rz.uni-regensburg.de
Thu Feb 17 06:38:26 EST 2022


>>> Klaus Wenninger <kwenning at redhat.com> wrote on 17.02.2022 at 10:49 in
message
<CALrDAo0UngyYybnv9xwve9V4suXvjOn-y8c8vD51ZR5LT1OpKw at mail.gmail.com>:
...
>> For completeness: Yes, sbd did recover:
>> Feb 14 13:01:42 h18 sbd[6615]:  warning: cleanup_servant_by_pid: Servant
>> for /dev/disk/by-id/dm-name-SBD_1-3P1 (pid: 6619) has terminated
>> Feb 14 13:01:42 h18 sbd[6615]:  warning: cleanup_servant_by_pid: Servant
>> for /dev/disk/by-id/dm-name-SBD_1-3P2 (pid: 6621) has terminated
>> Feb 14 13:01:42 h18 sbd[31668]: /dev/disk/by-id/dm-name-SBD_1-3P1:
>>  notice: servant_md: Monitoring slot 4 on disk
>> /dev/disk/by-id/dm-name-SBD_1-3P1
>> Feb 14 13:01:42 h18 sbd[31669]: /dev/disk/by-id/dm-name-SBD_1-3P2:
>>  notice: servant_md: Monitoring slot 4 on disk
>> /dev/disk/by-id/dm-name-SBD_1-3P2
>> Feb 14 13:01:49 h18 sbd[6615]:   notice: inquisitor_child: Servant
>> /dev/disk/by-id/dm-name-SBD_1-3P1 is healthy (age: 0)
>> Feb 14 13:01:49 h18 sbd[6615]:   notice: inquisitor_child: Servant
>> /dev/disk/by-id/dm-name-SBD_1-3P2 is healthy (age: 0)
>>
> 
> Good to see that!
> Did you try several times?

Well, we only have two fabrics, and the server is in production, so both fabrics were interrupted once each (to change the cabling).
sbd survived.

Second fabric:
Feb 14 13:03:51 h18 kernel: qla2xxx [0000:01:00.0]-500b:2: LOOP DOWN detected (2 7 0 0).
Feb 14 13:03:57 h18 multipathd[5180]: SBD_1-3P2: remaining active paths: 3
Feb 14 13:03:57 h18 multipathd[5180]: SBD_1-3P2: remaining active paths: 2

Feb 14 13:05:18 h18 kernel: qla2xxx [0000:01:00.0]-500a:2: LOOP UP detected (8 Gbps).
Feb 14 13:05:22 h18 multipathd[5180]: SBD_1-3P2: sdr - tur checker reports path is up
Feb 14 13:05:22 h18 multipathd[5180]: SBD_1-3P2: remaining active paths: 3
Feb 14 13:05:23 h18 multipathd[5180]: SBD_1-3P2: sdae - tur checker reports path is up
Feb 14 13:05:23 h18 multipathd[5180]: SBD_1-3P2: remaining active paths: 4
Feb 14 13:05:25 h18 multipathd[5180]: SBD_1-3P1: sdl - tur checker reports path is up
Feb 14 13:05:25 h18 multipathd[5180]: SBD_1-3P1: remaining active paths: 3
Feb 14 13:05:26 h18 multipathd[5180]: SBD_1-3P1: sdo - tur checker reports path is up
Feb 14 13:05:26 h18 multipathd[5180]: SBD_1-3P1: remaining active paths: 4

So this time multipathd reacted before SBD noticed anything (which is how it should work anyway).
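
(Not from the original logs, just a pointer for anyone reproducing this: the path states and the checker settings that decide how quickly multipathd notices a dead path can be inspected with the standard tools. The map name below is the one from the excerpt above; depending on the multipath-tools version the interactive form "multipathd -k'show paths'" may be needed instead.)

multipath -ll SBD_1-3P2                  # per-path state of the map seen above
multipathd show paths                    # checker state of every path
multipathd show config | grep -E 'polling_interval|no_path_retry'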

> I have some memory that when testing with the kernel mentioned before
> behavior
> changed after a couple of timeouts and it wasn't able to create the
> read-request
> anymore (without the fix mentioned) - assume some kind of resource depletion
> due to previously hanging attempts not destroyed properly.

That could also be a nasty race condition, however. (I have had my share of signal handlers, threads and race conditions.)
Of course, cruder programming errors are possible, too.
Debugging can be very hard, but dmsetup can create bad disks for testing for you ;-)
DEV=bad_disk
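# each table line: <start sector> <length in sectors> <target>; "zero" returns zeros, "error" fails every I/O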
dmsetup create "$DEV" <<EOF
0 8 zero
8 1 error
9 7 zero
16 1 error
17 255 zero
EOF
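
(Purely illustrative, using the device created above; the sbd line is just one way to exercise it and is not from the original post.)
dd if=/dev/mapper/bad_disk of=/dev/null bs=512 count=8 iflag=direct           # sectors 0-7: reads fine
dd if=/dev/mapper/bad_disk of=/dev/null bs=512 skip=8 count=1 iflag=direct    # sector 8: fails with an I/O error
# e.g. point a test sbd at it: sbd -d /dev/mapper/bad_disk create
dmsetup remove "$DEV"                                                         # clean up when done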

Regards,
Ulrich
...



