[ClusterLabs] Q: fence_kdump and fence_kdump_send

Klaus Wenninger kwenning at redhat.com
Mon Feb 28 08:46:42 EST 2022


On Sat, Feb 26, 2022 at 7:14 AM Strahil Nikolov via Users <
users at clusterlabs.org> wrote:

> I always used this one for triggering kdump when using sbd:
> https://www.suse.com/support/kb/doc/?id=000019873
>
> On Fri, Feb 25, 2022 at 21:34, Reid Wahl
> <nwahl at redhat.com> wrote:
> On Fri, Feb 25, 2022 at 3:47 AM Andrei Borzenkov <arvidjaar at gmail.com>
> wrote:
> >
> > On Fri, Feb 25, 2022 at 2:23 PM Reid Wahl <nwahl at redhat.com> wrote:
> > >
> > > On Fri, Feb 25, 2022 at 3:22 AM Reid Wahl <nwahl at redhat.com> wrote:
> > > >
> > ...
> > > > >
> > > > > So what happens most likely is that the watchdog terminates the
> > > > > kdump. In that case all the mess with fence_kdump won't help, right?
> > > >
> > > > You can configure extra_modules in your /etc/kdump.conf file to
> > > > include the watchdog module, and then restart kdump.service. For
> > > > example:
> > > >
> > > > # grep ^extra_modules /etc/kdump.conf
> > > > extra_modules i6300esb
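> > > >
> > > > And then the restart mentioned above (assuming a systemd-based setup):
> > > >
> > > > # systemctl restart kdump.service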
> > > >
> > > > If you're not sure of the name of your watchdog module, wdctl can
> > > > help you find it. sbd needs to be stopped first, because it keeps the
> > > > watchdog device timer busy.
> > > >
> > > > # pcs cluster stop --all
> > > > # wdctl | grep Identity
> > > > Identity:      i6300ESB timer [version 0]
> > > > # lsmod | grep -i i6300ESB
> > > > i6300esb              13566  0
> > > >
> > > >
> > > > If you're also using fence_sbd (poison-pill fencing via block device),
> > > > then you should be able to protect yourself from that during a dump by
> > > > configuring fencing levels so that fence_kdump is level 1 and
> > > > fence_sbd is level 2.
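> > > >
> > > > For example, something along these lines (the node and stonith
> > > > resource names here are only placeholders, adjust them to your
> > > > configuration):
> > > >
> > > > # pcs stonith level add 1 node1 fence_kdump_node1
> > > > # pcs stonith level add 2 node1 fence_sbd_dev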
> > >
> > > RHKB, for anyone interested:
> > >  - sbd watchdog timeout causes node to reboot during crash kernel
> > > execution (https://access.redhat.com/solutions/3552201)
> >
> > What is not clear from this KB (and the quotes from it above) is what
> > exactly updates the watchdog. Quoting (emphasis mine):
> >
> > --><--
> > With the module loaded, the timer *CAN* be updated so that it does not
> > expire and force a reboot in the middle of vmcore generation.
> > --><--
> >
> > Sure it can, but what program exactly updates the watchdog during
> > kdump execution? I am pretty sure that sbd does not run at this point.
>
> That's a valid question. I found this approach to work back in 2018
> after a fair amount of frustration, and didn't question it too deeply
> at the time.
>
> The answer seems to be that the kernel does it.
>   - https://stackoverflow.com/a/2020717
>   - https://stackoverflow.com/a/42589110
>
I think in most cases nothing would be triggering the running watchdog,
except maybe in the case of the two drivers mentioned.
The behavior is: if no watchdog-timeout is defined for the crashdump case,
sbd will (at least try to) disable the watchdog.
If disabling isn't prohibited, and is possible with that particular watchdog,
this should leave the hardware watchdog genuinely disabled, with nothing
needing to trigger it anymore.
If the crashdump-watchdog-timeout is configured to the same value as the
watchdog-timeout that was engaged before, sbd won't touch the watchdog at all
(it closes the device without stopping it).
That being said, I'd suppose that the only somewhat production-safe
configuration is to set both watchdog-timeouts to the same value.
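
A rough sketch of what that could look like in /etc/sysconfig/sbd, with the
variable and option names written from memory, so please verify them against
sbd(8) and the SUSE KB linked above before relying on them:

SBD_WATCHDOG_DEV=/dev/watchdog
SBD_WATCHDOG_TIMEOUT=5                # the normal watchdog-timeout
SBD_TIMEOUT_ACTION=flush,crashdump    # crashdump instead of reboot on self-fence
# keep the watchdog armed with the same timeout while crashdumping
# (a value of 0 would instead disable the watchdog for the crashdump case)
SBD_OPTS="-C 5"
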
I doubt that we can assume that all I/O from the host that was initiated
prior to triggering the transition to the crashdump kernel is stopped
immediately, yet all other nodes will assume that I/O stops within
watchdog-timeout. And once we disable the watchdog, we can't be sure that
the subsequent transition to the crashdump kernel will even happen.
So leaving watchdog-timeout at the previous value seems to be the only way
to really ensure that the node is silenced by a hardware reset within the
timeout assumed by the rest of the nodes.
In case the watchdog driver has the running-detection mentioned in the links
above, the safe way would probably be to have the module removed from the
crash kernel.
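
On a RHEL-style setup, one possible way to do that, sketched here under the
assumption that the kdump initramfs is built by dracut (check your distro's
kdump documentation for the exact file and variable), would be to blacklist
the driver on the crash kernel's command line:

# /etc/sysconfig/kdump: append to the existing KDUMP_COMMANDLINE_APPEND value
KDUMP_COMMANDLINE_APPEND="<existing options> rd.driver.blacklist=i6300esb module_blacklist=i6300esb"

# rebuild the kdump initramfs afterwards
# systemctl restart kdump.service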

Klaus

>
> --
> Regards,
>
> Reid Wahl (He/Him), RHCA
> Senior Software Maintenance Engineer, Red Hat
> CEE - Platform Support Delivery - ClusterHA