[ClusterLabs] Antw: [EXT] Re: normal reboot with active sbd does not work

Klaus Wenninger kwenning at redhat.com
Tue Jun 7 04:28:17 EDT 2022


On Tue, Jun 7, 2022 at 7:53 AM Ulrich Windl
<Ulrich.Windl at rz.uni-regensburg.de> wrote:
>
> >>> Andrei Borzenkov <arvidjaar at gmail.com> wrote on 03.06.2022 at 17:04 in
> message <99f7746a-c962-33bb-6737-f88ba0128a7c at gmail.com>:
> > On 03.06.2022 16:51, Zoran Bošnjak wrote:
> >> Thanks for all your answers. Sorry, my mistake. The ipmi_watchdog is indeed
> >> OK. I was first experimenting with "softdog", which is blacklisted. So the
> >> reasonable question is how to properly start "softdog" on ubuntu.
> >>
> >
> > blacklist prevents autoloading of modules by alias during hardware
> > detection. Neither softdog nor ipmi_watchdog has any alias, so they
> > cannot be autoloaded and blacklist is irrelevant here.
> >
> >> The reason to unload watchdog module (ipmi or softdog) is that there seems
> >> to be a difference between normal reboot and watchdog reboot.
> >> In case of ipmi watchdog timer reboot:
> >> - the system hangs at the end of reboot cycle for some time
> >> - restart seems to be harder (like power off/on cycle), BIOS runs more
> >>   diagnostics at startup
>
> maybe kdump is enabled in that case?
>
> >> - it turns on HW diagnostic indication on the server front panel (dell
> >>   server) which stays on forever
> >> - it logs the event to IDRAC, which is unnecessary, because it was not a
> >>   hardware event, but just a normal reboot
>
> If the hardware watchdog times out and fires, it is considered to be an
> exceptional event that will be logged and reported.
>
> >>
> >> In case of "sudo reboot" command, I would like to skip this... so the idea
> >> is to fully stop the watchdog just before reboot. I am not sure how to do
> >> this properly.
> >>
> >> The "softdog" is better in this respect. It does not trigger anything from
> >> the list above, but I still get the message during reboot
> >> [ ... ] watchdog: watchdog0: watchdog did not stop!
> >> ... with some small timeout.
> >>
> >
> > The first obvious question - is there only one watchdog? Some watchdog
> > drivers *are* autoloaded.
> >
> > Is there only one user of watchdog? systemd may use it too, for example.
>
> Don't mix timers with a watchdog: It makes little sense to have multiple
> watchdogs enabled IMHO.

Yep, that is an issue atm when you have multiple users of a hardware watchdog,
like watchdog-daemon, sbd, corosync, systemd, ...

I'm not aware of an implementation that would provide multiple watchdog timers
with the usual char-device interface out of one physical watchdog.
Of course this should be relatively easy to implement, even in user space.
On our embedded devices we usually had something like a service that
would offer multiple timers to other instances.
The implementation of that service itself was guarded by a hardware watchdog
so that the derived timers would be as reliable as a hardware watchdog.
The last implementation was built into watchdog-daemon and offered a D-Bus interface.
What systemd has implemented is similarly interesting.
The current systemd implementation has a suspicious loop around it that prevents
it from being fit for sbd purposes, as it doesn't guarantee a reboot within
a reasonably short time.
This is why I haven't yet implemented the systemd file-descriptor approach
in sbd (as a configurable alternative to going for the device directly).
Approaching the systemd guys and asking why it is implemented as it is has
been on my todo list for a while now.
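
As a rough illustration of the derived-timer idea described above, here is a
minimal Python sketch. It is not the watchdog-daemon or systemd implementation;
the class name WatchdogMultiplexer, its methods, and the pet_hardware callback
are all hypothetical. The service only resets the real watchdog while every
registered client has sent its keepalive in time, so a single stuck client
lets the hardware timer run out.

```python
import time


class WatchdogMultiplexer:
    """Derive several software timers from one hardware watchdog (hypothetical sketch)."""

    def __init__(self, pet_hardware, clock=time.monotonic):
        self._pet_hardware = pet_hardware  # callable that resets the real watchdog
        self._clock = clock                # injectable clock, useful for testing
        self._clients = {}                 # name -> (timeout_seconds, last_keepalive)

    def register(self, name, timeout):
        """Add a client timer; it starts out as freshly kept alive."""
        self._clients[name] = (timeout, self._clock())

    def keepalive(self, name):
        """Record that the named client is still alive."""
        timeout, _ = self._clients[name]
        self._clients[name] = (timeout, self._clock())

    def expired(self):
        """Return the sorted names of clients whose timer has run out."""
        now = self._clock()
        return sorted(
            name
            for name, (timeout, last) in self._clients.items()
            if now - last > timeout
        )

    def tick(self):
        """Reset the hardware watchdog only if no client has expired."""
        if self.expired():
            return False  # stop petting; the hardware watchdog will fire
        self._pet_hardware()
        return True
```

In a real service, pet_hardware would write to the watchdog device and tick()
would run from a loop with a period shorter than the hardware timeout.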

If you are running multiple services on a host that don't offer something
like a common supervision main-loop it may make sense to offer a common
instance that offers something like a watchdog service.
For a node that has all services under pacemaker control this shouldn't be
needed as we have sbd observing pacemakerd. Pacemakerd in turn
observes the other pacemaker subdaemons (released with RHEL-8.6 and
iirc 2.1.3 upstream) guaranteeing that the monitors on the resources don't
get stuck.

Klaus
>
> >
> >> So after some additional testing, the situation is the following:
> >>
> >> - without any watchdog and without sbd package, the server reboots
> >>   normally
> >> - with "softdog" module loaded, I only get the "watchdog did not stop"
> >>   message at reboot
> >> - with "softdog" loaded, but unloaded with "ExecStop=...rmmod", reboot is
> >>   normal again
> >> - same as above, but with "sbd" package loaded, I am getting the "watchdog
> >>   did not stop" message again
> >> - switching from "softdog" to "ipmi_watchdog" gets me to the original list
> >>   of problems
> >>
> >> It looks like the "sbd" is preventing the watchdog from closing, so that
> >> the watchdog always triggers, even in the case of normal reboot. What am I
> >> missing here?
>
> The watchdog may have a "no way out" parameter that prevents disabling it
> once it has been enabled.
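
For background: the Linux watchdog API documents a "magic close" feature.
Drivers that support it only stop the timer if the character 'V' was written
to the device right before close, and a driver running with nowayout set
ignores even that. A close without the magic character is the usual cause of
messages like "watchdog did not stop!". The Python sketch below shows the two
ways of closing. The helper name close_watchdog is hypothetical, and whether
sbd gets the chance to do a disarming close during shutdown is not established
in this thread.

```python
MAGIC_CLOSE = b"V"  # magic character defined by the Linux watchdog API


def close_watchdog(dev, disarm):
    """Close a watchdog device handle, optionally asking the driver to disarm.

    dev is a binary file object, e.g. open("/dev/watchdog", "wb", buffering=0).
    With disarm=False the timer keeps running and the machine will be reset
    once it expires.
    """
    if disarm:
        dev.write(MAGIC_CLOSE)  # must be the last write before close
        dev.flush()
    dev.close()
```

With nowayout enabled the driver keeps the timer armed in both cases.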
>
> >
> > While the only way I can reproduce it on my QEMU VM is "reboot -f"
> > (without stopping all services), there is certainly a race condition in
> > sbd.service.
> >
> > ExecStop=@bindir@/kill -TERM $MAINPID
> >
> >
> > systemd will continue as soon as "kill" completes without waiting for
> > sbd to actually stop. It means systemd may complete shutdown sequence
> > before sbd had chance to react on signal and then simply kill it. Which
> > leaves watchdog armed.
> >
> > For testing purposes, try using a script that loops until sbd is actually
> > stopped as ExecStop.
> >
> > Note that systemd strongly recommends using a synchronous command for
> > ExecStop (we may argue that this should be handled by the service manager
> > itself, but well ...).
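
The "loop until sbd is actually stopped" idea for testing can be sketched in
Python as follows. The function name stop_and_wait and its timeout and poll
parameters are hypothetical; this is a test aid under those assumptions, not
what sbd.service ships.

```python
import os
import signal
import time


def stop_and_wait(pid, timeout=30.0, poll=0.1):
    """Send SIGTERM to pid and block until it is gone.

    Returns True if the process disappeared within timeout seconds,
    False otherwise.
    """
    try:
        os.kill(pid, signal.SIGTERM)
    except ProcessLookupError:
        return True  # process was already gone
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            os.kill(pid, 0)  # signal 0 only checks that the process exists
        except ProcessLookupError:
            return True
        time.sleep(poll)
    return False
```

A wrapper script used as ExecStop could call this with $MAINPID. Note that a
process that has exited but has not yet been reaped by its parent still counts
as existing for the signal 0 check.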
> >
> >>
> >> Zoran
> >>
> >> ----- Original Message -----
> >> From: "Andrei Borzenkov" <arvidjaar at gmail.com>
> >> To: "users" <users at clusterlabs.org>
> >> Sent: Friday, June 3, 2022 11:24:03 AM
> >> Subject: Re: [ClusterLabs] normal reboot with active sbd does not work
> >>
> >> On 03.06.2022 11:18, Zoran Bošnjak wrote:
> >>> Hi all,
> >>> I would appreciate some advice about sbd fencing (without shared storage).
> >>>
> >>> I am using ubuntu 20.04., with default packages from the repository
> >>> (pacemaker, corosync, fence-agents, ipmitool, pcs...).
> >>>
> >>> HW watchdog is present on servers. The first problem was to load/unload the
> >>> watchdog module. For some reason the module is blacklisted on ubuntu,
> >>
> >> What makes you think so?
> >>
> >> bor at bor-Latitude-E5450:~$ lsb_release  -d
> >>
> >> Description: Ubuntu 20.04.4 LTS
> >>
> >> bor at bor-Latitude-E5450:~$ modprobe -c | grep ipmi_watchdog
> >>
> >> bor at bor-Latitude-E5450:~$
> >>
> >>
> >>
> >>
> >>
> >>> so I've created a service for this purpose.
> >>>
> >>
> >> man modules-load.d
> >>
> >>
> >>> --- file: /etc/systemd/system/watchdog.service
> >>> [Unit]
> >>> Description=Load watchdog timer module
> >>> After=syslog.target
> >>>
> >>
> >> Without any explicit dependencies stop will be attempted as soon as
> >> possible.
> >>
> >>> [Service]
> >>> Type=oneshot
> >>> RemainAfterExit=yes
> >>> ExecStart=/sbin/modprobe ipmi_watchdog
> >>> ExecStop=/sbin/rmmod ipmi_watchdog
> >>>
> >>
> >> Why on earth do you need to unload kernel driver when system reboots?
> >>
> >>> [Install]
> >>> WantedBy=multi-user.target
> >>> ---
> >>>
> >>> Is this a proper way to load watchdog module under ubuntu?
> >>>
> >>
> >> There is a standard way to load non-autoloaded drivers on *any*
> >> systemd-based distribution, which is modules-load.d.
> >>
> >>> Anyway, once the module is loaded, the /dev/watchdog (which is required by
> >>> 'sbd') is present.
> >>> Next, the 'sbd' is installed by
> >>>
> >>> sudo apt install sbd
> >>> (followed by one reboot to get the sbd active)
> >>>
> >>> The configuration of the 'sbd' is default. The sbd reacts to network failure
> >>> as expected (reboots the server). However, when the 'sbd' is active, the
> >>> server won't reboot normally any more. For example from the command line
> >>> "sudo reboot", it gets stuck at the end of the reboot sequence. There is a
> >>> message on the console:
> >>>
> >>> ... reboot progress
> >>> [ OK ] Finished Reboot.
> >>> [ OK ] Reached target Reboot.
> >>> [ ... ] IPMI Watchdog: Unexpected close, not stopping watchdog!
> >>> [ ... ] IPMI Watchdog: Unexpected close, not stopping watchdog!
> >>> ... it gets stuck at this point
> >>>
> >>> After some long timeout, it looks like the watchdog timer expires and server
> >>> boots, but the failure indication remains on the front panel of the server.
> >>> If I uninstall the 'sbd' package, the "sudo reboot" works normally again.
> >>>
> >>> My question is: How do I configure the system, to have the 'sbd' function
> >>> present, but still be able to reboot the system normally.
> >>>
> >>
> >> As the first step - do not unload watchdog driver on shutdown.
> >> _______________________________________________
> >> Manage your subscription:
> >> https://lists.clusterlabs.org/mailman/listinfo/users
> >>
> >> ClusterLabs home: https://www.clusterlabs.org/


