[ClusterLabs] Antw: [EXT] Re: normal reboot with active sbd does not work

Ulrich Windl Ulrich.Windl at rz.uni-regensburg.de
Tue Jun 7 01:52:51 EDT 2022


>>> Andrei Borzenkov <arvidjaar at gmail.com> wrote on 03.06.2022 at 17:04 in
message <99f7746a-c962-33bb-6737-f88ba0128a7c at gmail.com>:
> On 03.06.2022 16:51, Zoran Bošnjak wrote:
>> Thanks for all your answers. Sorry, my mistake. The ipmi_watchdog is indeed
>> OK. I was first experimenting with "softdog", which is blacklisted. So the
>> reasonable question is how to properly start "softdog" on Ubuntu.
>> 
> 
> A blacklist entry only prevents modules from being autoloaded by alias
> during hardware detection. Neither softdog nor ipmi_watchdog has any alias,
> so they cannot be autoloaded and the blacklist is irrelevant here.
> 
>> The reason to unload the watchdog module (ipmi or softdog) is that there
>> seems to be a difference between a normal reboot and a watchdog reboot.
>> In case of an IPMI watchdog timer reboot:
>> - the system hangs at the end of the reboot cycle for some time
>> - the restart seems to be harder (like a power off/on cycle); the BIOS runs
>>   more diagnostics at startup

Maybe kdump is enabled in that case?

>> - it turns on the HW diagnostic indication on the server front panel (Dell
>>   server), which stays on forever
>> - it logs the event to iDRAC, which is unnecessary, because it was not a
>>   hardware event, just a normal reboot

If the hardware watchdog times out and fires, that is considered an
exceptional event, which will be logged and reported.

>> 
>> In case of the "sudo reboot" command, I would like to skip all this... so the
>> idea is to fully stop the watchdog just before reboot. I am not sure how to
>> do this properly.
>> 
>> The "softdog" is better in this respect. It does not trigger anything from
>> the list above, but I still get this message during reboot
>> [ ... ] watchdog: watchdog0: watchdog did not stop!
>> ... with some small timeout.
>> 
> 
> The first obvious question: is there only one watchdog? Some watchdog
> drivers *are* autoloaded.
> 
> Is there only one user of the watchdog? systemd may use it too, for example.

Don't mix timers with a watchdog: it makes little sense to have multiple
watchdogs enabled, IMHO.

> 
>> So after some additional testing, the situation is the following:
>> 
>> - without any watchdog and without the sbd package, the server reboots
>>   normally
>> - with the "softdog" module loaded, I only get the "watchdog did not stop"
>>   message at reboot
>> - with "softdog" loaded, but unloaded with "ExecStop=...rmmod", reboot is
>>   normal again
>> - same as above, but with the "sbd" package installed, I get the "watchdog
>>   did not stop" message again
>> - switching from "softdog" to "ipmi_watchdog" brings back the original list
>>   of problems
>> 
>> It looks like "sbd" is preventing the watchdog from being closed, so the
>> watchdog always triggers, even on a normal reboot. What am I missing here?

The watchdog driver may have a "no way out" (nowayout) setting that prevents
it from being disabled once it has been enabled.
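A minimal sketch for checking how many watchdog devices are registered and whether nowayout is set on them. It assumes the kernel exposes the watchdog core's sysfs attributes (identity, state, nowayout) under /sys/class/watchdog; attributes a given kernel or driver does not provide are printed as "?". Drivers that create /dev/watchdog without going through the kernel's watchdog core do not show up there at all. As far as I know ipmi_watchdog is one of them (its "Unexpected close" message is its own, unlike softdog's "watchdog0: watchdog did not stop!"), so an empty listing does not prove there is no watchdog. The function names list_watchdogs and read_attr are hypothetical.

```shell
#!/bin/sh
# Sketch: list watchdog devices registered with the kernel's watchdog core.
# The sysfs root defaults to /sys/class/watchdog; another directory can be
# passed as $1 (e.g. a fake tree for testing).

# Print one sysfs attribute, or "?" if it is missing or unreadable.
read_attr() {
    cat "$1" 2>/dev/null || echo '?'
}

list_watchdogs() {
    root=${1:-/sys/class/watchdog}
    for dev in "$root"/watchdog*; do
        # If nothing matched, the glob stays literal and is not a directory.
        [ -d "$dev" ] || continue
        printf '%s identity=%s state=%s nowayout=%s\n' \
            "${dev##*/}" \
            "$(read_attr "$dev/identity")" \
            "$(read_attr "$dev/state")" \
            "$(read_attr "$dev/nowayout")"
    done
}

list_watchdogs "$@"
```

More than one line of output means more than one watchdog is registered. To see which process currently holds the device open (sbd, systemd or something else), "sudo fuser -v /dev/watchdog" (from psmisc) can be used.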

> 
> Although the only way I can reproduce it on my QEMU VM is with "reboot -f"
> (which does not stop services first), there is certainly a race condition
> in sbd.service:
> 
> ExecStop=@bindir@/kill -TERM $MAINPID
> 
> 
> systemd will continue as soon as "kill" completes, without waiting for
> sbd to actually stop. This means systemd may complete the shutdown sequence
> before sbd has had a chance to react to the signal, and then simply kill it,
> which leaves the watchdog armed.
> 
> For testing purposes, try using a script that loops until sbd has actually
> stopped as the ExecStop command.
> 
> Note that systemd strongly recommends using a synchronous command for
> ExecStop (one could argue that this should be handled by the service manager
> itself, but well ...).
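A minimal sketch of the synchronous stop script suggested above: it sends SIGTERM and then polls with "kill -0" (which sends no signal, only checks that the PID still exists) until the process has exited, giving up after a timeout. The script name, its location (for example /usr/local/sbin/sbd-stop-wait) and the 30-second default are assumptions for illustration; this is a test aid, not what the sbd package ships.

```shell
#!/bin/sh
# Hypothetical ExecStop helper (name and path are assumptions).
# Unlike a bare "kill -TERM $MAINPID", it does not return until the
# process is gone, so systemd cannot move on while sbd is still stopping.
#
# Usage: sbd-stop-wait PID [TIMEOUT_SECONDS]
# Exit status: 0 if the process is gone, 1 if still alive at the timeout.
stop_and_wait() {
    pid=$1
    timeout=${2:-30}
    # If the signal cannot be delivered, the process is most likely gone.
    kill -TERM "$pid" 2>/dev/null || return 0
    ticks=0
    while kill -0 "$pid" 2>/dev/null; do
        if [ "$ticks" -ge $((timeout * 10)) ]; then
            echo "process $pid still running after ${timeout}s" >&2
            return 1
        fi
        sleep 0.1   # fractional sleep: GNU coreutils/busybox, not strict POSIX
        ticks=$((ticks + 1))
    done
    return 0
}

# Only run when called with arguments (as from ExecStop).
if [ "$#" -ge 1 ]; then
    stop_and_wait "$@"
    exit $?
fi
```

It could be wired in through a drop-in created with "systemctl edit sbd.service", for example:

    [Service]
    ExecStop=
    ExecStop=/usr/local/sbin/sbd-stop-wait $MAINPID 30

The empty "ExecStop=" line clears the packaged command before the new one is added. The helper's timeout should stay below the unit's TimeoutStopSec. Simply clearing ExecStop without adding a replacement should have a similar effect, since systemd then sends SIGTERM itself and waits for the process to exit.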
> 
>> 
>> Zoran
>> 
>> ----- Original Message -----
>> From: "Andrei Borzenkov" <arvidjaar at gmail.com>
>> To: "users" <users at clusterlabs.org>
>> Sent: Friday, June 3, 2022 11:24:03 AM
>> Subject: Re: [ClusterLabs] normal reboot with active sbd does not work
>> 
>> On 03.06.2022 11:18, Zoran Bošnjak wrote:
>>> Hi all,
>>> I would appreciate some advice about sbd fencing (without shared storage).
>>>
>>> I am using Ubuntu 20.04, with the default packages from the repository
>>> (pacemaker, corosync, fence-agents, ipmitool, pcs...).
>>>
>>> A HW watchdog is present on the servers. The first problem was to
>>> load/unload the watchdog module. For some reason the module is blacklisted
>>> on Ubuntu,
>> 
>> What makes you think so?
>> 
>> bor at bor-Latitude-E5450:~$ lsb_release  -d
>> Description:	Ubuntu 20.04.4 LTS
>> bor at bor-Latitude-E5450:~$ modprobe -c | grep ipmi_watchdog
>> bor at bor-Latitude-E5450:~$
>> 
>>> so I've created a service for this purpose.
>>>
>> 
>> man modules-load.d
>> 
>> 
>>> --- file: /etc/systemd/system/watchdog.service
>>> [Unit]
>>> Description=Load watchdog timer module
>>> After=syslog.target
>>>
>> 
>> Without any explicit dependencies, stop will be attempted as soon as
>> possible.
>> 
>>> [Service]
>>> Type=oneshot
>>> RemainAfterExit=yes
>>> ExecStart=/sbin/modprobe ipmi_watchdog
>>> ExecStop=/sbin/rmmod ipmi_watchdog
>>>
>> 
>> Why on earth do you need to unload the kernel driver when the system reboots?
>> 
>>> [Install]
>>> WantedBy=multi-user.target
>>> ---
>>>
>>> Is this a proper way to load the watchdog module under Ubuntu?
>>>
>> 
>> There is a standard way to load non-autoloaded drivers on *any* systemd
>> based distribution: modules-load.d.
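A minimal sketch of the modules-load.d approach: the entry is just a file listing one module name per line, read at boot by systemd-modules-load.service, so no custom unit with "ExecStop=rmmod" is needed (and nothing unloads the driver at shutdown). The file name watchdog.conf and the function name are arbitrary choices, and softdog is only an example module; on the command line this amounts to "echo softdog | sudo tee /etc/modules-load.d/watchdog.conf".

```shell
#!/bin/sh
# Sketch: register a kernel module to be loaded at boot by
# systemd-modules-load.service (see "man modules-load.d").
# Usage: write-modules-load MODULE [DIR]   (DIR defaults to /etc/modules-load.d)
write_modules_load_conf() {
    module=$1
    dir=${2:-/etc/modules-load.d}
    mkdir -p "$dir" || return 1
    # One module name per line; lines starting with '#' or ';' are comments.
    printf '# loaded at boot for sbd\n%s\n' "$module" > "$dir/watchdog.conf"
}

# Only run when called with arguments.
if [ "$#" -ge 1 ]; then
    write_modules_load_conf "$@"
fi
```

After writing the file, "sudo modprobe softdog" loads the module right away instead of waiting for the next boot.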
>> 
>>> Anyway, once the module is loaded, /dev/watchdog (which is required by
>>> 'sbd') is present.
>>> Next, 'sbd' is installed with
>>>
>>> sudo apt install sbd
>>> (followed by one reboot to make sbd active)
>>>
>>> The configuration of 'sbd' is the default. sbd reacts to a network failure
>>> as expected (it reboots the server). However, when 'sbd' is active, the
>>> server won't reboot normally any more. For example, after "sudo reboot"
>>> from the command line, it gets stuck at the end of the reboot sequence.
>>> There is a message on the console:
>>>
>>> ... reboot progress
>>> [ OK ] Finished Reboot.
>>> [ OK ] Reached target Reboot.
>>> [ ... ] IPMI Watchdog: Unexpected close, not stopping watchdog!
>>> [ ... ] IPMI Watchdog: Unexpected close, not stopping watchdog!
>>> ... it gets stuck at this point
>>>
>>> After some long timeout, it looks like the watchdog timer expires and the
>>> server boots, but the failure indication remains on the front panel.
>>> If I uninstall the 'sbd' package, "sudo reboot" works normally again.
>>>
>>> My question is: how do I configure the system to have the 'sbd' function
>>> present, but still be able to reboot the system normally?
>>>
>> 
>> As a first step, do not unload the watchdog driver on shutdown.
>> _______________________________________________
>> Manage your subscription:
>> https://lists.clusterlabs.org/mailman/listinfo/users 
>> 
>> ClusterLabs home: https://www.clusterlabs.org/ 
>> _______________________________________________
>> Manage your subscription:
>> https://lists.clusterlabs.org/mailman/listinfo/users 
>> 
>> ClusterLabs home: https://www.clusterlabs.org/ 
> 
> _______________________________________________
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users 
> 
> ClusterLabs home: https://www.clusterlabs.org/ 





More information about the Users mailing list