[ClusterLabs] Antw: [EXT] Re: normal reboot with active sbd does not work
Ulrich Windl
Ulrich.Windl at rz.uni-regensburg.de
Tue Jun 7 01:52:51 EDT 2022
>>> Andrei Borzenkov <arvidjaar at gmail.com> wrote on 03.06.2022 at 17:04 in
message <99f7746a-c962-33bb-6737-f88ba0128a7c at gmail.com>:
> On 03.06.2022 16:51, Zoran Bošnjak wrote:
>> Thanks for all your answers. Sorry, my mistake. The ipmi_watchdog is indeed
>> OK. I was first experimenting with "softdog", which is blacklisted. So the
>> reasonable question is how to properly start "softdog" on ubuntu.
>>
>
> blacklist prevents autoloading of modules by alias during hardware
> detection. Neither softdog nor ipmi_watchdog has any alias, so they
> cannot be autoloaded and blacklist is irrelevant here.
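
One way to check both points on a given machine is to ask modinfo for the module's aliases and to search the effective modprobe configuration for a blacklist entry. The following is a minimal sketch; the helper name blacklist_lines is made up for illustration, and the results depend on the distribution's kernel and modprobe.d files.

```shell
#!/bin/sh
# blacklist_lines MODULE
# Hypothetical helper: reads a modprobe configuration dump on stdin
# (for example the output of "modprobe -c") and prints every
# "blacklist MODULE" line. Always returns 0, even when nothing matches.
blacklist_lines() {
    grep -E "^blacklist[[:space:]]+$1[[:space:]]*\$" || true
}

# Typical use, left commented out so that sourcing this file has no effect:
#   modprobe -c | blacklist_lines softdog
#   modinfo -F alias softdog      # prints nothing if the module has no alias
```

An empty result from both commands would mean the module is neither blacklisted nor autoloadable by alias, so it has to be loaded explicitly.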
>
>> The reason to unload watchdog module (ipmi or softdog) is that there seems
>> to be a difference between normal reboot and watchdog reboot.
>> In case of ipmi watchdog timer reboot:
>> - the system hangs at the end of reboot cycle for some time
>> - restart seems to be harder (like power off/on cycle), BIOS runs more
>> diagnostics at startup
Maybe kdump is enabled in that case?
>> - it turns on HW diagnostic indication on the server front panel (dell
>> server) which stays on forever
>> - it logs the event to IDRAC, which is unnecessary, because it was not a
>> hardware event, but just a normal reboot
If the hardware watchdog times out and fires, it is considered to be an
exceptional event that will be logged and reported.
>>
>> In case of "sudo reboot" command, I would like to skip this... so the idea
>> is to fully stop the watchdog just before reboot. I am not sure how to do
>> this properly.
>>
>> The "softdog" is better in this respect. It does not trigger anything from
>> the list above, but I still get the message during reboot
>> [ ... ] watchdog: watchdog0: watchdog did not stop!
>> ... with some small timeout.
>>
>
> The first obvious question - is there only one watchdog? Some watchdog
> drivers *are* autoloaded.
>
> Is there only one user of watchdog? systemd may use it too as example.
Don't mix timers with a watchdog: It makes little sense to have multiple
watchdogs enabled IMHO.
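
To answer the "only one watchdog, only one user" questions on a concrete system, it can help to list the device nodes first. This is a small sketch; list_watchdogs is a made-up helper name, and it only looks at file names.

```shell
#!/bin/sh
# list_watchdogs [DIR]
# Hypothetical helper: print every watchdog* node found in DIR
# (default /dev), one per line. Prints nothing if there is none.
list_watchdogs() {
    dir=${1:-/dev}
    for node in "$dir"/watchdog*; do
        # An unmatched glob stays literal, so test that the node exists.
        [ -e "$node" ] && echo "$node"
    done
    return 0
}

# Typical use, left commented out so that sourcing this file has no effect:
#   list_watchdogs
```

With drivers that use the kernel watchdog core, /dev/watchdog is typically the legacy name for /dev/watchdog0, so seeing both does not by itself mean two watchdogs. As root, "fuser -v /dev/watchdog" or "lsof /dev/watchdog" shows which process holds the device open, and "systemctl show -p RuntimeWatchdogUSec" shows whether systemd itself is configured to use a watchdog.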
>
>> So after some additional testing, the situation is the following:
>>
>> - without any watchdog and without sbd package, the server reboots
>> normally
>> - with "softdog" module loaded, I only get the "watchdog did not stop"
>> message at reboot
>> - with "softdog" loaded, but unloaded with "ExecStop=...rmmod", reboot is
>> normal again
>> - same as above, but with "sbd" package loaded, I am getting the "watchdog
>> did not stop" message again
>> - switching from "softdog" to "ipmi_watchdog" gets me to the original list
>> of problems
>>
>> It looks like the "sbd" is preventing the watchdog from closing, so that
>> the watchdog always triggers, even in the case of a normal reboot. What am
>> I missing here?
The watchdog may have a "no way out" parameter that prevents disabling it
once it has been enabled.
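
Whether "no way out" is in effect can often be read from sysfs. The following is a sketch, assuming a driver registered with the kernel watchdog core and a kernel built with watchdog sysfs support; the function name nowayout_state is made up, and drivers that bypass the watchdog core may not provide this file at all.

```shell
#!/bin/sh
# nowayout_state [FILE]
# Hypothetical helper: report "enabled", "disabled" or "unknown" based on
# the content of a nowayout attribute file. The default path is an
# assumption and only exists with watchdog sysfs support.
nowayout_state() {
    file=${1:-/sys/class/watchdog/watchdog0/nowayout}
    if [ ! -r "$file" ]; then
        echo unknown
        return 0
    fi
    # Accept both 1/0 and Y/N, since attribute formats differ.
    case $(cat "$file") in
        1|Y|y) echo enabled ;;
        0|N|n) echo disabled ;;
        *)     echo unknown ;;
    esac
}

# Typical use, left commented out so that sourcing this file has no effect:
#   nowayout_state
```

"modinfo -p softdog" (or the same for ipmi_watchdog) lists the parameters a module accepts, which shows whether a nowayout option exists for it.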
>
> While the only way I can reproduce it on my QEMU VM is "reboot -f"
> (without stopping all services), there is certainly a race condition in
> sbd.service.
>
> ExecStop=@bindir@/kill -TERM $MAINPID
>
>
> systemd will continue as soon as "kill" completes without waiting for
> sbd to actually stop. It means systemd may complete the shutdown sequence
> before sbd has had a chance to react to the signal and then simply kill it,
> which leaves the watchdog armed.
>
> For testing purposes, try using a script for ExecStop that loops until sbd
> has actually stopped.
>
> Note that systemd strongly recommends using a synchronous command for
> ExecStop (we may argue that this should be handled by the service manager
> itself, but well ...).
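
The loop suggested above could look roughly like this. It is a sketch for testing, not the sbd package's own stop logic; the function name stop_and_wait, the 30 second default and the install path in the note below are assumptions.

```shell
#!/bin/sh
# stop_and_wait PID [TIMEOUT_SECONDS]
# Send SIGTERM to PID, then poll once per second until the process is
# gone or the timeout expires.
# Returns 0 if the process is gone, 1 if it is still there at timeout.
stop_and_wait() {
    pid=$1
    timeout=${2:-30}
    # If the signal cannot be delivered (process already gone, or not
    # ours to signal), there is nothing to wait for.
    kill -TERM "$pid" 2>/dev/null || return 0
    while [ "$timeout" -gt 0 ]; do
        # "kill -0" sends no signal; it only tests whether PID exists.
        kill -0 "$pid" 2>/dev/null || return 0
        sleep 1
        timeout=$((timeout - 1))
    done
    kill -0 "$pid" 2>/dev/null || return 0
    return 1
}
```

Saved as an executable script that ends with stop_and_wait "$1" (for example under the hypothetical path /usr/local/sbin/sbd-stop-wait), it could be tried in a drop-in as ExecStop=/usr/local/sbin/sbd-stop-wait $MAINPID. Under systemd the stopped daemon is reaped by PID 1, so the "kill -0" probe sees it disappear; when run from the daemon's own parent shell, an unreaped zombie would still count as alive.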
>
>>
>> Zoran
>>
>> ----- Original Message -----
>> From: "Andrei Borzenkov" <arvidjaar at gmail.com>
>> To: "users" <users at clusterlabs.org>
>> Sent: Friday, June 3, 2022 11:24:03 AM
>> Subject: Re: [ClusterLabs] normal reboot with active sbd does not work
>>
>> On 03.06.2022 11:18, Zoran Bošnjak wrote:
>>> Hi all,
>>> I would appreciate advice about sbd fencing (without shared storage).
>>>
>>> I am using ubuntu 20.04., with default packages from the repository
>>> (pacemaker, corosync, fence-agents, ipmitool, pcs...).
>>>
>>> HW watchdog is present on servers. The first problem was to load/unload the
>>> watchdog module. For some reason the module is blacklisted on ubuntu,
>>
>> What makes you think so?
>>
>> bor@bor-Latitude-E5450:~$ lsb_release -d
>> Description: Ubuntu 20.04.4 LTS
>> bor@bor-Latitude-E5450:~$ modprobe -c | grep ipmi_watchdog
>> bor@bor-Latitude-E5450:~$
>>
>>> so I've created a service for this purpose.
>>>
>>
>> man modules-load.d
>>
>>
>>> --- file: /etc/systemd/system/watchdog.service
>>> [Unit]
>>> Description=Load watchdog timer module
>>> After=syslog.target
>>>
>>
>> Without any explicit dependencies stop will be attempted as soon as
>> possible.
>>
>>> [Service]
>>> Type=oneshot
>>> RemainAfterExit=yes
>>> ExecStart=/sbin/modprobe ipmi_watchdog
>>> ExecStop=/sbin/rmmod ipmi_watchdog
>>>
>>
>> Why on earth do you need to unload kernel driver when system reboots?
>>
>>> [Install]
>>> WantedBy=multi-user.target
>>> ---
>>>
>>> Is this a proper way to load watchdog module under ubuntu?
>>>
>>
>> There is a standard way to load non-autoloaded drivers on *any* systemd
>> based distribution: modules-load.d.
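
As a sketch of the modules-load.d approach: a file in /etc/modules-load.d with one module name per line is read by systemd-modules-load.service at boot. The helper name write_modules_load and the resulting file name are made up for illustration.

```shell
#!/bin/sh
# write_modules_load MODULE [DIR]
# Hypothetical helper: create DIR/MODULE.conf containing just the module
# name. DIR defaults to /etc/modules-load.d, which needs root to write.
write_modules_load() {
    module=$1
    dir=${2:-/etc/modules-load.d}
    printf '%s\n' "$module" > "$dir/$module.conf"
}

# Typical use as root, left commented out so that sourcing this file
# has no effect:
#   write_modules_load ipmi_watchdog
```

The file only takes effect at the next boot; "modprobe ipmi_watchdog" loads the module immediately. No ExecStop or rmmod is involved, which matches the advice not to unload the driver on shutdown.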
>>
>>> Anyway, once the module is loaded, the /dev/watchdog (which is required by
>>> 'sbd') is present.
>>> Next, the 'sbd' is installed by
>>>
>>> sudo apt install sbd
>>> (followed by one reboot to get the sbd active)
>>>
>>> The configuration of the 'sbd' is default. The sbd reacts to network failure
>>> as expected (reboots the server). However, when the 'sbd' is active, the
>>> server won't reboot normally any more. For example from the command line
>>> "sudo reboot", it gets stuck at the end of the reboot sequence. There is a
>>> message on the console:
>>>
>>> ... reboot progress
>>> [ OK ] Finished Reboot.
>>> [ OK ] Reached target Reboot.
>>> [ ... ] IPMI Watchdog: Unexpected close, not stopping watchdog!
>>> [ ... ] IPMI Watchdog: Unexpected close, not stopping watchdog!
>>> ... it gets stuck at this point
>>>
>>> After some long timeout, it looks like the watchdog timer expires and server
>>> boots, but the failure indication remains on the front panel of the server.
>>> If I uninstall the 'sbd' package, the "sudo reboot" works normally again.
>>>
>>> My question is: How do I configure the system to have the 'sbd' function
>>> present, but still be able to reboot the system normally?
>>>
>>
>> As the first step - do not unload watchdog driver on shutdown.
>> _______________________________________________
>> Manage your subscription:
>> https://lists.clusterlabs.org/mailman/listinfo/users
>>
>> ClusterLabs home: https://www.clusterlabs.org/