[ClusterLabs] Antw: [EXT] Re: Still Beginner STONITH Problem

Ulrich Windl Ulrich.Windl at rz.uni-regensburg.de
Mon Jul 13 05:32:20 EDT 2020


>>> Klaus Wenninger <kwenning at redhat.com> wrote on 07.07.2020 at 11:36 in
message <e5f6407e-a4b0-7257-aeb9-02b9e9583d92 at redhat.com>:
> On 7/7/20 11:12 AM, Strahil Nikolov wrote:
>>> With kvm please use the qemu‑watchdog and try to
>>> avoid using softdog with SBD.
>>> Especially if you are aiming for a production‑cluster ...
>> You can tell that to the previous company I worked for :D .
>> All clusters were using softdog on SLES 11/12 despite the hardware having
>> its own.
> Yes I know opinions regarding softdog do diverge a bit.

Some time ago (no cluster involved) I had a hard system-freeze with SUSE Leap
15.1, obviously while swapping to/from an encrypted SSD.
My guess is that softdog would not be able to reset that state, and SBD also
won't unless using the hardware watchdog.
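
To see which watchdog a node actually has, something like the following might help. It is only a minimal sketch: it assumes the kernel exposes /sys/class/watchdog/*/identity (which needs watchdog sysfs support), and the function name and the sysfs_root parameter are made up for illustration. I would expect softdog to identify itself as a software watchdog there, but please verify that on your own system.

```python
from pathlib import Path


def watchdog_identities(sysfs_root="/sys/class/watchdog"):
    """Map each watchdog device name to the text of its 'identity' attribute.

    sysfs_root is a hypothetical parameter so the function can be pointed
    at a test directory; the default is the usual sysfs location.
    """
    root = Path(sysfs_root)
    found = {}
    if not root.is_dir():
        return found  # no watchdog class at all (or sysfs support missing)
    for dev in sorted(root.iterdir()):
        ident = dev / "identity"
        if ident.is_file():
            found[dev.name] = ident.read_text().strip()
    return found


if __name__ == "__main__":
    for name, identity in watchdog_identities().items():
        print(f"{name}: {identity}")
```

An empty result does not prove there is no watchdog; it may just mean the sysfs attributes are not available on that kernel.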

But we also had an issue with the hardware watchdog (it did reset, but the
machine didn't boot). It required some parameter change which I cannot remember
any more (sorry!).
It seems to have been related to "unexpected NMI" and the setting of "panic=".
Some text I found says:
---
The watchdog hardware is programmed to send the Linux kernel an NMI prior
to the hardware resetting the system.  In a totally unresponsive system,
the NMI wouldn't be processed and the hardware would reset the system.

If the system is responsive, the timeout is canceled and panic is called
to initiate a crash.  If kdump is properly configured, a crash dump should
be collected and the system rebooted afterwards.

If kdump is not configured, a tombstone message is displayed upon the
console.  Dependent upon whether the linux kernel is booted with the
command line parameter "panic=N", the system will either reset
after N seconds, or if the parameter is not specified, the system will
sit forever waiting for a human to reset it.  This "wait forever" is
to allow user to see/record reason system crashed.

The default configuration for sles11sp4 is to configure kdump on at
install time.  By default Sles11sp4 does not specify the "panic=N" 
parameter to Linux command line.

---
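
To check which of the cases above applies to a given boot, the "panic=N" rule can be sketched roughly like this. It is an illustration under assumptions, not a definitive tool: the function names are made up, I assume the last "panic=" on the command line wins, and it looks only at the command line, so a later change of the kernel.panic sysctl at runtime is not seen. The kernel documents 0 as "wait forever" and negative values as "reboot immediately".

```python
import re


def panic_timeout(cmdline: str):
    """Return the last panic=N value on a kernel command line, or None."""
    # Match panic=N only as a whole, whitespace-delimited parameter.
    values = re.findall(r"(?:^|\s)panic=(-?\d+)(?=\s|$)", cmdline)
    return int(values[-1]) if values else None


def describe(cmdline: str) -> str:
    """Describe what happens after a panic for this command line."""
    timeout = panic_timeout(cmdline)
    if timeout is None or timeout == 0:
        return "wait forever"          # nobody resets the box but a human
    if timeout < 0:
        return "reboot immediately"
    return f"reboot after {timeout} s"


if __name__ == "__main__":
    with open("/proc/cmdline") as f:
        print(describe(f.read()))
```

If this reports "wait forever" on a node that relies on the hardware watchdog's NMI path and has no working kdump, that would match the "reset but did not come back" behaviour described above.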

Regards,
Ulrich

> Going through some possible kernel‑paths at least leaves
> a bad taste. Doesn't mean you will have issues though.
> Just something where testing won't give you an easy
> answer. It may well depend heavily on the hardware
> you are running on.
> As long as there are better possibilities one should at
> least consider them. I remember having defaulted to
> softdog on a pre‑configured product‑installer with the
> documentation stating that softdog has its shortcomings
> and advice to configure something else if available,
> you know what you are doing and you have tested it
> (testing if a hardware watchdog actually fires is
> easy while it is nearly impossible to test‑verify
> if softdog is really reliable enough).
> 
> Klaus
>>
>> We had no issues with fencing, but we got plenty of SAN issues to test
>> the fencing :)
>>
>> Best Regards,
>> Strahil Nikolov
>>
> 