[ClusterLabs] Antw: Re: SLES11 SP4:SBD fencing problem with Xen (NMI not handled)?
Klaus Wenninger
kwenning at redhat.com
Mon Jul 30 07:25:33 EDT 2018
On 07/30/2018 12:24 PM, Ulrich Windl wrote:
>>>> Edwin Török <edvin.torok at citrix.com> wrote on 30.07.2018 at 11:20 in
> message <44d67d56-d7a7-3af3-64ef-4f24ed0aba6e at citrix.com>:
>> On 30/07/18 08:24, Ulrich Windl wrote:
>>> Hi!
>>>
>>> We have a strange problem on one cluster node running Xen PV VMs (SLES11
>>> SP4): After updating the kernel and adding new SBD devices (to replace an
>>> old storage system), the system just seems to freeze.
>>
>> Hi,
>>
>> Which version of Xen are you using and what Linux distribution is run in
>> Dom0?
> As the subject says: SLES11 SP4
>
>>> Closer inspection showed that SBD seems to send an NMI (for reasons still
>>> to be examined), and the current Xen/kernel seems to be unable to handle
>>> the NMI in a way that forces a restart of the server (see attached screen
>>> shot).
>> Can you show us your kernel boot cmdline, and loaded modules?
>> Which watchdog module did you load? Have you tried xen_wdt?
>> See https://www.suse.com/support/kb/doc/?id=7016880
> The server is an HP DL380 G7, so the watchdog is hpwdt. The basic kernel
> options are simply "earlyprintk=xen nomodeset", and Xen has
> "dom0_mem=4096M,max:8192M".
>
> In the meantime I found out that if I disable the sbd watchdog (-W -W, who did
You can use the config-file if you don't like the parameters.
And sbd without a watchdog isn't going to be very useful, so
it actually shouldn't matter how ugly disabling it is ;-)
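On SLES that would typically be /etc/sysconfig/sbd. A minimal sketch
(device paths taken from your log below; exact variable names may differ
between sbd versions, so treat this as an illustration, not a verified
SLES11 sample):

  # /etc/sysconfig/sbd -- sketch only
  # semicolon-separated list of the shared SBD devices
  SBD_DEVICE="/dev/disk/by-id/dm-name-SBD_1-E3;/dev/disk/by-id/dm-name-SBD_1-3P1;/dev/disk/by-id/dm-name-SBD_1-3P2"
  # extra daemon options; a single -W enables the watchdog,
  # a second -W toggles it off again (hence the "-W -W" you found)
  SBD_OPTS="-W"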
> write such terrible code?) the NMI is not sent. I suspect a problem with sbd
Looks more like the watchdog device used isn't suitable
for the environment it is run in.
I haven't played with sbd on Xen, but hpwdt is a watchdog
driver for HP hardware that is probably not going to
appear as a paravirtualized device in Xen VMs. It might be
available in dom0, but as far as I understand the setup
you need one inside every VM.
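To see what is actually there, you could run something like this both in
dom0 and inside a VM (plain standard commands, nothing sbd-specific):

  lsmod | grep -i -e wdt -e watchdog   # which watchdog driver is loaded
  ls -l /dev/watchdog                  # does the device node exist at all
  dmesg | grep -i watchdog             # what the driver reported at load time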
At least for test purposes you could also try softdog, which
doesn't touch any hardware at all, whether physical, paravirtualized,
or physical but not well enough hidden from the guests.
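Roughly like this (untested on SLES11, so just a sketch):

  rmmod hpwdt        # drop the HP hardware watchdog
  modprobe softdog   # pure software watchdog, works anywhere
  # then let sbd open /dev/watchdog as before and watch the logs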
Regards,
Klaus
> using three devices (before the change we only had two), because on startup it
> says three times that it's starting the first servant ("sbd: [5904]: info: First
> servant start - zeroing inbox", "sbd: [5903]: info: First servant start -
> zeroing inbox", "sbd: [5901]: info: First servant start - zeroing inbox")...
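One "First servant" line per configured device is expected; note the three
different PIDs, presumably one servant process per disk. If you want to
verify the devices themselves, something like this should do (a sketch
using your paths):

  for d in /dev/disk/by-id/dm-name-SBD_1-E3 \
           /dev/disk/by-id/dm-name-SBD_1-3P1 \
           /dev/disk/by-id/dm-name-SBD_1-3P2; do
      sbd -d "$d" dump   # print the on-disk header and timeouts
      sbd -d "$d" list   # show slot allocations and pending messages
  done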
>
> The other thing is that a latency of 7s is reported (which I doubt very
> much):
> Jul 30 11:37:27 h01 sbd: [5901]: info: Latency: 7 on disk
> /dev/disk/by-id/dm-name-SBD_1-E3
> Jul 30 11:37:27 h01 sbd: [5904]: info: Latency: 7 on disk
> /dev/disk/by-id/dm-name-SBD_1-3P2
> Jul 30 11:37:27 h01 sbd: [5903]: info: Latency: 7 on disk
> /dev/disk/by-id/dm-name-SBD_1-3P1
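If you want a second opinion on that latency, timing an uncached read of
the first sectors of one device gives a rough lower bound (just a sketch,
not what sbd measures internally):

  time dd if=/dev/disk/by-id/dm-name-SBD_1-E3 of=/dev/null \
          bs=4k count=1 iflag=direct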
>
> Regards,
> Ulrich
>
>> Best regards,
>> --Edwin
>>
>>> The last message I see in the node's cluster log is this:
>>> Jul 27 11:33:32 [15731] h01 cib: info:
>>> cib_file_write_with_digest: Reading cluster configuration file
>>> /var/lib/pacemaker/cib/cib.YESngs (digest:
>>> /var/lib/pacemaker/cib/cib.Yutv8O)
>>> Other nodes have these messages:
>>> Jul 27 11:33:32 h05 dlm_controld.pcmk[15810]: dlm_process_node: Skipped
>>> active node 739512330: born-on=3864, last-seen=3936, this-event=3936,
>>> last-event=3932
>>> Jul 27 11:33:32 h10 dlm_controld.pcmk[20397]: dlm_process_node: Skipped
>>> active node 739512325: born-on=3856, last-seen=3936, this-event=3936,
>>> last-event=3932
>>> Can anybody shed some light on this issue?
>>> 1) Under what circumstances is an NMI sent by SBD?
>>> 2) What is the reaction expected after receiving an NMI?
>>> 3) If it did work before, what could have gone wrong?
>>>
>>> I wanted to get some feedback from here before asking SLES support...
>>>
>>> Regards,
>>> Ulrich
>>>
--
Klaus Wenninger
Senior Software Engineer, EMEA ENG Base Operating Systems
Red Hat
kwenning at redhat.com