[ClusterLabs] Antw: Re: SLES11 SP4:SBD fencing problem with Xen (NMI not handled)?

Klaus Wenninger kwenning at redhat.com
Mon Jul 30 11:25:33 UTC 2018


On 07/30/2018 12:24 PM, Ulrich Windl wrote:
>>>> Edwin Török <edvin.torok at citrix.com> wrote on 2018-07-30 at 11:20 in
> message <44d67d56-d7a7-3af3-64ef-4f24ed0aba6e at citrix.com>:
>> On 30/07/18 08:24, Ulrich Windl wrote:
>>> Hi!
>>>
>>> We have a strange problem on one cluster node running Xen PV VMs (SLES11
>>> SP4): after updating the kernel and adding new SBD devices (to replace an
>>> old storage system), the system just seems to freeze.
>>
>> Hi,
>>
>> Which version of Xen are you using, and which Linux distribution runs in
>> Dom0?
> As the subject says: SLES11 SP4
>
>>> Closer inspection showed that SBD seems to send an NMI (for reasons still
>>> to be examined), and the current Xen/kernel seems to be unable to handle
>>> the NMI in a way that forces a restart of the server (see attached
>>> screenshot).
>> Can you show us your kernel boot cmdline, and loaded modules?
>> Which watchdog module did you load? Have you tried xen_wdt?
>> See https://www.suse.com/support/kb/doc/?id=7016880 
> The server is an HP DL380 G7, so the watchdog is hpwdt. The basic kernel
> options are simply "earlyprintk=xen nomodeset", and Xen has
> "dom0_mem=4096M,max:8192M".
>
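For reference, both pieces of information Edwin asked for can be collected
with standard commands (nothing sbd-specific here):

  # kernel command line actually in use
  cat /proc/cmdline
  # loaded watchdog-related modules (hpwdt, softdog, xen_wdt, ...)
  lsmod | grep -i -e wdt -e watchdog
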
> In the meantime I found out that if I disable the sbd watchdog (-W -W, who did

You can use the config file if you don't like the parameters.
And sbd without a watchdog isn't going to be very useful, so
it really shouldn't matter how ugly disabling it is ;-)
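
On SLES the file would be /etc/sysconfig/sbd; a minimal sketch (variable
names from memory, so double-check them against your sbd version):

  # /etc/sysconfig/sbd -- sketch, not verified on SLES11 SP4
  SBD_DEVICE="/dev/disk/by-id/dm-name-SBD_1-E3;/dev/disk/by-id/dm-name-SBD_1-3P1;/dev/disk/by-id/dm-name-SBD_1-3P2"
  SBD_OPTS="-W"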

> write such terrible code?) the NMI is not sent. I suspect a problem with sbd

This looks more like the watchdog device used isn't suitable
for the environment it is run in.
I haven't played with sbd on Xen, but hpwdt is a watchdog
driver for HP hardware that is probably not going to show
up as a paravirtualized device in Xen VMs. It might be
available in dom0, but as far as I understand the setup you
need one inside every VM.
At least for test purposes you could try softdog, which
doesn't touch any physical or paravirtualized hardware (or
physical hardware that isn't hidden well enough from the
guests).
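
Untested sketch of the swap (stop sbd first, since unloading a watchdog
module that is in use will fail):

  # replace the HP iLO watchdog with the software watchdog
  modprobe -r hpwdt
  modprobe softdog
  # verify a watchdog device is present before restarting sbd
  ls -l /dev/watchdog*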

Regards,
Klaus

> using three devices (before the change we only had two), because on startup
> it says three times that it's starting the first servant ("sbd: [5904]:
> info: First servant start - zeroing inbox", "sbd: [5903]: info: First
> servant start - zeroing inbox", "sbd: [5901]: info: First servant start -
> zeroing inbox")...
>
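Three servants are expected here: sbd starts one servant process per disk,
and each of them logs its own "First servant start" line, so that by itself
looks normal. The headers and timeouts of the new devices can be inspected
with sbd's dump command (device paths taken from the log below):

  sbd -d /dev/disk/by-id/dm-name-SBD_1-E3 dump
  sbd -d /dev/disk/by-id/dm-name-SBD_1-3P1 dump
  sbd -d /dev/disk/by-id/dm-name-SBD_1-3P2 dump
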
> The other thing is that a latency of 7s is reported (which I doubt very
> much):
> Jul 30 11:37:27 h01 sbd: [5901]: info: Latency: 7 on disk
> /dev/disk/by-id/dm-name-SBD_1-E3
> Jul 30 11:37:27 h01 sbd: [5904]: info: Latency: 7 on disk
> /dev/disk/by-id/dm-name-SBD_1-3P2
> Jul 30 11:37:27 h01 sbd: [5903]: info: Latency: 7 on disk
> /dev/disk/by-id/dm-name-SBD_1-3P1
>
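The reported latency could be cross-checked with a simple timed direct read
against one of the devices (a rough sketch, not what sbd does internally):

  time dd if=/dev/disk/by-id/dm-name-SBD_1-E3 of=/dev/null \
      bs=4k count=1 iflag=direct
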
> Regards,
> Ulrich
>
>> Best regards,
>> --Edwin
>>
>>> The last message I see in the node's cluster log is this:
>>> Jul 27 11:33:32 [15731] h01        cib:     info:
>>> cib_file_write_with_digest:      Reading cluster configuration file
>>> /var/lib/pacemaker/cib/cib.YESngs (digest: /var/lib/pacemaker/cib/cib.Yutv8O)
>>> Other nodes have these messages:
>>> Jul 27 11:33:32 h05 dlm_controld.pcmk[15810]: dlm_process_node: Skipped
>>> active node 739512330: born-on=3864, last-seen=3936, this-event=3936,
>>> last-event=3932
>>> Jul 27 11:33:32 h10 dlm_controld.pcmk[20397]: dlm_process_node: Skipped
>>> active node 739512325: born-on=3856, last-seen=3936, this-event=3936,
>>> last-event=3932
>>> Can anybody shed some light on this issue?
>>> 1) Under what circumstances is an NMI sent by SBD?
>>> 2) What is the reaction expected after receiving an NMI?
>>> 3) If it did work before, what could have gone wrong?
>>>
>>> I wanted to get some feedback from here before asking SLES support...
>>>
>>> Regards,
>>> Ulrich


-- 
Klaus Wenninger

Senior Software Engineer, EMEA ENG Base Operating Systems

Red Hat

kwenning at redhat.com   


