[ClusterLabs] Antw: Re: SLES11 SP4:SBD fencing problem with Xen (NMI not handled)?
Ulrich Windl
Ulrich.Windl at rz.uni-regensburg.de
Mon Jul 30 06:24:31 EDT 2018
>>> Edwin Török <edvin.torok at citrix.com> wrote on 30.07.2018 at 11:20 in
message <44d67d56-d7a7-3af3-64ef-4f24ed0aba6e at citrix.com>:
> On 30/07/18 08:24, Ulrich Windl wrote:
>> Hi!
>>
>> We have a strange problem on one cluster node running Xen PV VMs (SLES11
>> SP4): After updating the kernel and adding new SBD devices (to replace an
>> old storage system), the system just seems to freeze.
>
> Hi,
>
> Which version of Xen are you using and what Linux distribution is run in
> Dom0?
As the subject says: SLES11 SP4
>
>> Closer inspection showed that SBD seems to send an NMI (for reasons still
>> to be examined), and the current Xen/kernel seems to be unable to handle the
>> NMI in a way that forces a restart of the server (see attached screen shot).
>
> Can you show us your kernel boot cmdline, and loaded modules?
> Which watchdog module did you load? Have you tried xen_wdt?
> See https://www.suse.com/support/kb/doc/?id=7016880
The server is an HP DL380 G7, so the watchdog is hpwdt. The basic kernel options
are simply "earlyprintk=xen nomodeset", and xen has
"dom0_mem=4096M,max:8192M".
In the meantime I found out that if I disable the sbd watchdog (-W -W, who
wrote such terrible code?) the NMI is not sent. I suspect a problem with sbd
using three devices (before the change we only had two), because on startup it
says three times that it's starting the first servant ("sbd: [5904]: info: First
servant start - zeroing inbox", "sbd: [5903]: info: First servant start -
zeroing inbox", "sbd: [5901]: info: First servant start - zeroing inbox")...
The other thing is that a latency of 7s is reported (which I doubt very
much):
Jul 30 11:37:27 h01 sbd: [5901]: info: Latency: 7 on disk /dev/disk/by-id/dm-name-SBD_1-E3
Jul 30 11:37:27 h01 sbd: [5904]: info: Latency: 7 on disk /dev/disk/by-id/dm-name-SBD_1-3P2
Jul 30 11:37:27 h01 sbd: [5903]: info: Latency: 7 on disk /dev/disk/by-id/dm-name-SBD_1-3P1
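One way to cross-check the devices directly (a rough sketch; the dd line is
just a crude read-latency probe, not what sbd measures internally):

    # dump the on-disk header and configured timeouts
    sbd -d /dev/disk/by-id/dm-name-SBD_1-E3 dump
    # show the message slots of the nodes
    sbd -d /dev/disk/by-id/dm-name-SBD_1-E3 list
    # time a direct read of the first sectors
    time dd if=/dev/disk/by-id/dm-name-SBD_1-E3 of=/dev/null bs=4k count=1 iflag=direct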
Regards,
Ulrich
>
> Best regards,
> --Edwin
>
>>
>> The last message I see in the node's cluster log is this:
>> Jul 27 11:33:32 [15731] h01 cib: info: cib_file_write_with_digest: Reading cluster configuration file /var/lib/pacemaker/cib/cib.YESngs (digest: /var/lib/pacemaker/cib/cib.Yutv8O)
>>
>> Other nodes have these messages:
>> Jul 27 11:33:32 h05 dlm_controld.pcmk[15810]: dlm_process_node: Skipped active node 739512330: born-on=3864, last-seen=3936, this-event=3936, last-event=3932
>>
>> Jul 27 11:33:32 h10 dlm_controld.pcmk[20397]: dlm_process_node: Skipped active node 739512325: born-on=3856, last-seen=3936, this-event=3936, last-event=3932
>>
>> Can anybody shed some light on this issue?
>> 1) Under what circumstances is an NMI sent by SBD?
>> 2) What is the reaction expected after receiving an NMI?
>> 3) If it did work before, what could have gone wrong?
>>
>> I wanted to get some feedback from here before asking SLES support...
>>
>> Regards,
>> Ulrich
>>