[ClusterLabs] Antw: Re: SLES11 SP4: SBD fencing problem with Xen (NMI not handled)?

Ulrich Windl Ulrich.Windl at rz.uni-regensburg.de
Mon Jul 30 06:24:31 EDT 2018


>>> Edwin Török <edvin.torok at citrix.com> wrote on 30.07.2018 at 11:20 in
message <44d67d56-d7a7-3af3-64ef-4f24ed0aba6e at citrix.com>:
> On 30/07/18 08:24, Ulrich Windl wrote:
>> Hi!
>> 
>> We have a strange problem on one cluster node running Xen PV VMs (SLES11
>> SP4): After updating the kernel and adding new SBD devices (to replace an
>> old storage system), the system just seems to freeze.
> 
> Hi,
> 
> Which version of Xen are you using and what Linux distribution is run in
> Dom0?

As the subject says: SLES11 SP4

> 
>> Closer inspection showed that SBD seems to send an NMI (for reasons still
>> to be examined), and the current Xen/kernel seems to be unable to handle the
>> NMI in a way that forces a restart of the server (see attached screenshot).
> 
> Can you show us your kernel boot cmdline, and loaded modules?
> Which watchdog module did you load? Have you tried xen_wdt?
> See https://www.suse.com/support/kb/doc/?id=7016880 

The server is an HP DL380 G7, so the watchdog is hpwdt. The basic kernel
options are simply "earlyprintk=xen nomodeset", and Xen has
"dom0_mem=4096M,max:8192M".

In the meantime I found out that if I disable the sbd watchdog (-W -W; who
wrote such terrible code?) the NMI is not sent. I suspect a problem with sbd
using three devices (before the change we only had two), because on startup it
says three times that it's starting the first servant:
"sbd: [5904]: info: First servant start - zeroing inbox"
"sbd: [5903]: info: First servant start - zeroing inbox"
"sbd: [5901]: info: First servant start - zeroing inbox"

The other thing is that a latency of 7s is reported (which I doubt very
much):
Jul 30 11:37:27 h01 sbd: [5901]: info: Latency: 7 on disk /dev/disk/by-id/dm-name-SBD_1-E3
Jul 30 11:37:27 h01 sbd: [5904]: info: Latency: 7 on disk /dev/disk/by-id/dm-name-SBD_1-3P2
Jul 30 11:37:27 h01 sbd: [5903]: info: Latency: 7 on disk /dev/disk/by-id/dm-name-SBD_1-3P1

Regards,
Ulrich

> 
> Best regards,
> --Edwin
> 
>> 
>> The last message I see in the node's cluster log is this:
>> Jul 27 11:33:32 [15731] h01        cib:     info: cib_file_write_with_digest:
>> Reading cluster configuration file /var/lib/pacemaker/cib/cib.YESngs
>> (digest: /var/lib/pacemaker/cib/cib.Yutv8O)
>> 
>> Other nodes have these messages:
>> Jul 27 11:33:32 h05 dlm_controld.pcmk[15810]: dlm_process_node: Skipped
>> active node 739512330: born-on=3864, last-seen=3936, this-event=3936,
>> last-event=3932
>> 
>> Jul 27 11:33:32 h10 dlm_controld.pcmk[20397]: dlm_process_node: Skipped
>> active node 739512325: born-on=3856, last-seen=3936, this-event=3936,
>> last-event=3932
>> 
>> Can anybody shed some light on this issue?
>> 1) Under what circumstances is an NMI sent by SBD?
>> 2) What is the reaction expected after receiving an NMI?
>> 3) If it did work before, what could have gone wrong?
>> 
>> I wanted to get some feedback from here before asking SLES support...
>> 
>> Regards,
>> Ulrich
>> 
> _______________________________________________
> Users mailing list: Users at clusterlabs.org 
> https://lists.clusterlabs.org/mailman/listinfo/users 
> 
> Project Home: http://www.clusterlabs.org 
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf 
> Bugs: http://bugs.clusterlabs.org 





