[ClusterLabs] Antw: [EXT] Re: Antw: Instable SLES15 SP3 kernel

Ulrich Windl Ulrich.Windl at rz.uni-regensburg.de
Thu Apr 28 01:44:10 EDT 2022


>>> "Gao,Yan" <ygao at suse.com> schrieb am 27.04.2022 um 14:31 in Nachricht
<90862536-bbfb-f8b3-1a80-d8e9c1022293 at suse.com>:
> Hi Ulrich,
> 
> On 2022/4/27 11:13, Ulrich Windl wrote:
>> Update for the Update:
>> 
>> I had installed SLES updates in one VM and rebooted it via the cluster.
>> While installing the updates in the VM, the Xen host got RAM corruption
>> (it seems any disk I/O on the host, either locally or via a VM image,
>> causes RAM corruption):
> 
> I totally understand your frustration about this, but I don't really see 
> how the potential kernel issue is relevant to this mailing list.

Well, you use an HA solution based on pacemaker, and that solution fails miserably.
I guess users don't want to have the same experience while relying on their services to keep running.
The other thing is a kind of "product warning": don't use SLES15 SP3 with Xen and a cluster right now if you really want HA.
That was my idea.

I understand that SUSE does not like to see such messages in public, but maybe "24x7" support should try a bit harder to solve the issue, or at least provide a work-around, than they have so far.

> 
> I believe SUSE support has been working on it and trying to address it, and 
> they will update you once there's further progress.

Well, it's been more than two months since I reported it... No need to say more.

> 
> About the topics related to the cluster, please find my comments below.

OK.

> 
>> 
>> Apr 27 10:56:44 h19 kernel: pacemaker-execd[39797]: segfault at 3a46 ip
>> 0000000000003a46 sp 00007ffd1c92e8e8 error 14 in
>> pacemaker-execd[5565921cc000+b000]
>> 
>> Fortunately that wasn't fatal, and my rescue script kicked in before things
>> got really bad:
>> Apr 27 11:00:01 h19 reboot-before-panic[40630]: RAM corruption detected,
>> starting pro-active reboot
>> 
>> All VMs could be live-migrated away before reboot, but this SLES release is
>> completely unusable!
>> 
>> Regards,
>> Ulrich
>> 
>> 
>> 
>>>>> Ulrich Windl wrote on 27.04.2022 at 08:02 in message <6268DC91.C1D : 161 : 60728>:
>>> Hi!
>>>
>>> I want to give a non-update on the issue:
>>> The kernel still segfaults random processes, and in two months support has
>>> provided nothing that could improve the situation.
>>> The cluster is logging all kinds of non-funny messages like these:
>>>
>>> Apr 27 02:20:49 h18 systemd-coredump[22319]: [] Process 22317 (controld)
>>> of user 0 dumped core.
>>> Apr 27 02:20:49 h18 kernel: BUG: Bad rss-counter state mm:00000000246ea08b
>>> idx:1 val:3
>>> Apr 27 02:20:49 h18 kernel: BUG: Bad rss-counter state mm:00000000259b58a0
>>> idx:1 val:7
>>> Apr 27 02:20:49 h18 controld(prm_DLM)[22330]: ERROR: Uncontrolled lockspace
>>> exists, system must reboot. Executing suicide fencing
>>>
>>> For a hypervisor host this means that many VMs are reset the hard way!
>>> Other resources weren't stopped properly either, of course.
>>>
>>>
>>> There are also two NULL-pointer outputs in the messages on the DC:
>>> Apr 27 02:21:06 h16 dlm_stonith[39797]: stonith_api_time: Found 18 entries
>>> for 118/(null): 0 in progress, 17 completed
>>> Apr 27 02:21:06 h16 dlm_stonith[39797]: stonith_api_time: Node 118/(null)
>>> last kicked at: 1650418762
>>>
>>> I guess that NULL pointer should have been the host name (h18) in reality.
> 
> It's expected to be NULL here. DLM requests fencing through pacemaker's 
> stonith API, targeting a node by its corosync nodeid (118 here), which is 
> what DLM knows, rather than by the node name. Pacemaker does the 
> translation and eventually issues the fencing.
> 
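Thanks, that explains the "(null)". For the archive, here is roughly how I
picture such a helper -- a minimal sketch only, assuming pacemaker's public
helpers stonith_api_time()/stonith_api_kick() from <crm/stonith-ng.h> (my
assumption, not taken from the dlm sources); the nodeid 118 is from the log
above, everything else is made up:

/*
 * Minimal sketch: fencing "by nodeid only", the way a DLM fencing helper
 * asks for it.  Building would need something like
 *   gcc fence_by_nodeid.c $(pkg-config --cflags --libs pacemaker-fencing)
 * (library/pkg-config name is my assumption).
 */
#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>
#include <time.h>
#include <crm/stonith-ng.h>

int main(void)
{
    uint32_t nodeid = 118;   /* corosync nodeid -- all DLM knows about the peer */

    /* Ask when this node was last fenced.  The node *name* is passed as
     * NULL on purpose, which is exactly why the log prints "118/(null)". */
    time_t last = stonith_api_time(nodeid, NULL, false);
    printf("Node %u last kicked at: %lld\n", nodeid, (long long) last);

    /* Request a reboot ("kick") of the node, again only by nodeid; pacemaker
     * maps the nodeid to a node name and chooses a fencing device. */
    int rc = stonith_api_kick(nodeid, NULL, 120 /* timeout, s */, false /* reboot */);
    return (rc == 0) ? 0 : 1;
}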
>>>
>>> Also it seems h18 fenced itself, and the DC h16, seeing that, wants to
>>> fence it again (to make sure, maybe), but there is some odd problem:
>>>
>>> Apr 27 02:21:07 h16 pacemaker-controld[7453]:  notice: Requesting fencing
>>> (reboot) of node h18
>>> Apr 27 02:21:07 h16 pacemaker-fenced[7443]:  notice: Client
>>> pacemaker-controld.7453.a9d67c8b wants to fence (reboot) 'h18' with device
>>> '(any)'
>>> Apr 27 02:21:07 h16 pacemaker-fenced[7443]:  notice: Merging stonith action
>>> 'reboot' targeting h18 originating from client
>>> pacemaker-controld.7453.73d8bbd6 with identical request from
>>> stonith-api.39797 at h16.ea22f429 (360>
> 
> This is also expected when DLM is used. Despite the fencing previously 
> requested proactively by DLM, pacemaker has its own reason to issue a 
> fencing targeting the node. The fenced daemon is aware that there is 
> already a pending/ongoing fencing targeting the same node, so it doesn't 
> need to issue it once again.
> 
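OK, good to know it is handled. For my own understanding (and for the
archive): so this is basically request deduplication in pacemaker-fenced. A
toy illustration of the idea only -- explicitly NOT pacemaker's actual code:

/* Toy illustration: why the fenced daemon can log "Merging stonith action
 * ... with identical request" instead of fencing h18 twice.  A new request
 * for the same target/action is attached to the already pending one, and
 * all requesters get the single result. */
#include <stdio.h>
#include <string.h>

struct pending_op {
    const char *target;   /* e.g. "h18" */
    const char *action;   /* e.g. "reboot" */
    int         clients;  /* how many requesters wait for this one result */
};

static struct pending_op pending[8];
static int n_pending;

/* Returns 1 if the request was merged into an existing operation. */
static int request_fencing(const char *target, const char *action)
{
    for (int i = 0; i < n_pending; i++) {
        if (strcmp(pending[i].target, target) == 0 &&
            strcmp(pending[i].action, action) == 0) {
            pending[i].clients++;
            printf("Merging %s of %s with identical pending request\n",
                   action, target);
            return 1;
        }
    }
    pending[n_pending++] = (struct pending_op){ target, action, 1 };
    printf("Starting new %s of %s\n", action, target);
    return 0;
}

int main(void)
{
    request_fencing("h18", "reboot");   /* e.g. from DLM via the stonith API */
    request_fencing("h18", "reboot");   /* e.g. from pacemaker-controld      */
    return 0;
}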
>>>
>>> Apr 27 02:22:52 h16 pacemaker-fenced[7443]:  warning: fence_legacy_reboot_1
>>> process (PID 39749) timed out
>>> Apr 27 02:22:52 h16 pacemaker-fenced[7443]:  warning:
>>> fence_legacy_reboot_1[39749] timed out after 120000ms
>>> Apr 27 02:22:52 h16 pacemaker-fenced[7443]:  error: Operation 'reboot'
>>> [39749] (call 2 from stonith_admin.controld.22336) for host 'h18' with
>>> device 'prm_stonith_sbd' returned: -62 (Timer expired)
> 
> Please make sure:
> stonith-timeout > sbd_msgwait + pcmk_delay_max

Checked that; it's true.
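To spell the arithmetic out for the archive (all numbers below are made-up
examples, not taken from my CIB; the real values come from the
stonith-timeout cluster property, the "Timeout (msgwait)" field of the SBD
header, and pcmk_delay_max on the fencing resource):

/* Toy check of the rule above with assumed example values. */
#include <stdio.h>

int main(void)
{
    int stonith_timeout = 120;  /* s, assumed -- cf. the 120000ms timeout in the log */
    int sbd_msgwait     = 120;  /* s, assumed */
    int pcmk_delay_max  = 30;   /* s, assumed */

    if (stonith_timeout > sbd_msgwait + pcmk_delay_max) {
        printf("OK: %ds > %ds + %ds\n",
               stonith_timeout, sbd_msgwait, pcmk_delay_max);
    } else {
        /* The reboot request can expire before sbd has confirmed the
         * poison pill. */
        printf("Too small: stonith-timeout must exceed %ds\n",
               sbd_msgwait + pcmk_delay_max);
    }
    return 0;
}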

> 
> If that was already the case, sbd was probably encountering difficulties 
> writing the poison pill at that time ...

Yes, as I said before: with that kernel, most HA mechanisms just fail.


Regards,
Ulrich

> 
> Regards,
>    Yan
> 
>>>
>>> I never saw such a message before. Eventually:
>>>
>>> Apr 27 02:24:53 h16 pacemaker-controld[7453]:  notice: Stonith operation
>>> 31/1:3347:0:48bafcab-fecf-4ea0-84a8-c31ab1694b3a: OK (0)
>>> Apr 27 02:24:53 h16 pacemaker-controld[7453]:  notice: Peer h18 was
>>> terminated (reboot) by h16 on behalf of pacemaker-controld.7453: OK
>>>
>>> The only thing I found out was that the kernel running without Xen does not
>>> show RAM corruption.
>>>
>>> Regards,
>>> Ulrich
>>>




