[ClusterLabs] Antw: Instable SLES15 SP3 kernel

Gao,Yan ygao at suse.com
Wed Apr 27 08:31:59 EDT 2022


Hi Ulrich,

On 2022/4/27 11:13, Ulrich Windl wrote:
> Update for the Update:
> 
> I had installed SLES updates in one VM and rebooted it via the cluster. While
> installing the updates in the VM, the Xen host got RAM corruption (it seems any
> disk I/O on the host, either locally or via a VM image, causes RAM corruption):

I totally understand your frustration with this, but I don't really see 
how the potential kernel issue is relevant to this mailing list.

I believe SUSE support has been working on it and trying to address it, 
and they will update you once there's further progress.

Regarding the cluster-related topics, please find my comments below.

> 
> Apr 27 10:56:44 h19 kernel: pacemaker-execd[39797]: segfault at 3a46 ip
> 0000000000003a46 sp 00007ffd1c92e8e8 error 14 in
> pacemaker-execd[5565921cc000+b000]
> 
> Fortunately that wasn't fatal and my rescue script kicked in before things got
> really bad:
> Apr 27 11:00:01 h19 reboot-before-panic[40630]: RAM corruption detected,
> starting pro-active reboot
> 
> All VMs could be live-migrated away before reboot, but this SLES release is
> completely unusable!
> 
> Regards,
> Ulrich
> 
> 
> 
>>>> Ulrich Windl wrote on 27.04.2022 at 08:02 in message <6268DC91.C1D : 161 : 60728>:
>> Hi!
>>
>> I want to give a non-update on the issue:
>> The kernel still segfaults random processes, and there is really nothing
>> from support within two months that could help improve the situation.
>> The cluster is logging all kinds of non-funny messages like these:
>>
>> Apr 27 02:20:49 h18 systemd-coredump[22319]: Process 22317 (controld)
>> of user 0 dumped core.
>> Apr 27 02:20:49 h18 kernel: BUG: Bad rss-counter state mm:00000000246ea08b
>> idx:1 val:3
>> Apr 27 02:20:49 h18 kernel: BUG: Bad rss-counter state mm:00000000259b58a0
>> idx:1 val:7
>> Apr 27 02:20:49 h18 controld(prm_DLM)[22330]: ERROR: Uncontrolled lockspace
>> exists, system must reboot. Executing suicide fencing
>>
>> For a hypervisor host this means that many VMs are reset the hard way!
>> Other resources weren't stopped properly either, of course.
>>
>>
>> There are also two NULL-pointer outputs in the messages on the DC:
>> Apr 27 02:21:06 h16 dlm_stonith[39797]: stonith_api_time: Found 18 entries
>> for 118/(null): 0 in progress, 17 completed
>> Apr 27 02:21:06 h16 dlm_stonith[39797]: stonith_api_time: Node 118/(null)
>> last kicked at: 1650418762
>>
>> I guess that NULL pointer should have been the host name (h18) in reality.

It's expected to be NULL here. DLM requests fencing through Pacemaker's 
stonith API, targeting the node by its corosync nodeid (118 here), which 
is what DLM knows, rather than by the node name. Pacemaker does the 
interpretation and eventually issues the fencing.
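
For illustration only, here is a minimal sketch of how such a DLM-style 
helper can be written against Pacemaker's public fencing API (this is an 
assumption for illustration, not the actual dlm_stonith source; it assumes 
the stonith_api_time()/stonith_api_kick() helpers declared in 
<crm/stonith-ng.h>, linked with libstonithd, and exact include paths and 
flag semantics may vary between Pacemaker versions). The node is targeted 
purely by corosync nodeid and the name argument is NULL, which is exactly 
what shows up as "(null)" in the stonith_api_time log lines above:

#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>
#include <time.h>
#include <crm/stonith-ng.h>   /* stonith_api_time(), stonith_api_kick() */

int main(void)
{
    uint32_t nodeid = 118;    /* corosync nodeid, as in the log above */

    /* Ask when this nodeid was last fenced; the node name is deliberately
     * NULL, which is what appears as "118/(null)" in the log. */
    time_t last = stonith_api_time(nodeid, NULL, false);
    printf("Node %u last kicked at: %lld\n", nodeid, (long long) last);

    /* Request fencing ("kick") of the node, again by nodeid only; Pacemaker
     * resolves the nodeid to a node name and carries out the fencing.
     * The meaning of the final flag (off vs. reboot) should be checked
     * against <crm/stonith-ng.h> on your system. */
    int rc = stonith_api_kick(nodeid, NULL, 120 /* timeout in s */, false);
    if (rc != 0) {
        fprintf(stderr, "Fencing request failed: %d\n", rc);
    }
    return rc == 0 ? 0 : 1;
}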

>>
>> Also it seems h18 fenced itself, and the DC h16, seeing that, wants to fence
>> it again (to make sure, maybe), but there is some odd problem:
>>
>> Apr 27 02:21:07 h16 pacemaker-controld[7453]:  notice: Requesting fencing
>> (reboot) of node h18
>> Apr 27 02:21:07 h16 pacemaker-fenced[7443]:  notice: Client
>> pacemaker-controld.7453.a9d67c8b wants to fence (reboot) 'h18' with device
>> '(any)'
>> Apr 27 02:21:07 h16 pacemaker-fenced[7443]:  notice: Merging stonith action
>> 'reboot' targeting h18 originating from client
>> pacemaker-controld.7453.73d8bbd6 with identical request from
>> stonith-api.39797@h16.ea22f429 (360>

This is also expected when DLM is used. Besides the fencing previously 
requested proactively by DLM, Pacemaker has its own reasons to issue 
fencing targeting the node. The fenced daemon is aware that there is 
already a pending/ongoing fencing action targeting the same node, so it 
merges the requests instead of issuing the fencing once again.

>>
>> Apr 27 02:22:52 h16 pacemaker-fenced[7443]:  warning: fence_legacy_reboot_1
>> process (PID 39749) timed out
>> Apr 27 02:22:52 h16 pacemaker-fenced[7443]:  warning:
>> fence_legacy_reboot_1[39749] timed out after 120000ms
>> Apr 27 02:22:52 h16 pacemaker-fenced[7443]:  error: Operation 'reboot'
>> [39749] (call 2 from stonith_admin.controld.22336) for host 'h18' with device
>> 'prm_stonith_sbd' returned: -62 (Timer expired)

Please make sure:
stonith-timeout > sbd_msgwait + pcmk_delay_max

If that was already the case, sbd was probably encountering difficulties 
writing the poison pill at that time ...
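
To make the rule concrete, here is a trivial standalone sketch with 
hypothetical numbers (150s/120s/15s are made up for illustration, not 
taken from your cluster). On a real cluster, stonith-timeout is a cluster 
property, msgwait is the "Timeout (msgwait)" shown by "sbd -d <device> 
dump", and pcmk_delay_max is a parameter of the fencing resource:

#include <stdio.h>

int main(void)
{
    /* Hypothetical example values, all in seconds. */
    int stonith_timeout = 150;   /* cluster property "stonith-timeout"    */
    int sbd_msgwait     = 120;   /* "Timeout (msgwait)" of the SBD device */
    int pcmk_delay_max  = 15;    /* parameter on the fencing resource     */

    if (stonith_timeout > sbd_msgwait + pcmk_delay_max) {
        printf("OK: stonith-timeout (%ds) > msgwait (%ds) + pcmk_delay_max (%ds)\n",
               stonith_timeout, sbd_msgwait, pcmk_delay_max);
    } else {
        printf("Too short: stonith-timeout must be greater than %ds\n",
               sbd_msgwait + pcmk_delay_max);
    }
    return 0;
}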

Regards,
   Yan

>>
>> I never saw such a message before. Eventually:
>>
>> Apr 27 02:24:53 h16 pacemaker-controld[7453]:  notice: Stonith operation
>> 31/1:3347:0:48bafcab-fecf-4ea0-84a8-c31ab1694b3a: OK (0)
>> Apr 27 02:24:53 h16 pacemaker-controld[7453]:  notice: Peer h18 was
>> terminated (reboot) by h16 on behalf of pacemaker-controld.7453: OK
>>
>> The only thing I found out was that the kernel running without Xen does not
>> show RAM corruption.
>>
>> Regards,
>> Ulrich