[ClusterLabs] Re: Unstable SLES15 SP3 kernel
Ulrich Windl
Ulrich.Windl at rz.uni-regensburg.de
Wed Apr 27 05:13:30 EDT 2022
Update for the Update:
I had installed SLES updates in one VM and rebooted it via the cluster. While
the updates were being installed in the VM, the Xen host got RAM corruption (it
seems any disk I/O on the host, whether local or through a VM image, causes RAM
corruption):
Apr 27 10:56:44 h19 kernel: pacemaker-execd[39797]: segfault at 3a46 ip
0000000000003a46 sp 00007ffd1c92e8e8 error 14 in
pacemaker-execd[5565921cc000+b000]
Fortunately that wasn't fatal, and my rescue script kicked in before things got
really bad:
Apr 27 11:00:01 h19 reboot-before-panic[40630]: RAM corruption detected,
starting pro-active reboot
All VMs could be live-migrated away before the reboot, but this SLES release is
completely unusable!
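
For the curious, here is a minimal sketch of the idea behind that rescue
script (the journal patterns, the evacuation helper and its path are
illustrative assumptions, not the actual production script):

    #!/usr/bin/env python3
    # Sketch of a "reboot-before-panic" watchdog: follow the kernel log,
    # and when messages known to precede fatal RAM corruption show up,
    # evacuate the VMs and reboot pro-actively.
    # All names and patterns below are assumptions for illustration.
    import re
    import subprocess
    import sys

    # Kernel messages that (in my logs) accompany the corruption.
    PATTERNS = re.compile(r"Bad rss-counter state|segfault at")

    def main():
        # Follow kernel messages only, starting from "now".
        proc = subprocess.Popen(["journalctl", "-k", "-f", "-n", "0"],
                                stdout=subprocess.PIPE, text=True)
        for line in proc.stdout:
            if PATTERNS.search(line):
                print("RAM corruption detected, starting pro-active reboot",
                      file=sys.stderr)
                # Hypothetical helper that live-migrates all VMs away
                # (e.g. wrapping "xl migrate" per domain).
                subprocess.run(["/usr/local/sbin/evacuate-vms"], check=False)
                subprocess.run(["systemctl", "reboot"], check=False)
                return

    if __name__ == "__main__":
        main()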
Regards,
Ulrich
>>> Ulrich Windl wrote on 27.04.2022 at 08:02 in message <6268DC91.C1D:161:60728>:
> Hi!
>
> I want to give a non-update on the issue:
> The kernel still segfaults random processes, and in two months support has
> provided nothing that would improve the situation.
> The cluster is logging all kinds of non-funny messages like these:
>
> Apr 27 02:20:49 h18 systemd-coredump[22319]: [] Process 22317 (controld)
> of user 0 dumped core.
> Apr 27 02:20:49 h18 kernel: BUG: Bad rss-counter state mm:00000000246ea08b
> idx:1 val:3
> Apr 27 02:20:49 h18 kernel: BUG: Bad rss-counter state mm:00000000259b58a0
> idx:1 val:7
> Apr 27 02:20:49 h18 controld(prm_DLM)[22330]: ERROR: Uncontrolled lockspace
> exists, system must reboot. Executing suicide fencing
>
> For a hypervisor host this means that many VMs are reset the hard way!
> Other resources weren't stopped properly either, of course.
>
>
> There are also two NULL-pointer outputs in the messages on the DC:
> Apr 27 02:21:06 h16 dlm_stonith[39797]: stonith_api_time: Found 18 entries
> for 118/(null): 0 in progress, 17 completed
> Apr 27 02:21:06 h16 dlm_stonith[39797]: stonith_api_time: Node 118/(null)
> last kicked at: 1650418762
>
> I guess that NULL pointer should have been the host name (h18) in reality.
>
> Also, it seems h18 fenced itself, and the DC h16, seeing that, wants to fence
> it again (to make sure, maybe), but there is some odd problem:
>
> Apr 27 02:21:07 h16 pacemaker-controld[7453]: notice: Requesting fencing
> (reboot) of node h18
> Apr 27 02:21:07 h16 pacemaker-fenced[7443]: notice: Client
> pacemaker-controld.7453.a9d67c8b wants to fence (reboot) 'h18' with device
> '(any)'
> Apr 27 02:21:07 h16 pacemaker-fenced[7443]: notice: Merging stonith action
> 'reboot' targeting h18 originating from client
> pacemaker-controld.7453.73d8bbd6 with identical request from
> stonith-api.39797 at h16.ea22f429 (360>
>
> Apr 27 02:22:52 h16 pacemaker-fenced[7443]: warning: fence_legacy_reboot_1
> process (PID 39749) timed out
> Apr 27 02:22:52 h16 pacemaker-fenced[7443]: warning:
> fence_legacy_reboot_1[39749] timed out after 120000ms
> Apr 27 02:22:52 h16 pacemaker-fenced[7443]: error: Operation 'reboot'
> [39749] (call 2 from stonith_admin.controld.22336) for host 'h18' with
> device 'prm_stonith_sbd' returned: -62 (Timer expired)
>
> I never saw such a message before. Eventually:
>
> Apr 27 02:24:53 h16 pacemaker-controld[7453]: notice: Stonith operation
> 31/1:3347:0:48bafcab-fecf-4ea0-84a8-c31ab1694b3a: OK (0)
> Apr 27 02:24:53 h16 pacemaker-controld[7453]: notice: Peer h18 was
> terminated (reboot) by h16 on behalf of pacemaker-controld.7453: OK
>
> The only thing I found out was that the kernel does not show RAM corruption
> when running without Xen.
>
> Regards,
> Ulrich
>