[ClusterLabs] Re: Unstable SLES15 SP3 kernel
Ulrich Windl
Ulrich.Windl at rz.uni-regensburg.de
Wed Apr 27 05:13:30 EDT 2022
Update for the Update:
I had installed SLES updates in one VM and rebooted it via the cluster. While
the updates were being installed in the VM, the Xen host got RAM corruption (it
seems any disk I/O on the host, whether local or through a VM image, causes RAM
corruption):
Apr 27 10:56:44 h19 kernel: pacemaker-execd[39797]: segfault at 3a46 ip
0000000000003a46 sp 00007ffd1c92e8e8 error 14 in
pacemaker-execd[5565921cc000+b000]
Fortunately that wasn't fatal, and my rescue script kicked in before things got
really bad:
Apr 27 11:00:01 h19 reboot-before-panic[40630]: RAM corruption detected,
starting pro-active reboot
All VMs could be live-migrated away before the reboot, but this SLES release is
completely unusable!
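
For the curious, here is a minimal sketch of the idea behind that rescue
script (the journal patterns, the evacuation helper and its path are
illustrative assumptions, not the actual production script):

    #!/usr/bin/env python3
    # Sketch of a "reboot-before-panic" watchdog: follow the kernel log,
    # and when messages known to precede fatal RAM corruption show up,
    # evacuate the VMs and reboot pro-actively.
    # All names and patterns below are assumptions for illustration.
    import re
    import subprocess
    import sys

    # Kernel messages that (in my logs) accompany the corruption.
    PATTERNS = re.compile(r"Bad rss-counter state|segfault at")

    def main():
        # Follow kernel messages only, starting from "now".
        proc = subprocess.Popen(["journalctl", "-k", "-f", "-n", "0"],
                                stdout=subprocess.PIPE, text=True)
        for line in proc.stdout:
            if PATTERNS.search(line):
                print("RAM corruption detected, starting pro-active reboot",
                      file=sys.stderr)
                # Hypothetical helper that live-migrates all VMs away
                # (e.g. wrapping "xl migrate" per domain).
                subprocess.run(["/usr/local/sbin/evacuate-vms"], check=False)
                subprocess.run(["systemctl", "reboot"], check=False)
                return

    if __name__ == "__main__":
        main()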
Regards,
Ulrich
>>> Ulrich Windl wrote on 27.04.2022 at 08:02 in message <6268DC91.C1D:161:60728>:
> Hi!
>
> I want to give a non-update on the issue:
> The kernel still segfaults random processes, and in two months support has
> provided nothing that would improve the situation.
> The cluster is logging all kinds of non-funny messages like these:
>
> Apr 27 02:20:49 h18 systemd-coredump[22319]: [] Process 22317 (controld)
> of user 0 dumped core.
> Apr 27 02:20:49 h18 kernel: BUG: Bad rss-counter state mm:00000000246ea08b
> idx:1 val:3
> Apr 27 02:20:49 h18 kernel: BUG: Bad rss-counter state mm:00000000259b58a0
> idx:1 val:7
> Apr 27 02:20:49 h18 controld(prm_DLM)[22330]: ERROR: Uncontrolled lockspace
> exists, system must reboot. Executing suicide fencing
>
> For a hypervisor host this means that many VMs are reset the hard way!
> Other resources weren't stopped properly either, of course.
>
>
> There are also two NULL-pointer outputs in the messages on the DC:
> Apr 27 02:21:06 h16 dlm_stonith[39797]: stonith_api_time: Found 18 entries
> for 118/(null): 0 in progress, 17 completed
> Apr 27 02:21:06 h16 dlm_stonith[39797]: stonith_api_time: Node 118/(null)
> last kicked at: 1650418762
>
> I guess that NULL pointer should have been the host name (h18) in reality.
>
> Also, it seems h18 fenced itself, and the DC h16, seeing that, wants to fence
> it again (to make sure, maybe), but there is some odd problem:
>
> Apr 27 02:21:07 h16 pacemaker-controld[7453]: notice: Requesting fencing
> (reboot) of node h18
> Apr 27 02:21:07 h16 pacemaker-fenced[7443]: notice: Client
> pacemaker-controld.7453.a9d67c8b wants to fence (reboot) 'h18' with device
> '(any)'
> Apr 27 02:21:07 h16 pacemaker-fenced[7443]: notice: Merging stonith action
> 'reboot' targeting h18 originating from client
> pacemaker-controld.7453.73d8bbd6 with identical request from
> stonith-api.39797 at h16.ea22f429 (360>
>
> Apr 27 02:22:52 h16 pacemaker-fenced[7443]: warning: fence_legacy_reboot_1
> process (PID 39749) timed out
> Apr 27 02:22:52 h16 pacemaker-fenced[7443]: warning:
> fence_legacy_reboot_1[39749] timed out after 120000ms
> Apr 27 02:22:52 h16 pacemaker-fenced[7443]: error: Operation 'reboot'
> [39749] (call 2 from stonith_admin.controld.22336) for host 'h18' with
> device 'prm_stonith_sbd' returned: -62 (Timer expired)
>
> I never saw such a message before. Eventually:
>
> Apr 27 02:24:53 h16 pacemaker-controld[7453]: notice: Stonith operation
> 31/1:3347:0:48bafcab-fecf-4ea0-84a8-c31ab1694b3a: OK (0)
> Apr 27 02:24:53 h16 pacemaker-controld[7453]: notice: Peer h18 was
> terminated (reboot) by h16 on behalf of pacemaker-controld.7453: OK
>
> The only thing I found out was that the kernel does not show RAM corruption
> when running without Xen.
>
> Regards,
> Ulrich
>