[ClusterLabs] Antw: Instable SLES15 SP3 kernel

Wed Apr 27 02:02:57 EDT 2022

Hi!

I want to give a non-update on the issue:
The kernel still segfaults random processes, and in two months support has
provided nothing that could help improve the situation.
The cluster is logging all kinds of non-funny messages like these:

Apr 27 02:20:49 h18 systemd-coredump[22319]: Process 22317 (controld) of
user 0 dumped core.
Apr 27 02:20:49 h18 kernel: BUG: Bad rss-counter state mm:00000000246ea08b
idx:1 val:3
Apr 27 02:20:49 h18 kernel: BUG: Bad rss-counter state mm:00000000259b58a0
idx:1 val:7
Apr 27 02:20:49 h18 controld(prm_DLM)[22330]: ERROR: Uncontrolled lockspace
exists, system must reboot. Executing suicide fencing

For a hypervisor host this means that many VMs are reset the hard way!
Of course, other resources weren't stopped properly either.

There are also two NULL-pointer outputs in the messages on the DC:
Apr 27 02:21:06 h16 dlm_stonith[39797]: stonith_api_time: Found 18 entries for
118/(null): 0 in progress, 17 completed
Apr 27 02:21:06 h16 dlm_stonith[39797]: stonith_api_time: Node 118/(null) last
kicked at: 1650418762

I guess that NULL pointer should have been the host name (h18) in reality.
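
The "last kicked at" value in the second message looks like a Unix epoch
timestamp (that is an assumption, the log does not say so). A minimal Python
sketch to decode it; the helper name epoch_to_utc is only illustrative:

```python
from datetime import datetime, timezone


def epoch_to_utc(seconds):
    """Render a Unix epoch value (seconds) as an ISO 8601 string in UTC."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc).isoformat()


# Value copied from the dlm_stonith log line; treating it as epoch seconds
# is an assumption.
last_kicked = 1650418762
print(epoch_to_utc(last_kicked))  # prints 2022-04-20T01:39:22+00:00
```

If the assumption holds, the value refers to an earlier fencing on 20 April,
not to the one in progress here.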

Also, it seems h18 fenced itself, and the DC (h16), seeing that, wants to
fence it again (to make sure, maybe), but there is some odd problem:

Apr 27 02:21:07 h16 pacemaker-controld[7453]:  notice: Requesting fencing
(reboot) of node h18
Apr 27 02:21:07 h16 pacemaker-fenced[7443]:  notice: Client
pacemaker-controld.7453.a9d67c8b wants to fence (reboot) 'h18' with device
'(any)'
Apr 27 02:21:07 h16 pacemaker-fenced[7443]:  notice: Merging stonith action
'reboot' targeting h18 originating from client pacemaker-controld.7453.73d8bbd6
with identical request from stonith-api.39797 at h16.ea22f429 (360>

Apr 27 02:22:52 h16 pacemaker-fenced[7443]:  warning: fence_legacy_reboot_1
process (PID 39749) timed out
Apr 27 02:22:52 h16 pacemaker-fenced[7443]:  warning:
fence_legacy_reboot_1[39749] timed out after 120000ms
Apr 27 02:22:52 h16 pacemaker-fenced[7443]:  error: Operation 'reboot' [39749]
(call 2 from stonith_admin.controld.22336) for host 'h18' with device
'prm_stonith_sbd' returned: -62 (Timer expired)
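
The "-62 (Timer expired)" return value looks like a negated errno (an
assumption based on the matching message text; on Linux, errno 62 is ETIME).
A small Python sketch to look up such a code; describe_rc is a hypothetical
helper, not part of Pacemaker:

```python
import errno
import os


def describe_rc(rc):
    """Map a (possibly negated) errno value to its symbolic name and message."""
    code = abs(rc)
    name = errno.errorcode.get(code, "UNKNOWN")  # e.g. "ETIME" on Linux for 62
    return name, os.strerror(code)


# -62 is the value from the pacemaker-fenced log line; interpreting it as
# -errno is an assumption. The numeric value of ETIME is platform-dependent.
print(describe_rc(-62))
```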

I have never seen such a message before. Eventually:

Apr 27 02:24:53 h16 pacemaker-controld[7453]:  notice: Stonith operation
31/1:3347:0:48bafcab-fecf-4ea0-84a8-c31ab1694b3a: OK (0)
Apr 27 02:24:53 h16 pacemaker-controld[7453]:  notice: Peer h18 was terminated
(reboot) by h16 on behalf of pacemaker-controld.7453: OK

The only thing I found out is that the kernel, when running without Xen,
does not show RAM corruption.

Regards,
Ulrich