[ClusterLabs] heads up: series of core dumps in SLES15 SP3 ("kernel: BUG: Bad rss-counter state mm:00000000d1a9d1f5 idx:1 val:4")

Ulrich Windl Ulrich.Windl at rz.uni-regensburg.de
Thu Feb 10 03:02:03 EST 2022


Hi!

Just a heads-up: shortly after midnight one of our SLES15 SP3 cluster nodes started to send SIGSEGV to processes, eventually also to pacemaker, resulting in a node fence.
I suspect some kernel problem. However, the configuration had been running for about a week (since the last reboot).
(On SP2 we once had a kernel lock-up, and the suspicion was that it might be related to BtrFS balancing or snapshotting, but that was only a suspicion. Snapper was activated at midnight this time, too, so who knows...)
Kernel: 5.3.18-150300.59.43-default

Summary of events:
Feb 10 00:00:02 h16 dbus-daemon[5905]: [system] Successfully activated service 'org.opensuse.Snapper'
Feb 10 00:00:02 h16 systemd[1]: Started DBus interface for snapper.
Feb 10 00:00:02 h16 systemd[1]: snapper-timeline.service: Succeeded.
Feb 10 00:00:02 h16 kernel: traps: mandb[4484] general protection fault ip:7f4f21876160 sp:7ffe25a71ff8 error:0 in libc-2.31.so[7f4f217fa000+1cb000]
Feb 10 00:00:03 h16 systemd-coredump[4488]: Process 4484 (mandb) of user 13 dumped core.
Feb 10 00:00:03 h16 kernel: BUG: Bad rss-counter state mm:00000000d1a9d1f5 idx:1 val:4
Feb 10 00:00:03 h16 kernel: mandb[4547]: segfault at 8b86 ip 0000000000008b86 sp 00007ffe25a73058 error 14 in mandb[55dcc3a12000+20000]
Feb 10 00:00:03 h16 kernel: Code: Bad RIP value.
Feb 10 00:00:04 h16 systemd-coredump[4549]: Process 4547 (mandb) of user 13 dumped core.
Feb 10 00:00:04 h16 kernel: BUG: Bad rss-counter state mm:00000000c4f00529 idx:1 val:5
Feb 10 00:00:05 h16 kernel: BUG: Bad rss-counter state mm:00000000aae27ee5 idx:1 val:59
Feb 10 00:00:06 h16 systemd-coredump[4610]: Process 4606 (mandb) of user 13 dumped core.
Feb 10 00:00:06 h16 kernel: traps: mandb[4640] general protection fault ip:7f4f218c6caf sp:7ffe25a73110 error:0 in libc-2.31.so[7f4f217fa000+1cb000]
Feb 10 00:00:06 h16 kernel: BUG: Bad rss-counter state mm:00000000babee882 idx:1 val:2
Feb 10 00:00:08 h16 systemd-coredump[4645]: Process 4643 (systemd) of user 0 dumped core.

That doesn't sound good, does it?

Feb 10 00:00:08 h16 systemd[4642]: Caught <SEGV>, dumped core as pid 4643.
Feb 10 00:00:08 h16 systemd[4642]: Freezing execution.
Feb 10 00:00:29 h16 kernel: pacemaker-execd[4704]: segfault at 3a46 ip 0000000000003a46 sp 00007ffe2c700508 error 14 in pacemaker-execd[55e474755000+b000]
Feb 10 00:00:29 h16 kernel: Code: Bad RIP value.
Feb 10 00:00:30 h16 kernel: BUG: Bad rss-counter state mm:00000000b1203e21 idx:1 val:2
Feb 10 00:00:34 h16 kernel: libvirtd[5685]: segfault at 0 ip 00007f745c487e73 sp 00007ffc70e95a58 error 6 in libc-2.31.so[7f745c3fe000+1cb000]
Feb 10 00:00:34 h16 kernel: Code: Bad RIP value.
Feb 10 00:00:34 h16 kernel: BUG: Bad rss-counter state mm:00000000d755caae idx:1 val:69691
Feb 10 00:00:34 h16 kernel: VirtualDomain[5781]: segfault at 0 ip 0000000000000000 sp 00007ffdc5c98660 error 14 in bash[55669b8cb000+f1000]
Feb 10 00:00:34 h16 kernel: Code: Bad RIP value.
Feb 10 00:00:35 h16 systemd-coredump[5742]: Process 5689 (Filesystem) of user 0 dumped core.
Feb 10 00:00:35 h16 kernel: BUG: Bad rss-counter state mm:0000000042171789 idx:1 val:2
Feb 10 00:00:36 h16 systemd-coredump[5803]: Process 5781 (VirtualDomain) of user 0 dumped core.
Feb 10 00:00:36 h16 kernel: BUG: Bad rss-counter state mm:00000000713058ae idx:1 val:6
...many more...
Feb 10 00:03:33 h16 systemd-coredump[13479]: Process 13400 (systemd) of user 0 dumped core.
-- Reboot --
Feb 10 00:06:59 h16 kernel: Linux version 5.3.18-150300.59.43-default (geeko at buildhost) (gcc version 7.5.0 (SUSE Linux)) #1 SMP Sun Jan 23 19:27:23 UTC 2022 (c76af22)
(eventually)

Another reboot:
Feb 10 00:08:18 h16 sbd[7067]:    emerg: do_exit: Rebooting system: reboot
-- Reboot --
Feb 10 00:11:43 h16 kernel: Linux version 5.3.18-150300.59.43-default (geeko at buildhost) (gcc version 7.5.0 (SUSE Linux)) #1 SMP Sun Jan 23 19:27:23 UTC 2022 (c76af22)

Since then the node (Dell PowerEdge R7415) is running normally again.
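
For anyone who wants to check their own journal for the same pattern, here is a minimal sketch in Python that picks out the "Bad rss-counter state" lines and counts core dumps per process name. The function name summarize() and the SAMPLE list are made up for illustration, and the regular expressions only assume the line formats shown in the log excerpt above; other kernel or systemd versions may word these messages differently.

```python
import re
from collections import Counter

# Patterns modelled on the journal lines quoted in this message (an assumption;
# other kernel/systemd versions may format these messages differently).
RSS_RE = re.compile(
    r"BUG: Bad rss-counter state mm:(?P<mm>[0-9a-f]+) idx:(?P<idx>\d+) val:(?P<val>-?\d+)"
)
CORE_RE = re.compile(
    r"systemd-coredump\[\d+\]: Process (?P<pid>\d+) \((?P<comm>[^)]+)\) "
    r"of user (?P<uid>\d+) dumped core"
)


def summarize(lines):
    """Return (rss_events, core_counts) for an iterable of journal lines.

    rss_events is a list of (mm, idx, val) tuples in log order;
    core_counts maps process name to the number of core dumps seen.
    """
    rss_events = []
    core_counts = Counter()
    for line in lines:
        m = RSS_RE.search(line)
        if m:
            rss_events.append((m["mm"], int(m["idx"]), int(m["val"])))
            continue
        m = CORE_RE.search(line)
        if m:
            core_counts[m["comm"]] += 1
    return rss_events, core_counts


# A few lines copied from the excerpt above, used as sample input.
SAMPLE = [
    "Feb 10 00:00:03 h16 systemd-coredump[4488]: Process 4484 (mandb) of user 13 dumped core.",
    "Feb 10 00:00:03 h16 kernel: BUG: Bad rss-counter state mm:00000000d1a9d1f5 idx:1 val:4",
    "Feb 10 00:00:04 h16 systemd-coredump[4549]: Process 4547 (mandb) of user 13 dumped core.",
    "Feb 10 00:00:04 h16 kernel: BUG: Bad rss-counter state mm:00000000c4f00529 idx:1 val:5",
    "Feb 10 00:00:08 h16 systemd-coredump[4645]: Process 4643 (systemd) of user 0 dumped core.",
]

if __name__ == "__main__":
    events, cores = summarize(SAMPLE)
    for mm, idx, val in events:
        print(f"mm={mm} idx={idx} val={val}")
    for comm, count in cores.most_common():
        print(f"{comm}: {count} core dump(s)")
```

To run it against a real log, the lines could be read from a saved journal export (for example a text file passed to summarize() via open()) instead of SAMPLE.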

Regards,
Ulrich



