[ClusterLabs] Antw: heads up: series of core dumps in SLES15 SP3 ("kernel: BUG: Bad rss-counter state mm:00000000d1a9d1f5 idx:1 val:4")
Ulrich Windl
Ulrich.Windl at rz.uni-regensburg.de
Mon Mar 28 10:05:27 EDT 2022
Hi!
I want to keep you updated: The problem still isn't fixed, so I'm running this simple script via cron to avoid an uncontrolled kernel panic:
---snip---
#!/usr/bin/sh
# Detect RAM corruption. If detected, log a message and reboot
# to prevent a kernel panic.
# cron jobs need a PATH
PATH=/sbin:/usr/sbin:/usr/bin:/bin
if journalctl -b -g 'Code: Bad RIP value|BUG: Bad rss-counter state mm:' >/dev/null
then
    MSG='RAM corruption detected, starting pro-active reboot'
    logger -t reboot-before-panic -p local0.notice "$MSG"
    shutdown -r +1 "$MSG"
fi
---
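
A minimal sketch for trying out the detection pattern on sample lines without touching the journal or scheduling a reboot. It assumes grep -E as a stand-in for journalctl -g: journalctl uses PCRE2 and matches only the message text, but this pattern consists of literals and one alternation, so both should behave the same here. The helper name matches_corruption and the two SAMPLE_ variables are made up for illustration; the sample lines are copied from the log excerpt in this message.

```shell
#!/bin/sh
# Sketch only: check the cron script's pattern against sample log lines.
# PATTERN is copied from the script; everything else is hypothetical.
PATTERN='Code: Bad RIP value|BUG: Bad rss-counter state mm:'

# Return 0 if the text on stdin contains one of the corruption messages.
# Assumption: grep -E stands in for "journalctl -b -g" here.
matches_corruption() {
    grep -E -q "$PATTERN"
}

SAMPLE_BAD='Mar 26 23:00:01 h19 kernel: BUG: Bad rss-counter state mm:00000000acc74328 idx:1 val:14'
SAMPLE_OK='Mar 26 23:00:01 h19 systemd[1]: snapper-timeline.service: Succeeded.'

if printf '%s\n' "$SAMPLE_BAD" | matches_corruption; then
    echo "bad sample: match"
fi
if ! printf '%s\n' "$SAMPLE_OK" | matches_corruption; then
    echo "ok sample: no match"
fi
```

Run this way, only the kernel line should be reported as a match, so an ordinary snapper or systemd message alone would not trigger the reboot.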
I still suspect it is related to snapshots being made. After a few days of uptime, the problems started again like this:
Mar 26 23:00:01 h19 systemd[1]: Started Timeline of Snapper Snapshots.
Mar 26 23:00:01 h19 dbus-daemon[5700]: [system] Activating via systemd: service name='org.opensuse.Snapper' unit='snapperd.service' requested by ':1.343' (uid=0 pid=11200 comm="/usr/lib/snapper/systemd-helper --timeline ")
Mar 26 23:00:01 h19 systemd[1]: Starting DBus interface for snapper...
Mar 26 23:00:01 h19 dbus-daemon[5700]: [system] Successfully activated service 'org.opensuse.Snapper'
Mar 26 23:00:01 h19 systemd[1]: Started DBus interface for snapper.
Mar 26 23:00:01 h19 systemd[1]: snapper-timeline.service: Succeeded.
Mar 26 23:00:01 h19 systemd[1]: Created slice Slice /system/systemd-coredump.
Mar 26 23:00:01 h19 systemd[1]: Started Process Core Dump (PID 11227/UID 0).
Mar 26 23:00:01 h19 systemd-coredump[11231]: Process 11226 (run-crons) of user 0 dumped core.
Stack trace of thread 11226:
#0 0x00007f89ff9dacdb raise (libc.so.6 + 0x4acdb)
#1 0x00007f89ff9dc324 abort (libc.so.6 + 0x4c324)
#2 0x00007f89ffa20b07 __libc_message (libc.so.6 + 0x90b07)
#3 0x00007f89ffa28b8a malloc_printerr (libc.so.6 + 0x98b8a)
#4 0x00007f89ffa2a634 _int_free (libc.so.6 + 0x9a634)
#5 0x000055c998de3963 command_substitute (bash + 0x9f963)
#6 0x000055c998ddb380 n/a (bash + 0x97380)
#7 0x000055c998ddda57 n/a (bash + 0x99a57)
#8 0x000055c998ddcb94 n/a (bash + 0x98b94)
#9 0x000055c998dc8955 n/a (bash + 0x84955)
#10 0x000055c998dc756d execute_command_internal (bash + 0x8356d)
#11 0x000055c998dc86e1 execute_command (bash + 0x846e1)
#12 0x000055c998dc76fd execute_command_internal (bash + 0x836fd)
#13 0x000055c998dc86e1 execute_command (bash + 0x846e1)
#14 0x000055c998dc8516 execute_command_internal (bash + 0x84516)
#15 0x000055c998dc773c execute_command_internal (bash + 0x8373c)
#16 0x000055c998dc86e1 execute_command (bash + 0x846e1)
#17 0x000055c998dc8007 execute_command_internal (bash + 0x84007)
#18 0x000055c998dc86e1 execute_command (bash + 0x846e1)
#19 0x000055c998dbce2b reader_loop (bash + 0x78e2b)
#20 0x000055c998dbcabc main (bash + 0x78abc)
#21 0x00007f89ff9c52bd __libc_start_main (libc.so.6 + 0x352bd)
#22 0x000055c998df729a _start (bash + 0xb329a)
Mar 26 23:00:01 h19 systemd[1]: systemd-coredump@0-11227-0.service: Succeeded.
Mar 26 23:00:01 h19 kernel: BUG: Bad rss-counter state mm:00000000acc74328 idx:1 val:14
Mar 26 23:01:01 h19 systemd[1]: snapperd.service: Succeeded.
Mar 26 23:05:01 h19 reboot-before-panic[12356]: RAM corruption detected, starting pro-active reboot
Regards,
Ulrich