[ClusterLabs] Antw: heads up: series of core dumps in SLES15 SP3 ("kernel: BUG: Bad rss-counter state mm:00000000d1a9d1f5 idx:1 val:4")
Ulrich Windl
Ulrich.Windl at rz.uni-regensburg.de
Tue Mar 1 03:48:26 EST 2022
Hi!
I want to give an update on this issue (support is working on it):
First I recommend everyone using Xen and a Dell PowerEdge R7415 _not_ to upgrade to SLES15 SP3, as we have about one crash per node and week.
We had one last night a few minutes after BtrFS balance had finished.
Meanwhile we also had a few crash dumps (kdump), and the call stacks are like this:
(5.3.18-150300.59.43-default)
[1175886.947081] ocfs2: Mounting device (9,10) on (node 116, slot 1) with ordered data mode.
[1175905.783132] general protection fault: 0000 [#1] SMP NOPTI
[1175905.785704] RIP: e030:down_read_trylock+0x18/0x50
[1175905.798982] Call Trace:
[1175905.800305] page_lock_anon_vma_read+0x48/0xe0
[1175905.801659] rmap_walk_anon+0x16c/0x250
[1175905.803021] page_referenced+0xd5/0x170
[1175905.804254] ? rmap_walk_anon+0x250/0x250
[1175905.805377] ? page_get_anon_vma+0x80/0x80
[1175905.806593] shrink_active_list+0x2dd/0x490
[1175905.807920] shrink_lruvec+0x4aa/0x6e0
[1175905.809253] ? free_unref_page_list+0x16f/0x180
[1175905.810460] ? free_unref_page_list+0x16f/0x180
[1175905.811024] ? shrink_node+0x143/0x600
[1175905.811593] shrink_node+0x143/0x600
[1175905.812149] balance_pgdat+0x28a/0x630
[1175905.812716] kswapd+0x14b/0x3d0
[1175905.813263] ? wait_woken+0x80/0x80
[1175905.813793] ? balance_pgdat+0x630/0x630
[1175905.814319] kthread+0x10d/0x130
[1175905.814847] ? kthread_park+0xa0/0xa0
[1175905.815386] ret_from_fork+0x22/0x40
(5.3.18-150300.59.49-default)
[27926.595977] BUG: kernel NULL pointer dereference, address: 0000000000000007
[27926.597124] #PF: supervisor write access in kernel mode
[27926.598197] #PF: error_code(0x0002) - not-present page
[27926.599265] PGD 0 P4D 0
[27926.600322] Oops: 0002 [#1] SMP NOPTI
[27926.603924] RIP: e030:down_read_trylock+0x18/0x50
[27926.618618] Call Trace:
[27926.619138] page_lock_anon_vma_read+0x48/0xe0
[27926.619665] rmap_walk_anon+0x16c/0x250
[27926.620182] page_referenced+0xd5/0x170
[27926.620705] ? rmap_walk_anon+0x250/0x250
[27926.621214] ? page_get_anon_vma+0x80/0x80
[27926.621772] shrink_active_list+0x2dd/0x490
[27926.622280] balance_pgdat+0x50b/0x630
[27926.622793] kswapd+0x14b/0x3d0
[27926.623308] ? wait_woken+0x80/0x80
[27926.623822] ? balance_pgdat+0x630/0x630
[27926.624329] kthread+0x10d/0x130
[27926.624843] ? kthread_park+0xa0/0xa0
[27926.625577] ret_from_fork+0x22/0x40
(5.3.18-150300.59.49-default)
[566428.257264] BTRFS info (device dm-0): scrub: finished on devid 1 with status: 0
[571252.379396] ping[3707]: segfault at 0 ip 0000000000000000 sp 00007fffbae8dc10 error 14 in bash[55a3a9d26000+f1000]
[571252.379410] Code: Bad RIP value.
[571252.948920] BUG: Bad rss-counter state mm:000000004d568db6 idx:1 val:4
[571262.985375] general protection fault: 0000 [#1] SMP NOPTI
[571262.985410] RIP: e030:down_read_trylock+0x18/0x50
[571262.985470] Call Trace:
[571262.985481] page_lock_anon_vma_read+0x48/0xe0
[571262.985487] rmap_walk_anon+0x16c/0x250
[571262.985492] page_referenced+0xd5/0x170
[571262.985496] ? rmap_walk_anon+0x250/0x250
[571262.985500] ? page_get_anon_vma+0x80/0x80
[571262.985506] shrink_active_list+0x2dd/0x490
[571262.985512] shrink_lruvec+0x4aa/0x6e0
[571262.985517] ? free_unref_page_list+0x16f/0x180
[571262.985522] ? free_unref_page_list+0x16f/0x180
[571262.985526] ? shrink_node+0x143/0x600
[571262.985529] shrink_node+0x143/0x600
[571262.985534] balance_pgdat+0x28a/0x630
[571262.985539] kswapd+0x14b/0x3d0
[571262.985544] ? wait_woken+0x80/0x80
[571262.985548] ? balance_pgdat+0x630/0x630
[571262.985553] kthread+0x10d/0x130
[571262.985557] ? kthread_park+0xa0/0xa0
[571262.985563] ret_from_fork+0x22/0x40
(5.3.18-150300.59.49-default)
[22707.270890] #PF: supervisor write access in kernel mode
[22707.271539] #PF: error_code(0x0002) - not-present page
[22707.272159] PGD 0 P4D 0
[22707.272786] Oops: 0002 [#1] SMP NOPTI
[22707.274680] RIP: e030:down_read_trylock+0x18/0x50
[22707.282129] Call Trace:
[22707.282682] page_lock_anon_vma_read+0x48/0xe0
[22707.283225] rmap_walk_anon+0x16c/0x250
[22707.283772] ? uncharge_batch+0xe3/0x180
[22707.284307] try_to_unmap+0x93/0xf0
[22707.284835] ? page_remove_rmap+0x2c0/0x2c0
[22707.285375] ? page_not_mapped+0x20/0x20
[22707.285906] ? page_get_anon_vma+0x80/0x80
[22707.286428] ? invalid_mkclean_vma+0x20/0x20
[22707.286974] migrate_pages+0x857/0xb50
[22707.287494] ? isolate_freepages_block+0x370/0x370
[22707.288013] ? move_freelist_tail+0xd0/0xd0
[22707.288538] compact_zone+0x775/0xd90
[22707.289059] kcompactd_do_work+0xfe/0x2a0
[22707.289576] ? xen_load_sp0+0x7a/0x160
[22707.290096] ? __set_cpus_allowed_ptr+0xb5/0x1e0
[22707.290623] ? kcompactd_do_work+0x2a0/0x2a0
[22707.291158] ? kcompactd+0x84/0x1e0
[22707.291703] kcompactd+0x84/0x1e0
[22707.292243] ? wait_woken+0x80/0x80
[22707.292788] kthread+0x10d/0x130
[22707.293329] ? kthread_park+0xa0/0xa0
[22707.293868] ret_from_fork+0x22/0x40
All those dumps happened with BIOS 1.17.0; today I realized that there was a critical BIOS update (1.18.0) that looks as if the CPU microcode was updated.
However I don't know any details abot the change.
Regards,
Ulrich
>>> Ulrich Windl schrieb am 10.02.2022 um 09:02 in Nachricht <6204C67B.FDC : 161 :
60728>:
> Hi!
>
> just a heads up: Shortly after midnight one of our SLES15 SP3 cluster nodes
> started to send SIGSEGV to processes, eventually also to pacemaker, resulting
> in a node fence.
> I suspect some kernel problem. The configuration was running since a week or
> so (since last reboot), however.
> (For SP2 we once had a kernel lock-up, and the suspect was that it might be
> related to BtrFS balancing or snapshotting, but that was just as suspect. At
> midnight snapper was activated, too, so who knows...)
> Kernel: 5.3.18-150300.59.43-default
>
> Summary of events:
> Feb 10 00:00:02 h16 dbus-daemon[5905]: [system] Successfully activated
> service 'org.opensuse.Snapper'
> Feb 10 00:00:02 h16 systemd[1]: Started DBus interface for snapper.
> Feb 10 00:00:02 h16 systemd[1]: snapper-timeline.service: Succeeded.
> Feb 10 00:00:02 h16 kernel: traps: mandb[4484] general protection fault
> ip:7f4f21876160 sp:7ffe25a71ff8 error:0 in libc-2.31.so[7f4f217fa000+1cb000]
> Feb 10 00:00:03 h16 systemd-coredump[4488]: Process 4484 (mandb) of user 13
> dumped core.
> Feb 10 00:00:03 h16 kernel: BUG: Bad rss-counter state mm:00000000d1a9d1f5
> idx:1 val:4
> Feb 10 00:00:03 h16 kernel: mandb[4547]: segfault at 8b86 ip
> 0000000000008b86 sp 00007ffe25a73058 error 14 in mandb[55dcc3a12000+20000]
> Feb 10 00:00:03 h16 kernel: Code: Bad RIP value.
> Feb 10 00:00:04 h16 systemd-coredump[4549]: Process 4547 (mandb) of user 13
> dumped core.
> Feb 10 00:00:04 h16 kernel: BUG: Bad rss-counter state mm:00000000c4f00529
> idx:1 val:5
> Feb 10 00:00:05 h16 kernel: BUG: Bad rss-counter state mm:00000000aae27ee5
> idx:1 val:59
> Feb 10 00:00:06 h16 systemd-coredump[4610]: Process 4606 (mandb) of user 13
> dumped core.
> Feb 10 00:00:06 h16 kernel: traps: mandb[4640] general protection fault
> ip:7f4f218c6caf sp:7ffe25a73110 error:0 in libc-2.31.so[7f4f217fa000+1cb000]
> Feb 10 00:00:06 h16 kernel: BUG: Bad rss-counter state mm:00000000babee882
> idx:1 val:2
> Feb 10 00:00:08 h16 systemd-coredump[4645]: Process 4643 (systemd) of user 0
> dumped core.
>
> That doesn't sound good, does it?
>
> Feb 10 00:00:08 h16 systemd[4642]: Caught <SEGV>, dumped core as pid 4643.
> Feb 10 00:00:08 h16 systemd[4642]: Freezing execution.
> Feb 10 00:00:29 h16 kernel: pacemaker-execd[4704]: segfault at 3a46 ip
> 0000000000003a46 sp 00007ffe2c700508 error 14 in
> pacemaker-execd[55e474755000+b000]
> Feb 10 00:00:29 h16 kernel: Code: Bad RIP value.
> Feb 10 00:00:30 h16 kernel: BUG: Bad rss-counter state mm:00000000b1203e21
> idx:1 val:2
> Feb 10 00:00:34 h16 kernel: libvirtd[5685]: segfault at 0 ip
> 00007f745c487e73 sp 00007ffc70e95a58 error 6 in
> libc-2.31.so[7f745c3fe000+1cb000]
> Feb 10 00:00:34 h16 kernel: Code: Bad RIP value.
> Feb 10 00:00:34 h16 kernel: BUG: Bad rss-counter state mm:00000000d755caae
> idx:1 val:69691
> Feb 10 00:00:34 h16 kernel: VirtualDomain[5781]: segfault at 0 ip
> 0000000000000000 sp 00007ffdc5c98660 error 14 in bash[55669b8cb000+f1000]
> Feb 10 00:00:34 h16 kernel: Code: Bad RIP value.
> Feb 10 00:00:35 h16 systemd-coredump[5742]: Process 5689 (Filesystem) of
> user 0 dumped core.
> Feb 10 00:00:35 h16 kernel: BUG: Bad rss-counter state mm:0000000042171789
> idx:1 val:2
> Feb 10 00:00:36 h16 systemd-coredump[5803]: Process 5781 (VirtualDomain) of
> user 0 dumped core.
> Feb 10 00:00:36 h16 kernel: BUG: Bad rss-counter state mm:00000000713058ae
> idx:1 val:6
> ...many more...
> Feb 10 00:03:33 h16 systemd-coredump[13479]: Process 13400 (systemd) of user
> 0 dumped core.
> -- Reboot --
> Feb 10 00:06:59 h16 kernel: Linux version 5.3.18-150300.59.43-default
> (geeko at buildhost) (gcc version 7.5.0 (SUSE Linux)) #1 SMP Sun Jan 23 19:27:23
> UTC 2022 (c76af22)
> (eventually)
>
> Another reboot:
> Feb 10 00:08:18 h16 sbd[7067]: emerg: do_exit: Rebooting system: reboot
> -- Reboot --
> Feb 10 00:11:43 h16 kernel: Linux version 5.3.18-150300.59.43-default
> (geeko at buildhost) (gcc version 7.5.0 (SUSE Linux)) #1 SMP Sun Jan 23 19:27:23
> UTC 2022 (c76af22)
>
> Since then the node (Dell PowerEdge R7415) is running normally again.
>
> Regards,
> Ulrich
>
>
More information about the Users
mailing list