[ClusterLabs] Antw: Hanging OCFS2 Filesystem any one else?

Ulrich Windl Ulrich.Windl at rz.uni-regensburg.de
Tue Jun 1 03:14:48 EDT 2021


>>> Ulrich Windl wrote on 31.05.2021 at 12:11 in message <60B4B65A.A8F : 161 :
60728>:
> Hi!
> 
> We have an OCFS2 filesystem shared between three cluster nodes (SLES 15 SP2, 
> Kernel 5.3.18-24.64-default). The filesystem is filled up to about 95%, and 
> we have an odd effect:
> A stat() system call on some of the files hangs indefinitely (state "D").
> ("ls -l" and "rm" also hang, but I suspect those call stat() 
> internally, too).
> My first suspect is that the effect is related to the filesystem being about 95% full.
> The other suspect is that concurrent reflink calls may trigger the effect.
> 
> Did anyone else experience something similar?
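A quick way to find such stuck tasks is to list everything in uninterruptible sleep and dump the kernel stacks. This is just a sketch, assuming a Linux /proc layout; reading /proc/PID/stack typically requires root:

```shell
# List all tasks in uninterruptible sleep ("D") and dump their
# kernel stacks where readable.  Needs root for /proc/<pid>/stack.
ps -eo pid=,stat=,comm= | awk '$2 ~ /^D/ { print $1, $3 }' |
while read -r pid comm; do
    printf '=== %s (%s) ===\n' "$pid" "$comm"
    cat "/proc/$pid/stack" 2>/dev/null || echo "(stack not readable)"
done
```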

Hi!

I have some details:
It seems there is a reader/writer lock deadlock while trying to allocate additional blocks for a file.
The stacktrace looks like this:
Jun 01 07:56:31 h16 kernel:  rwsem_down_write_slowpath+0x251/0x620
Jun 01 07:56:31 h16 kernel:  ? __ocfs2_change_file_space+0xb3/0x620 [ocfs2]
Jun 01 07:56:31 h16 kernel:  __ocfs2_change_file_space+0xb3/0x620 [ocfs2]
Jun 01 07:56:31 h16 kernel:  ocfs2_fallocate+0x82/0xa0 [ocfs2]
Jun 01 07:56:31 h16 kernel:  vfs_fallocate+0x13f/0x2a0
Jun 01 07:56:31 h16 kernel:  ksys_fallocate+0x3c/0x70
Jun 01 07:56:31 h16 kernel:  __x64_sys_fallocate+0x1a/0x20
Jun 01 07:56:31 h16 kernel:  do_syscall_64+0x5b/0x1e0
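For reference, this writer path is what a plain fallocate(2) call from userspace produces, e.g. via the util-linux fallocate tool (the path below is a placeholder, not from our setup):

```shell
# Pre-allocate 1 MiB; on OCFS2 this enters ocfs2_fallocate() ->
# __ocfs2_change_file_space(), which takes the inode's rw_sem for
# writing -- the semaphore the writer above is blocked on.
# /srv/ocfs2/testfile is an illustrative path.
fallocate -l 1M /srv/ocfs2/testfile
ls -l /srv/ocfs2/testfile
```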

That is the only writer (on that host), but there are multiple readers like this:
Jun 01 07:56:31 h16 kernel:  rwsem_down_read_slowpath+0x172/0x300
Jun 01 07:56:31 h16 kernel:  ? dput+0x2c/0x2f0
Jun 01 07:56:31 h16 kernel:  ? lookup_slow+0x27/0x50
Jun 01 07:56:31 h16 kernel:  lookup_slow+0x27/0x50
Jun 01 07:56:31 h16 kernel:  walk_component+0x1c4/0x300
Jun 01 07:56:31 h16 kernel:  ? path_init+0x192/0x320
Jun 01 07:56:31 h16 kernel:  path_lookupat+0x6e/0x210
Jun 01 07:56:31 h16 kernel:  ? __put_lkb+0x45/0xd0 [dlm]
Jun 01 07:56:31 h16 kernel:  filename_lookup+0xb6/0x190
Jun 01 07:56:31 h16 kernel:  ? kmem_cache_alloc+0x3d/0x250
Jun 01 07:56:31 h16 kernel:  ? getname_flags+0x66/0x1d0
Jun 01 07:56:31 h16 kernel:  ? vfs_statx+0x73/0xe0
Jun 01 07:56:31 h16 kernel:  vfs_statx+0x73/0xe0
Jun 01 07:56:31 h16 kernel:  ? fsnotify_grab_connector+0x46/0x80
Jun 01 07:56:31 h16 kernel:  __do_sys_newstat+0x39/0x70
Jun 01 07:56:31 h16 kernel:  ? do_unlinkat+0x92/0x320
Jun 01 07:56:31 h16 kernel:  do_syscall_64+0x5b/0x1e0

So that matches the hanging stat() quite nicely!

However, the PID reported as holding the write lock does not exist on that node.
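Verifying that is straightforward; assuming the PID comes from the rwsem owner field in the trace (12345 below is a placeholder, not the actual value):

```shell
# Check whether a reported lock-holder PID still exists on this node.
# On a cluster filesystem the real holder may be on another node, or
# the owner field may simply be stale.
pid=12345   # placeholder PID
if [ -d "/proc/$pid" ]; then
    echo "PID $pid exists: $(cat /proc/$pid/comm)"
else
    echo "PID $pid does not exist on this node"
fi
```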

Regards,
Ulrich


> 
> Regards,
> Ulrich
> 
> 
> 
> 





