[ClusterLabs] Antw: Re: Antw: [EXT] Re: Antw: Hanging OCFS2 Filesystem any one else?

Gang He ghe at suse.com
Mon Jul 12 23:39:54 EDT 2021



On 2021/7/12 15:52, Ulrich Windl wrote:
> Hi!
> 
> can you give some details on what is necessary to trigger the problem?
There is an ABBA deadlock between the reflink command and the 
ocfs2_complete_recovery routine (this routine is triggered by a timer, by 
mount, and by node recovery); the deadlock is not always encountered.
For more details, refer to the link below:
https://oss.oracle.com/pipermail/ocfs2-devel/2021-July/015671.html
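
For illustration only, here is a minimal, generic sketch of the ABBA pattern
(the lock and task names are made up and are not the real ocfs2 locks or code
paths involved): two tasks take the same two locks in opposite order.

/* Minimal ABBA deadlock sketch, illustrative only. */
#include <pthread.h>
#include <stddef.h>

static pthread_mutex_t lock_a = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t lock_b = PTHREAD_MUTEX_INITIALIZER;

/* Task 1 (the "reflink" side in this analogy): takes A, then B. */
static void *task1(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&lock_a);
    pthread_mutex_lock(&lock_b);   /* blocks if task2 already holds B */
    pthread_mutex_unlock(&lock_b);
    pthread_mutex_unlock(&lock_a);
    return NULL;
}

/* Task 2 (the "recovery" side in this analogy): takes B, then A. */
static void *task2(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&lock_b);
    pthread_mutex_lock(&lock_a);   /* blocks if task1 already holds A */
    pthread_mutex_unlock(&lock_a);
    pthread_mutex_unlock(&lock_b);
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, task1, NULL);
    pthread_create(&t2, NULL, task2, NULL);
    /* With unlucky timing, t1 holds A and waits for B while t2 holds B
     * and waits for A, so neither can make progress. */
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}

With unlucky timing each task ends up holding one lock and waiting for the
other, which is also why the deadlock is only hit occasionally rather than on
every run.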

Thanks
Gang

> (I/O load, CPU load, concurrent operations on one node or on multiple nodes,
> using reflink snapshots, using ioctl(FS_IOC_FIEMAP), etc.)
> 
> Regards,
> Ulrich
> 
>>>> Gang He <GHe at suse.com> wrote on 11.07.2021 at 10:55 in message
> <AM6PR04MB648817316DB7B124F414D60ACF169 at AM6PR04MB6488.eurprd04.prod.outlook.com>
> 
>> Hi Ulrich,
>>
>> Thanks for your update.
>> Based on feedback from upstream, there is a patch (ocfs2:
>> initialize ip_next_orphan) which should fix this problem.
>> I can confirm that the problem it addresses looks very similar to yours.
>> I will verify it next week and then let you know the result.
>>
>> Thanks
>> Gang
>>
>> ________________________________________
>> From: Users <users-bounces at clusterlabs.org> on behalf of Ulrich Windl
>> <Ulrich.Windl at rz.uni-regensburg.de>
>> Sent: Friday, July 9, 2021 15:56
>> To: users at clusterlabs.org
>> Subject: [ClusterLabs] Antw: [EXT] Re: Antw: Hanging OCFS2 Filesystem any
>> one else?
>>
>> Hi!
>>
>> An update on the issue:
>> SUSE support found out that the reason for the hanging processes is a
>> deadlock caused by a race condition (Kernel 5.3.18-24.64-default). Support
>> is working on a fix.
>> Today the cluster "fixed" the problem in an unusual way:
>>
>> h19 kernel: Out of memory: Killed process 6838 (corosync) total-vm:261212kB,
>> anon-rss:31444kB, file-rss:7700kB, shmem-rss:121872kB
>>
>> I doubt that was the best possible choice ;-)
>>
>> The death of corosync caused the DC (h18) to fence h19 (which succeeded),
>> but the DC itself was fenced while it was trying to recover resources, so
>> the whole cluster rebooted.
>>
>> Regards,
>> Ulrich
>>
>>
>>
>>


