[ClusterLabs] [Linux-HA] file system resource becomes inaccessible when any of the nodes goes down

Muhammad Sharfuddin M.Sharfuddin at nds.com.pk
Mon Jul 6 11:56:17 EDT 2015


On 07/06/2015 07:04 PM, Dejan Muhamedagic wrote:
> On Mon, Jul 06, 2015 at 03:14:34PM +0500, Muhammad Sharfuddin wrote:
>> On 07/06/2015 02:50 PM, Dejan Muhamedagic wrote:
>>> Hi,
>>>
>>> On Sun, Jul 05, 2015 at 09:13:56PM +0500, Muhammad Sharfuddin wrote:
>>>> SLES 11 SP3 + online updates (pacemaker-1.1.11-0.8.11.70,
>>>> openais-1.1.4-5.22.1.7)
>>>>
>>>> It's a dual-primary DRBD cluster, which mounts a file system resource
>>>> on both cluster nodes simultaneously (the file system type is OCFS2).
>>>>
>>>> Whenever one of the nodes goes down, the file system (/sharedata)
>>>> becomes inaccessible for exactly 35 seconds on the other
>>>> (surviving/online) node, and then becomes available again on the
>>>> online node.
>>>>
>>>> Please help me understand why the node which survives or remains
>>>> online is unable to access the file system resource (/sharedata) for
>>>> 35 seconds, and how I can fix the cluster so that the file system
>>>> remains accessible on the surviving node without any
>>>> interruption/delay (about 35 seconds in my case).
>>>>
>>>> By inaccessible, I mean that running "ls -l /sharedata" and
>>>> "df /sharedata" returns no output and does not return the prompt
>>>> on the online node for exactly 35 seconds once the other node
>>>> goes offline.
>>>>
>>>> e.g "node1" got offline somewhere around  01:37:15, and then
>>>> /sharedata file system was inaccessible during 01:37:35 and 01:38:18
>>>> on the online node i.e "node2".
>>> Before the failing node gets fenced you won't be able to use the
>>> ocfs2 filesystem. In this case, the fencing operation takes 40
>>> seconds:
>> So it's expected, then.
>>>> [...]
>>>> Jul  5 01:37:35 node2 sbd: [6197]: info: Writing reset to node slot node1
>>>> Jul  5 01:37:35 node2 sbd: [6197]: info: Messaging delay: 40
>>>> Jul  5 01:38:15 node2 sbd: [6197]: info: reset successfully delivered to node1
>>>> Jul  5 01:38:15 node2 sbd: [6196]: info: Message successfully delivered.
>>>> [...]
>>> You may want to reduce that sbd timeout.
>> OK, so would reducing the sbd timeout (or msgwait) provide
>> uninterrupted access to the OCFS2 file system on the
>> surviving/online node, or would it just minimize the downtime?
> Only the latter. But note that it is important that once sbd
> reports success, the target node is really down. sbd is
> timeout-based, i.e. it doesn't test whether the node actually
> left. Hence this timeout shouldn't be too short.

Hmm, by the way, for the watchdog and msgwait timeout values I have 
always blindly followed the values suggested at 
https://www.novell.com/support/kb/doc.php?id=7011346
where the suggested values are 20 for watchdog and 40 for msgwait.

Since the ~40-second hang I observed matches the msgwait timeout ("Messaging 
delay: 40" in the logs above), I'll re-test the setup after reducing the 
watchdog timeout to 10 and msgwait to 20, roughly as sketched below.
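
For my own reference, a rough sketch of the steps (the SBD device path 
below is just a placeholder for my actual partition, and as far as I 
know the header can only be rewritten while the cluster stack is 
stopped on all nodes):

    # show the timeouts currently stored in the SBD header
    sbd -d /dev/disk/by-id/<sbd-device> dump

    # re-initialize the device with watchdog=10 and msgwait=20
    # (-1 sets the watchdog timeout, -4 the msgwait timeout)
    sbd -d /dev/disk/by-id/<sbd-device> -1 10 -4 20 create

    # verify the new header
    sbd -d /dev/disk/by-id/<sbd-device> dump

If I understand the docs correctly, the cluster's stonith-timeout 
property should also stay comfortably above msgwait, e.g.:

    crm configure property stonith-timeout=30s

That keeps the 2:1 msgwait-to-watchdog ratio the KB article uses, just 
with both values halved.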

> Thanks,
>
> Dejan
>

-- 
Regards,

Muhammad Sharfuddin