[ClusterLabs] [Linux-HA] file system resource becomes inaccessible when any of the nodes goes down
dejanmm at fastmail.fm
Mon Jul 6 10:04:53 EDT 2015
On Mon, Jul 06, 2015 at 03:14:34PM +0500, Muhammad Sharfuddin wrote:
> On 07/06/2015 02:50 PM, Dejan Muhamedagic wrote:
> >On Sun, Jul 05, 2015 at 09:13:56PM +0500, Muhammad Sharfuddin wrote:
> >>SLES 11 SP3 + online updates (pacemaker-1.1.11-0.8.11.70).
> >>It's a dual-primary DRBD cluster, which mounts a file system resource
> >>on both cluster nodes simultaneously (the file system type is OCFS2).
> >>Whenever one of the nodes goes down, the file system (/sharedata)
> >>becomes inaccessible for exactly 35 seconds on the other
> >>(surviving/online) node, and then becomes available again on the
> >>online node.
> >>Please help me understand why the node which survives or remains
> >>online is unable to access the file system resource (/sharedata) for 35
> >>seconds, and how I can fix the cluster so that the file system remains
> >>accessible on the surviving node without any interruption/delay (in
> >>my case, about 35 seconds).
> >>By "inaccessible" I mean that running "ls -l /sharedata" and
> >>"df /sharedata" returns no output and does not give the prompt
> >>back on the online node for exactly 35 seconds once the other
> >>node goes offline.
> >>E.g. "node1" went offline somewhere around 01:37:15, and
> >>/sharedata was then inaccessible between 01:37:35 and 01:38:18
> >>on the online node, i.e. "node2".
> >Before the failing node gets fenced you won't be able to use the
> >ocfs2 filesystem. In this case, the fencing operation takes 40
> >seconds.
> so it's expected.
> >>Jul 5 01:37:35 node2 sbd: : info: Writing reset to node slot node1
> >>Jul 5 01:37:35 node2 sbd: : info: Messaging delay: 40
> >>Jul 5 01:38:15 node2 sbd: : info: reset successfully
> >>delivered to node1
> >>Jul 5 01:38:15 node2 sbd: : info: Message successfully delivered.
> >You may want to reduce that sbd timeout.
> OK, so would reducing the sbd timeout (or msgwait) provide
> uninterrupted access to the ocfs2 file system on the
> surviving/online node?
> Or would it just minimize the downtime?
Only the latter. But note that it is important that once sbd
reports success, the target node is really down. sbd is
timeout-based, i.e. it doesn't test whether the node actually
left. Hence this timeout shouldn't be too short.
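To make the advice above concrete, here is a sketch of how the sbd
timeouts could be inspected and adjusted. The device path is a
placeholder for your actual sbd device, and the values shown (20s
msgwait, 10s watchdog) are only illustrative, not a recommendation:

```shell
# Inspect the timeouts currently stored in the sbd device header
# (the device path below is a placeholder, not your real device).
sbd -d /dev/disk/by-id/my-sbd-device dump

# Re-create the header with a shorter msgwait (-4) and watchdog (-1)
# timeout. msgwait should be at least twice the watchdog timeout,
# and must not be shorter than the time a node can plausibly take
# to self-fence; too short and sbd may report success while the
# target node is still up.
sbd -d /dev/disk/by-id/my-sbd-device -4 20 -1 10 create

# Pacemaker's stonith-timeout should exceed msgwait, so the fencing
# operation isn't declared failed before sbd finishes delivering
# the message.
crm configure property stonith-timeout=40s
```

Note that re-creating the sbd header should be done with the cluster
stopped on all nodes, since it rewrites the shared slot metadata.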
> >Linux-HA mailing list is closing down.
> >Please subscribe to users at clusterlabs.org instead.
> >Linux-HA at lists.linux-ha.org
> Muhammad Sharfuddin