[ClusterLabs] [Linux-HA] file system resource becomes inaccessible when any of the nodes goes down
dejanmm at fastmail.fm
Mon Jul 6 10:04:53 EDT 2015
On Mon, Jul 06, 2015 at 03:14:34PM +0500, Muhammad Sharfuddin wrote:
> On 07/06/2015 02:50 PM, Dejan Muhamedagic wrote:
> >On Sun, Jul 05, 2015 at 09:13:56PM +0500, Muhammad Sharfuddin wrote:
> >>SLES 11 SP3 + online updates (pacemaker-1.1.11-0.8.11.70).
> >>It's a dual-primary DRBD cluster, which mounts a file system resource
> >>on both cluster nodes simultaneously (the file system type is OCFS2).
> >>Whenever one of the nodes goes down, the file system (/sharedata)
> >>becomes inaccessible for exactly 35 seconds on the other
> >>(surviving/online) node, and then becomes available again on the
> >>online node.
> >>Please help me understand why the node which survives or remains
> >>online is unable to access the file system resource (/sharedata) for 35
> >>seconds, and how I can fix the cluster so that the file system remains
> >>accessible on the surviving node without any interruption/delay (in
> >>my case, about 35 seconds).
> >>By "inaccessible" I mean that running "ls -l /sharedata" and
> >>"df /sharedata" returns no output and does not give the prompt
> >>back on the online node for exactly 35 seconds once the other
> >>node goes offline.
> >>E.g. "node1" went offline somewhere around 01:37:15, and
> >>/sharedata was then inaccessible between 01:37:35 and 01:38:18
> >>on the online node, i.e. "node2".
> >Before the failing node gets fenced you won't be able to use the
> >ocfs2 filesystem. In this case, the fencing operation takes 40
> >seconds.
> so it's expected.
> >>Jul 5 01:37:35 node2 sbd: : info: Writing reset to node slot node1
> >>Jul 5 01:37:35 node2 sbd: : info: Messaging delay: 40
> >>Jul 5 01:38:15 node2 sbd: : info: reset successfully
> >>delivered to node1
> >>Jul 5 01:38:15 node2 sbd: : info: Message successfully delivered.
> >You may want to reduce that sbd timeout.
> OK, so would reducing the sbd timeout (or msgwait) provide
> uninterrupted access to the ocfs2 file system on the
> surviving/online node?
> Or would it just minimize the downtime?
Only the latter. But note that it is important that once sbd
reports success, the target node is really down. sbd is
timeout-based, i.e. it doesn't test whether the node actually
left. Hence this timeout shouldn't be too short.
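To make the advice above concrete, here is a sketch of how the sbd
timeouts could be inspected and adjusted. The device path is a
placeholder for your actual sbd device, and the values shown (20s
msgwait, 10s watchdog) are only illustrative, not a recommendation:

```shell
# Inspect the timeouts currently stored in the sbd device header
# (the device path below is a placeholder, not your real device).
sbd -d /dev/disk/by-id/my-sbd-device dump

# Re-create the header with a shorter msgwait (-4) and watchdog (-1)
# timeout. msgwait should be at least twice the watchdog timeout,
# and must not be shorter than the time a node can plausibly take
# to self-fence; too short and sbd may report success while the
# target node is still up.
sbd -d /dev/disk/by-id/my-sbd-device -4 20 -1 10 create

# Pacemaker's stonith-timeout should exceed msgwait, so the fencing
# operation isn't declared failed before sbd finishes delivering
# the message.
crm configure property stonith-timeout=40s
```

Note that re-creating the sbd header should be done with the cluster
stopped on all nodes, since it rewrites the shared slot metadata.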
> >Linux-HA mailing list is closing down.
> >Please subscribe to users at clusterlabs.org instead.
> >Linux-HA at lists.linux-ha.org
> Muhammad Sharfuddin