[Pacemaker] OCFS2 problems when connectivity lost

Wed Dec 21 07:07:25 EST 2011

On 12/21/2011 09:47 PM, Ivan Savčić | Epix wrote:
> Hello,
>
>
> We are having a problem with a 3-node cluster based on
> Pacemaker/Corosync with 2 primary DRBD+OCFS2 nodes and a quorum node.
>
> Nodes run on Debian Squeeze, all packages are from the stable branch
> except for Corosync (which is from backports for udpu functionality).
> Each node has a single network card.
>
> When the network is up, everything works without any problems, graceful
> shutdown of resources on any node works as intended and doesn't reflect
> on the remaining cluster partition.
>
> When the network is down on one OCFS2 node, Pacemaker
> (no-quorum-policy="stop") tries to shut the resources down on that node,
> but fails to stop the OCFS2 filesystem resource stating that it is "in
> use".
>
> *Both* OCFS2 nodes (i.e. the one with the network down and the one which
> is still up in the partition with quorum) hang with dmesg reporting that
> events, ocfs2rec and ocfs2_wq are "blocked for more than 120 seconds".

My guess would be:

The filesystem can't stop on the non-quorate node because its network
connection is down, so the DLM there can't reach its peer to release
its locks and unmount cleanly.

The filesystem is probably frozen on the quorate node because it has
lost DLM communication with the other node.

If STONITH is configured, the non-quorate node should be fenced (killed)
after a failed or timed-out stop, and once that has happened the quorate
node should resume behaving normally.
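
As a rough illustration of the STONITH point, a crm configure fragment
along these lines is the kind of thing I have in mind. It is only a
sketch: the resource name, DRBD device, mount point and timeouts are
made up, the clone and ordering constraints are left out, and you would
still need a stonith primitive that matches your actual fencing
hardware.

```
# Cluster-wide options: fencing on, stop resources when quorum is lost
property stonith-enabled="true" \
         no-quorum-policy="stop"

# Hypothetical OCFS2 filesystem resource (names and paths are examples).
# on-fail="fence" asks the cluster to fence the node if the stop
# operation fails or times out.
primitive p_fs_ocfs2 ocf:heartbeat:Filesystem \
    params device="/dev/drbd0" directory="/mnt/ocfs2" fstype="ocfs2" \
    op monitor interval="20s" timeout="40s" \
    op stop interval="0" timeout="60s" on-fail="fence"
```

As far as I know, on-fail="fence" is already the default for stop
operations once STONITH is enabled, so spelling it out mainly documents
the intent.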

HTH,

Tim
-- 
Tim Serong
Senior Clustering Engineer
SUSE
tserong at suse.com