[Pacemaker] OCFS2 problems when connectivity lost

Ivan Savčić | Epix ivan.savcic at epix.rs
Wed Dec 21 11:35:10 EST 2011


On 21.12.2011 13:07, Tim Serong wrote:
> My guess would be:
>
> The filesystem can't stop on the non-quorate node, because the network
> connection is down, so DLM can't do its thing.

Ok.


> The filesystem is probably frozen on the quorate node, because of loss
> of DLM comms.

Ok, same problem as above then.


> If STONITH is configured, the non-quorate node should be killed after a
> failed (or timed out) stop, and the quorate node should resume behaving
> normally.
>
> HTH,
>
> Tim

But the lost DLM communication leads to *both* nodes hanging: the one
being shut down by Pacemaker (because it lost quorum) and the one in
the partition with quorum (which should therefore stay alive).

My point is that at least one OCFS2 node (the one in the partition with
quorum) should somehow survive the lost communication and stay healthy,
but DLM (or something else) gets "stuck" and both nodes hang. That's the
problem.


Ivan



