[ClusterLabs] Antw: Re: OCFS2 on cLVM with node waiting for fencing timeout

Tue Oct 11 06:18:31 UTC 2016

>>> emmanuel segura <emi2fast at gmail.com> wrote on 10.10.2016 at 16:49 in
message
<CAE7pJ3CBJR3pctT3N_jaMCXBuUGD3nta=yA8FZNbNfAifK3uXg at mail.gmail.com>:

Node h01 (old DC) was fenced at Oct 10 10:06:33.
Node h01 went down around Oct 10 10:06:37.
DLM noticed that on node h05:
Oct 10 10:06:44 h05 cluster-dlm[12063]: dlm_process_node: Removed inactive node 739512321: born-on=3180, last-seen=3208, this-event=3212, last-event=3208
cLVM and OCFS2 also noticed the event:
Oct 10 10:06:44 h05 ocfs2_controld[12147]: Sending notification of node 739512321 for "490B9FCAFA3D4B2F9A586A5893E00730"
Oct 10 10:06:44 h05 ocfs2_controld[12147]: Notified for "490B9FCAFA3D4B2F9A586A5893E00730", node 739512321, status 0

Similarly on node h10 (new DC):
Oct 10 10:06:44 h10 cluster-dlm[32150]: dlm_process_node: Removed inactive node 739512321: born-on=3180, last-seen=3208, this-event=3212, last-event=3208
Oct 10 10:06:44 h10 ocfs2_controld[32271]:   notice: crm_update_peer_state: plugin_handle_membership: Node h01[739512321] - state is now lost (was member)
Oct 10 10:06:44 h10 ocfs2_controld[32271]: node daemon left 739512321
Oct 10 10:06:44 h10 ocfs2_controld[32271]: Sending notification of node 739512321 for "490B9FCAFA3D4B2F9A586A5893E00730"
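
To put numbers on the timeline above, here is a small sketch that computes the gaps between the fencing time, the time the node went down, and the DLM removal message. The helper name parse_syslog_time is made up for illustration, and the year 2016 is an assumption, since syslog lines do not carry a year.

```python
from datetime import datetime

def parse_syslog_time(line, year=2016):
    # Syslog lines start with "Mon DD HH:MM:SS"; the year is not logged,
    # so it is passed in as an assumption.
    stamp = " ".join(line.split()[:3])
    return datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")

# Times quoted in the message above.
fenced = parse_syslog_time("Oct 10 10:06:33")
node_down = parse_syslog_time("Oct 10 10:06:37")
dlm_removed = parse_syslog_time(
    "Oct 10 10:06:44 h05 cluster-dlm[12063]: dlm_process_node: "
    "Removed inactive node 739512321"
)

fence_to_dlm = (dlm_removed - fenced).total_seconds()
down_to_dlm = (dlm_removed - node_down).total_seconds()
print(f"fenced -> DLM removal: {fence_to_dlm:.0f} s")
print(f"node down -> DLM removal: {down_to_dlm:.0f} s")
```

This only measures how long it took until DLM logged the removal; it says nothing about how long the filesystem stayed blocked afterwards.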

My point is this: For a resource that can only run exclusively on one node, it's important that the other node is down before taking action. But with cLVM and OCFS2 the resources can run concurrently on every node, so I don't see why every node virtually freezes until STONITH has completed.
If you have a large cluster (maybe 100 nodes), OCFS2 will be unavailable most of the time if any node fails.

Assuming node h01 was still alive when communication failed, wouldn't quorum prevent h01 from doing anything with DLM and OCFS2 anyway?
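
The quorum argument can be sketched as a simple majority rule. This is only an illustration under stated assumptions: one vote per node, strict majority required, and a hypothetical cluster of three nodes (the actual cluster size and quorum policy are not given here). The function name has_quorum is invented for the example and is not a Pacemaker or corosync API.

```python
def has_quorum(partition_votes, total_votes):
    # Simple majority rule (an assumption): a partition is quorate only
    # if it holds strictly more than half of all votes.
    return 2 * partition_votes > total_votes

TOTAL_NODES = 3  # hypothetical cluster size, not taken from the logs

# h01 isolated on its own versus the remaining nodes together.
isolated_h01 = has_quorum(1, TOTAL_NODES)
remaining = has_quorum(TOTAL_NODES - 1, TOTAL_NODES)
print("isolated h01 quorate:", isolated_h01)
print("remaining nodes quorate:", remaining)
```

Under these assumptions an isolated h01 would not be quorate while the rest of the cluster would be. Whether a non-quorate node really stops all DLM and OCFS2 I/O in time is exactly the open question.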

Regards,
Ulrich