[ClusterLabs] Any CLVM/DLM users around?

Mon Oct 1 16:55:07 UTC 2018

>
> Fencing in clustering is always required, but unlike pacemaker that lets
> you turn it off and take your chances, DLM doesn't.

As a matter of fact, DLM has a setting "enable_fencing=0|1" for what that's
worth.

> You must have
> working fencing for DLM (and anything using it) to function correctly.
>

We do have fencing enabled in the cluster; we've tested both node-level
fencing and resource fencing. DLM behaved identically in both scenarios
until we set 'enable_fencing=0' in the dlm.conf file.
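
For reference, here is a minimal sketch of that change. The path is an
assumption (the usual default location for dlm.conf); the setting itself is
the one named above, and dlm_controld presumably needs a restart to pick it
up.

```conf
# /etc/dlm/dlm.conf  (assumed default location)
# Stop dlm_controld from issuing its own fence requests and leave
# fencing to Pacemaker. Whether this is safe depends on what uses DLM.
enable_fencing=0
```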

> Basically, cluster config changes (node declared lost), dlm informed and
> blocks, fence attempt begins and loops until it succeeds, on success,
> informs DLM, dlm reaps locks held by the lost node and normal operation
> continues.
>
This isn't quite what I was seeing in the logs.  The "failed" node would be
fenced off, pacemaker appeared to be sane, reporting services running on
the running nodes, but once the failed node was seen as missing by dlm
(dlm_controld), dlm would request fencing, from what I can tell by the log
entry.  Here is an example of the suspect log entry:
Sep 26 09:41:35 pcmk-test-1 dlm_controld[837]: 38 fence request 2 pid 1446 startup time 1537969264 fence_all dlm_stonith
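
To make the fields of that entry easier to pick out, here is a small Python
sketch that splits a "fence request" line into named parts. The field names
(uptime, nodeid, agent_pid, reason, device, agent) are only my reading of
the line, not something taken from the DLM documentation, so treat them as
assumptions.

```python
import re

# Field names are my own interpretation of the log line (assumptions),
# not names documented by dlm_controld.
FENCE_REQUEST = re.compile(
    r"dlm_controld\[(?P<daemon_pid>\d+)\]:\s+"
    r"(?P<uptime>\d+)\s+fence request\s+(?P<nodeid>\d+)\s+"
    r"pid\s+(?P<agent_pid>\d+)\s+(?P<reason>\S+)\s+"
    r"time\s+(?P<time>\d+)\s+(?P<device>\S+)\s+(?P<agent>\S+)"
)


def parse_fence_request(line):
    """Return the fields of a 'fence request' line as a dict, or None."""
    match = FENCE_REQUEST.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    # Convert the purely numeric fields to integers.
    for key in ("daemon_pid", "uptime", "nodeid", "agent_pid", "time"):
        fields[key] = int(fields[key])
    return fields


line = (
    "Sep 26 09:41:35 pcmk-test-1 dlm_controld[837]: 38 fence request 2 "
    "pid 1446 startup time 1537969264 fence_all dlm_stonith"
)
print(parse_fence_request(line))
```

Read this way, the request targets node 2, is tagged "startup", and is
handed to dlm_stonith.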

> This isn't a question of node count or other configuration concerns.
> It's simply that you must have proper fencing for DLM.

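For my own understanding, the sequence described in the quoted text (DLM
blocks, fencing is retried until it succeeds, the lost node's locks are
reaped, normal operation resumes) can be written as a toy Python model. This
is only a sketch of my reading of that description, not DLM's actual
implementation; the function name, the lock table, and the try_fence
callback are all hypothetical.

```python
def recover_from_node_loss(lost_node, locks, try_fence, max_attempts=None):
    """Toy model of the described recovery; returns (remaining_locks, attempts).

    locks maps a resource name to the node id holding it (hypothetical).
    try_fence is a callable returning True once fencing has succeeded.
    """
    attempts = 0
    # Lock activity stays blocked for as long as this loop runs.
    while True:
        attempts += 1
        if try_fence(lost_node):
            break
        if max_attempts is not None and attempts >= max_attempts:
            raise RuntimeError("still blocked: fencing has not succeeded")
    # Fencing succeeded: drop every lock held by the lost node.
    remaining = {res: owner for res, owner in locks.items() if owner != lost_node}
    return remaining, attempts


# Hypothetical example: fencing fails twice, then succeeds.
outcomes = iter([False, False, True])
locks = {"vg0": 1, "vg1": 2}
remaining, attempts = recover_from_node_loss(2, locks, lambda node: next(outcomes))
print(remaining, attempts)
```

The point of the model is that nothing after the loop runs until a fence
attempt reports success, which is why a fence request that never completes
would leave the lockspace blocked.
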
Can you speak more to what "proper fencing" is for DLM?

Best,
-Pat

On Mon, Oct 1, 2018 at 12:30 PM Digimer <lists at alteeve.ca> wrote:

> On 2018-10-01 12:04 PM, Ferenc Wágner wrote:
> > Patrick Whitney <pwhitney at luminoso.com> writes:
> >
> >> I have a two node (test) cluster running corosync/pacemaker with DLM
> >> and CLVM.
> >>
> >> I was running into an issue where when one node failed, the remaining node
> >> would appear to do the right thing, from the pcmk perspective, that is.
> >> It would create a new cluster (of one) and fence the other node, but
> >> then, rather surprisingly, DLM would see the other node offline, and it
> >> would go offline itself, abandoning the lockspace.
> >>
> >> I changed my DLM settings to "enable_fencing=0", disabling DLM fencing, and
> >> our tests are now working as expected.
> >
> > I'm running a larger Pacemaker cluster with standalone DLM + cLVM (that
> > is, they are started by systemd, not by Pacemaker).  I've seen weird DLM
> > fencing behavior, but not what you describe above (though I ran with
> > more than two nodes from the very start).  Actually, I don't even
> > understand how it occurred to you to disable DLM fencing to fix that...
>
> Fencing in clustering is always required, but unlike pacemaker that lets
> you turn it off and take your chances, DLM doesn't. You must have
> working fencing for DLM (and anything using it) to function correctly.
>
> Basically, cluster config changes (node declared lost), dlm informed and
> blocks, fence attempt begins and loops until it succeeds, on success,
> informs DLM, dlm reaps locks held by the lost node and normal operation
> continues.
>
> This isn't a question of node count or other configuration concerns.
> It's simply that you must have proper fencing for DLM.
>
> >> I'm a little concerned I have masked an issue by doing this, as in all
> >> of the tutorials and docs I've read, there is no mention of having to
> >> configure DLM whatsoever.
> >
> > Unfortunately it's very hard to come by any reliable info about DLM.  I
> > had a couple of enlightening exchanges with David Teigland (its primary
> > author) on this list, he is very helpful indeed, but I'm still very far
> > from having a working understanding of it.
> >
> > But I've been running with --enable_fencing=0 for years without issues,
> > leaving all fencing to Pacemaker.  Note that manual cLVM operations are
> > the only users of DLM here, so delayed fencing does not cause any
> > problems, the cluster services do not depend on DLM being operational (I
> > mean it can stay frozen for several days -- as it happened in a couple
> > of pathological cases).  GFS2 would be a very different thing, I guess.
> >
>
>
> --
> Digimer
> Papers and Projects: https://alteeve.com/w/
> "I am, somehow, less interested in the weight and convolutions of
> Einstein’s brain than in the near certainty that people of equal talent
> have lived and died in cotton fields and sweatshops." - Stephen Jay Gould
>

-- 
Patrick Whitney
DevOps Engineer -- Tools