[ClusterLabs] [Linux-cluster] DLM won't (stay) running
Jason Gauthier
jagauthier at gmail.com
Wed May 9 06:51:03 EDT 2018
On Wed, May 9, 2018 at 6:26 AM, Andrew Price <anprice at redhat.com> wrote:
> [linux-cluster@ isn't really used nowadays; CCing users at clusterlabs]
>
> On 08/05/18 12:18, Jason Gauthier wrote:
>>
>> Greetings,
>>
>> I'm working on a setup of a two-node cluster with shared storage.
>> I've been able to see the storage on both nodes, and I have appropriate
>> configuration for fencing the block device.
>>
>> The next step was getting DLM and GFS2 in a clone group to mount the
>> FS on both nodes. This is where I am running into trouble.
>>
>> As far as the OS goes, it's debian. I'm using pacemaker, corosync,
>> and crm for cluster management.
>
>
> Is it safe to assume that you're using Debian Wheezy? (The need for
> gfs_controld disappeared in the 3.3 kernel.) As wheezy goes end-of-life at
> the end of the month, I would suggest upgrading; you will likely find the
> cluster tools more user friendly and the components more stable.
I am using stretch, which was the challenge at first. I couldn't find
any information about it.
Even a release as new as Jessie contains gfs_controld. I could not figure
out how to make it work.
But, yeah, that is now removed, because it works fine without it.
And the good news is: I messed around with this for quite some time
last night and finally got everything to come up reliably on both nodes,
even across reboots and simultaneous reboots.
So, I am pleased! Time for the next part which is building some VMs.
Thanks for the help!
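
For anyone who finds this thread later, here is a minimal sketch of a
DLM-only clone without gfs_controld. It is an illustration under
assumptions, not necessarily the exact config that is running here. It
assumes the "args" parameter of ocf:pacemaker:controld is the way to hand
options such as -K to dlm_controld (as a "params" entry rather than under
"meta"), and the clone name cl_dlm is made up for the example.

```
# Sketch only: DLM clone for a kernel that no longer needs gfs_controld.
# Assumption: "args" is passed through to dlm_controld by the agent.
# Setting it replaces the agent's default arguments, so check the
# agent metadata (crm ra info ocf:pacemaker:controld) before relying on it.
primitive p_dlm_controld ocf:pacemaker:controld \
        params args="-K" \
        op monitor interval=60 timeout=60
# cl_dlm is a hypothetical name; interleave=true as in the config below.
clone cl_dlm p_dlm_controld \
        meta interleave=true target-role=Started
```

A GFS2 Filesystem resource would then be ordered after and colocated with
this clone.
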
>> At the moment, I've removed the gfs2 parts just to try and get dlm
>> working.
>>
>> My current config looks like this:
>>
>> node 1084772368: alpha
>> node 1084772369: beta
>> primitive p_dlm_controld ocf:pacemaker:controld \
>> op monitor interval=60 timeout=60 \
>> meta target-role=Started args=-K
>> primitive p_gfs_controld ocf:pacemaker:controld \
>> params daemon=gfs_controld \
>> meta target-role=Started
>> primitive stonith_sbd stonith:external/sbd \
>> params pcmk_delay_max=30 sbd_device="/dev/sdb1"
>> group g_gfs2 p_dlm_controld p_gfs_controld
>> clone cl_gfs2 g_gfs2 \
>> meta interleave=true target-role=Started
>> property cib-bootstrap-options: \
>> have-watchdog=false \
>> dc-version=1.1.16-94ff4df \
>> cluster-infrastructure=corosync \
>> cluster-name=zeta \
>> last-lrm-refresh=1525523370 \
>> stonith-enabled=true \
>> stonith-timeout=20s
>>
>> When I bring the resources up, I get a quick blip in my logs.
>> May 8 07:13:58 beta dlm_controld[9425]: 253556 dlm_controld 4.0.7 started
>> May 8 07:14:00 beta kernel: [253558.641658] dlm: closing connection to node 1084772369
>> May 8 07:14:00 beta kernel: [253558.641764] dlm: closing connection to node 1084772368
>>
>>
>> This is the same messaging I see when I run dlm manually and then stop
>> it. My challenge here is that I cannot find out what dlm is doing.
>> I've tried adding -K to /etc/default/dlm, but I don't think that file
>> is being respected. I would like to figure out how to increase the
>> verbose output of dlm_controld so I can see why it won't stay running
>> when it's launched through the cluster. I haven't been able to
>> figure out how to pass arguments directly to the daemon in the
>> primitive config, if it's even possible. Otherwise, I would try to
>> pass -K there.
>>
>> Thanks!
>>
>> Jason
>>
>