[ClusterLabs] DLM not working on my GFS2/pacemaker cluster

daniel at benoy.name daniel at benoy.name
Tue Jan 19 10:20:32 EST 2016


http://pastebin.com/qaeEAFWz

On 2016-01-19 09:49, emmanuel segura wrote:
> dlm_tool dump ?
> 
> 2016-01-19 15:25 GMT+01:00  <daniel at benoy.name>:
>> Yes, fencing is working, and SELinux is disabled.
>> 
>> What configuration details do you require?
>> 
>> Here's my corosync.conf: http://pastebin.com/SD1Gbdj0
>> Here's my output from 'crm configure show': http://pastebin.com/eAiq2yJ9
>> 
>> Another cluster is running fine with an identical configuration.
>> 
>> On 2016-01-19 03:49, emmanuel segura wrote:
>>> 
>>> please share your cluster config and say if your fencing is working.
>>> 
>>> 2016-01-19 3:47 GMT+01:00  <daniel at benoy.name>:
>>>> 
>>>> One of my clusters is having a problem. It's no longer able to set up its
>>>> GFS2 mounts. I've narrowed the problem down a bit. Here's the output when I
>>>> try to start the DLM daemon (normally this is something corosync/pacemaker
>>>> starts up for me, but here it is on the command line for the debug output):
>>>>   # dlm_controld -D -q 0
>>>>   4561 dlm_controld 4.0.1 started
>>>>   4561 our_nodeid 168528918
>>>>   4561 found /dev/misc/dlm-control minor 56
>>>>   4561 found /dev/misc/dlm-monitor minor 55
>>>>   4561 found /dev/misc/dlm_plock minor 54
>>>>   4561 /sys/kernel/config/dlm/cluster/comms: opendir failed: 2
>>>>   4561 /sys/kernel/config/dlm/cluster/spaces: opendir failed: 2
>>>>   4561 cmap totem.rrp_mode = 'none'
>>>>   4561 set protocol 0
>>>>   4561 set recover_callbacks 1
>>>>   4561 cmap totem.cluster_name = 'cwwba'
>>>>   4561 set cluster_name cwwba
>>>>   4561 /dev/misc/dlm-monitor fd 11
>>>>   4561 cluster quorum 1 seq 672 nodes 2
>>>>   4561 cluster node 168528918 added seq 672
>>>>   4561 set_configfs_node 168528918 10.11.140.22 local 1
>>>>   4561 /sys/kernel/config/dlm/cluster/comms/168528918/addr: open failed: 1
>>>>   4561 cluster node 168528919 added seq 672
>>>>   4561 set_configfs_node 168528919 10.11.140.23 local 0
>>>>   4561 /sys/kernel/config/dlm/cluster/comms/168528919/addr: open failed: 1
>>>>   4561 cpg_join dlm:controld ...
>>>>   4561 setup_cpg_daemon 13
>>>>   4561 dlm:controld conf 1 1 0 memb 168528918 join 168528918 left
>>>>   4561 daemon joined 168528918
>>>>   4561 fence work wait for cluster ringid
>>>>   4561 dlm:controld ring 168528918:672 2 memb 168528918 168528919
>>>>   4561 fence_in_progress_unknown 0 startup
>>>>   4561 receive_protocol 168528918 max 3.1.1.0 run 0.0.0.0
>>>>   4561 daemon node 168528918 prot max 0.0.0.0 run 0.0.0.0
>>>>   4561 daemon node 168528918 save max 3.1.1.0 run 0.0.0.0
>>>>   4561 set_protocol member_count 1 propose daemon 3.1.1 kernel 1.1.1
>>>>   4561 receive_protocol 168528918 max 3.1.1.0 run 3.1.1.0
>>>>   4561 daemon node 168528918 prot max 3.1.1.0 run 0.0.0.0
>>>>   4561 daemon node 168528918 save max 3.1.1.0 run 3.1.1.0
>>>>   4561 run protocol from nodeid 168528918
>>>>   4561 daemon run 3.1.1 max 3.1.1 kernel run 1.1.1 max 1.1.1
>>>>   4561 plocks 14
>>>>   4561 receive_protocol 168528918 max 3.1.1.0 run 3.1.1.0
>>>> 
>>>> As you can see, it's trying to configure the node addresses, but it's unable
>>>> to write to the 'addr' file under the /sys/kernel/config configfs tree (see
>>>> the 'open failed' lines above). I have no idea why. dmesg isn't saying
>>>> anything. Nothing is telling me why it doesn't want me writing there. And I
>>>> can confirm this behavior on the prompt as well.
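
[Aside on reading the log above: the number after "failed:" looks like a raw
errno value. That is an assumption, but it is consistent with the shell output
quoted further down, since errno 1 is EPERM ("Operation not permitted") and
errno 2 is ENOENT ("No such file or directory"). A minimal Python sketch for
decoding such lines follows. The helper name decode_failures is purely
illustrative, and the pattern assumes each log entry sits on a single line.]

```python
import errno
import os
import re

# Matches debug lines of the form:
#   4561 /sys/kernel/config/dlm/cluster/comms/168528918/addr: open failed: 1
# Assumption: the trailing integer is an errno value.
FAILED_RE = re.compile(
    r"^\s*\d+\s+(?P<path>\S+):\s+(?P<call>\w+) failed:\s*(?P<code>\d+)\s*$"
)


def decode_failures(log_text):
    """Return (path, call, errno_name, message) for each 'failed: N' line."""
    results = []
    for line in log_text.splitlines():
        match = FAILED_RE.match(line)
        if match is None:
            continue  # not a failure line
        code = int(match.group("code"))
        name = errno.errorcode.get(code, "UNKNOWN")
        results.append(
            (match.group("path"), match.group("call"), name, os.strerror(code))
        )
    return results
```

[Feeding it the saved debug output prints one tuple per failing call, which
makes it easier to see at a glance that the opendir failures and the open
failures carry different error codes.]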
>>>> 
>>>> Trying to start CLVM results in complaints about the node not having an
>>>> address set, which makes sense given the
>>>> 
>>>> Here's the exact same command run twice. First, on a very similarly
>>>> configured cluster (which is currently running):
>>>>   # cat /sys/kernel/config/dlm/cluster/comms/169446438/addr
>>>>   cat: /sys/kernel/config/dlm/cluster/comms/169446438/addr: Permission denied
>>>> (That's what I expect to see. It's a write-only file.)
>>>> 
>>>> And now on this messed up cluster:
>>>>   # cat /sys/kernel/config/dlm/cluster/comms/168528918/addr
>>>>   cat: /sys/kernel/config/dlm/cluster/comms/168528918/addr: Operation not permitted
>>>> 
>>>> Why 'operation not permitted'? dmesg isn't telling me anything at all, and I
>>>> don't see any way to get the kernel to spit out some kind of explanation for
>>>> why it's blocking me. Can anyone help? At least point me in a direction
>>>> where I can get the system to give me some indication why it's behaving this
>>>> way?
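
[Aside: the two messages from cat correspond to different errno values on
Linux. "Permission denied" is EACCES and "Operation not permitted" is EPERM.
The sketch below repeats the same experiment from Python and reports the
symbolic errno name, which can be easier to compare across the two clusters
than the message text. The function name probe is hypothetical, and the paths
in the usage part are simply the ones quoted in this thread. It only reports
which error the kernel returns, not why.]

```python
import errno
import os


def probe(path, flags=os.O_RDONLY):
    """Try to open path; return the symbolic errno name, or 'OK' on success."""
    try:
        fd = os.open(path, flags)
    except OSError as exc:
        # Fall back to the raw number if the code has no symbolic name.
        return errno.errorcode.get(exc.errno, str(exc.errno))
    os.close(fd)
    return "OK"


if __name__ == "__main__":
    # Node ID taken from the post; adjust for the node being checked.
    addr = "/sys/kernel/config/dlm/cluster/comms/168528918/addr"
    print("read :", probe(addr, os.O_RDONLY))
    print("write:", probe(addr, os.O_WRONLY))
```

[Running it as root on both the working and the broken cluster gives a direct
side-by-side of the read and write results for the same kind of file.]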
>>>> 
>>>> I'm running Ubuntu 14.04, and I've posted this on the Ubuntu forums as well:
>>>> http://ubuntuforums.org/showthread.php?t=2310383
>>>> 
>>>> 
>> 
>> 
