[ClusterLabs] Odd clvmd error - clvmd: Unable to create DLM lockspace for CLVM: Address already in use
Christine Caulfield
ccaulfie@redhat.com
Fri Sep 25 07:44:25 UTC 2015
On 25/09/15 00:09, Digimer wrote:
> I had a RHEL 6.7, cman + rgmanager cluster that I've built many times
> before. Oddly, I just hit this error:
>
> ====
> [root@node2 ~]# /etc/init.d/clvmd start
> Starting clvmd: clvmd could not connect to cluster manager
> Consult syslog for more information
> ====
>
> syslog:
> ====
> Sep 24 23:00:30 node2 kernel: dlm: Using SCTP for communications
> Sep 24 23:00:30 node2 clvmd: Unable to create DLM lockspace for CLVM:
> Address already in use
> Sep 24 23:00:30 node2 kernel: dlm: Can't bind to port 21064 addr number 1
This seems to be the key to it. I can't imagine what else would be using
port 21064 (apart from DLM using TCP as well as SCTP, but I don't think
that's possible!)
Have a look in netstat and see what else is using that port.
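Something along these lines should show it (a sketch; port 21064 is taken from the error above, and the /proc/net/sctp paths assume the sctp kernel module is loaded):

```shell
# Look for anything already bound to the DLM port (21064, per the log above).
PORT=21064

# TCP/UDP sockets, including lingering TIME_WAIT entries that can block a rebind:
netstat -an 2>/dev/null | grep ":$PORT" || echo "no TCP/UDP socket on port $PORT"

# DLM is using SCTP here, which 'netstat -t' will not show; SCTP endpoints
# and associations appear under /proc/net/sctp/ when the module is loaded:
grep -w "$PORT" /proc/net/sctp/eps /proc/net/sctp/assocs 2>/dev/null \
    || echo "no SCTP endpoint on port $PORT"
```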
It could be that the socket was in use and is taking a while to shut
down so it might go away on its own too.
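If it is just a socket still draining, a small polling loop (a sketch; the 5-second interval and ~2-minute cutoff are arbitrary) shows when it is safe to retry clvmd:

```shell
# Poll until nothing is bound to the DLM port any more, then retry clvmd.
PORT=21064
tries=0
while netstat -an 2>/dev/null | grep -q ":$PORT" && [ $tries -lt 24 ]; do
    # Still bound (possibly TIME_WAIT); wait and check again, up to ~2 minutes.
    tries=$((tries + 1))
    sleep 5
done
echo "port $PORT looks free -- retry with: /etc/init.d/clvmd start"
```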
Chrissie
> Sep 24 23:00:30 node2 kernel: dlm: cannot start dlm lowcomms -98
> ====
>
> There are no iptables rules:
>
> ====
> [root@node2 ~]# iptables-save
> ====
>
> And there are no DLM lockspaces, either:
>
> ====
> [root@node2 ~]# dlm_tool ls
> [root@node2 ~]#
> ====
>
> I tried withdrawing the node from the cluster entirely, then started cman
> alone and tried to start clvmd; same issue.
>
> Pinging between the two nodes seems OK:
>
> ====
> [root@node1 ~]# uname -n
> node1.ccrs.bcn
> [root@node1 ~]# ping -c 2 node1.ccrs.bcn
> PING node1.bcn (10.20.10.1) 56(84) bytes of data.
> 64 bytes from node1.bcn (10.20.10.1): icmp_seq=1 ttl=64 time=0.015 ms
> 64 bytes from node1.bcn (10.20.10.1): icmp_seq=2 ttl=64 time=0.017 ms
>
> --- node1.bcn ping statistics ---
> 2 packets transmitted, 2 received, 0% packet loss, time 1000ms
> rtt min/avg/max/mdev = 0.015/0.016/0.017/0.001 ms
> ====
> [root@node2 ~]# uname -n
> node2.ccrs.bcn
> [root@node2 ~]# ping -c 2 node1.ccrs.bcn
> PING node1.bcn (10.20.10.1) 56(84) bytes of data.
> 64 bytes from node1.bcn (10.20.10.1): icmp_seq=1 ttl=64 time=0.079 ms
> 64 bytes from node1.bcn (10.20.10.1): icmp_seq=2 ttl=64 time=0.076 ms
>
> --- node1.bcn ping statistics ---
> 2 packets transmitted, 2 received, 0% packet loss, time 999ms
> rtt min/avg/max/mdev = 0.076/0.077/0.079/0.008 ms
> ====
>
> I have RRP configured and pings work on the second network, too:
>
> ====
> [root@node1 ~]# corosync-objctl |grep ring -A 5
> totem.interface.ringnumber=0
> totem.interface.bindnetaddr=10.20.10.1
> totem.interface.mcastaddr=239.192.100.163
> totem.interface.mcastport=5405
> totem.interface.member.memberaddr=node1.ccrs.bcn
> totem.interface.member.memberaddr=node2.ccrs.bcn
> totem.interface.ringnumber=1
> totem.interface.bindnetaddr=10.10.10.1
> totem.interface.mcastaddr=239.192.100.164
> totem.interface.mcastport=5405
> totem.interface.member.memberaddr=node1.sn
> totem.interface.member.memberaddr=node2.sn
>
> [root@node1 ~]# ping -c 2 node2.sn
> PING node2.sn (10.10.10.2) 56(84) bytes of data.
> 64 bytes from node2.sn (10.10.10.2): icmp_seq=1 ttl=64 time=0.111 ms
> 64 bytes from node2.sn (10.10.10.2): icmp_seq=2 ttl=64 time=0.120 ms
>
> --- node2.sn ping statistics ---
> 2 packets transmitted, 2 received, 0% packet loss, time 999ms
> rtt min/avg/max/mdev = 0.111/0.115/0.120/0.011 ms
> ====
> [root@node2 ~]# ping -c 2 node1.sn
> PING node1.sn (10.10.10.1) 56(84) bytes of data.
> 64 bytes from node1.sn (10.10.10.1): icmp_seq=1 ttl=64 time=0.079 ms
> 64 bytes from node1.sn (10.10.10.1): icmp_seq=2 ttl=64 time=0.171 ms
>
> --- node1.sn ping statistics ---
> 2 packets transmitted, 2 received, 0% packet loss, time 1000ms
> rtt min/avg/max/mdev = 0.079/0.125/0.171/0.046 ms
> ====
>
> Here is the cluster.conf:
>
> ====
> [root@node1 ~]# cat /etc/cluster/cluster.conf
> <?xml version="1.0"?>
> <cluster name="ccrs" config_version="1">
> <cman expected_votes="1" two_node="1" transport="udpu" />
> <clusternodes>
> <clusternode name="node1.ccrs.bcn" nodeid="1">
> <altname name="node1.sn" />
> <fence>
> <method name="ipmi">
> <device name="ipmi_n01" ipaddr="10.250.199.15" login="admin"
> passwd="secret" delay="15" action="reboot" />
> </method>
> <method name="pdu">
> <device name="pdu01" port="1" action="reboot" />
> <device name="pdu02" port="1" action="reboot" />
> </method>
> </fence>
> </clusternode>
> <clusternode name="node2.ccrs.bcn" nodeid="2">
> <altname name="node2.sn" />
> <fence>
> <method name="ipmi">
> <device name="ipmi_n02" ipaddr="10.250.199.17" login="admin"
> passwd="secret" action="reboot" />
> </method>
> <method name="pdu">
> <device name="pdu01" port="2" action="reboot" />
> <device name="pdu02" port="2" action="reboot" />
> </method>
> </fence>
> </clusternode>
> </clusternodes>
> <fencedevices>
> <fencedevice name="ipmi_n01" agent="fence_ipmilan" />
> <fencedevice name="ipmi_n02" agent="fence_ipmilan" />
> <fencedevice name="pdu01" agent="fence_raritan_snmp" ipaddr="pdu1A" />
> <fencedevice name="pdu02" agent="fence_raritan_snmp" ipaddr="pdu1B" />
> <fencedevice name="pdu03" agent="fence_raritan_snmp" ipaddr="pdu2A" />
> <fencedevice name="pdu04" agent="fence_raritan_snmp" ipaddr="pdu2B" />
> </fencedevices>
> <fence_daemon post_join_delay="30" />
> <totem rrp_mode="passive" secauth="off"/>
> <rm log_level="5">
> <resources>
> <script file="/etc/init.d/drbd" name="drbd"/>
> <script file="/etc/init.d/wait-for-drbd" name="wait-for-drbd"/>
> <script file="/etc/init.d/clvmd" name="clvmd"/>
> <clusterfs device="/dev/node1_vg0/shared" force_unmount="1"
> fstype="gfs2" mountpoint="/shared" name="sharedfs" />
> <script file="/etc/init.d/libvirtd" name="libvirtd"/>
> </resources>
> <failoverdomains>
> <failoverdomain name="only_n01" nofailback="1" ordered="0"
> restricted="1">
> <failoverdomainnode name="node1.ccrs.bcn"/>
> </failoverdomain>
> <failoverdomain name="only_n02" nofailback="1" ordered="0"
> restricted="1">
> <failoverdomainnode name="node2.ccrs.bcn"/>
> </failoverdomain>
> <failoverdomain name="primary_n01" nofailback="1" ordered="1"
> restricted="1">
> <failoverdomainnode name="node1.ccrs.bcn" priority="1"/>
> <failoverdomainnode name="node2.ccrs.bcn" priority="2"/>
> </failoverdomain>
> <failoverdomain name="primary_n02" nofailback="1" ordered="1"
> restricted="1">
> <failoverdomainnode name="node1.ccrs.bcn" priority="2"/>
> <failoverdomainnode name="node2.ccrs.bcn" priority="1"/>
> </failoverdomain>
> </failoverdomains>
> <service name="storage_n01" autostart="1" domain="only_n01"
> exclusive="0" recovery="restart">
> <script ref="drbd">
> <script ref="wait-for-drbd">
> <script ref="clvmd">
> <clusterfs ref="sharedfs"/>
> </script>
> </script>
> </script>
> </service>
> <service name="storage_n02" autostart="1" domain="only_n02"
> exclusive="0" recovery="restart">
> <script ref="drbd">
> <script ref="wait-for-drbd">
> <script ref="clvmd">
> <clusterfs ref="sharedfs"/>
> </script>
> </script>
> </script>
> </service>
> <service name="libvirtd_n01" autostart="1" domain="only_n01"
> exclusive="0" recovery="restart">
> <script ref="libvirtd"/>
> </service>
> <service name="libvirtd_n02" autostart="1" domain="only_n02"
> exclusive="0" recovery="restart">
> <script ref="libvirtd"/>
> </service>
> </rm>
> </cluster>
> ====
>
> Nothing special there at all.
>
> While writing this email though, I saw this on the other node:
>
> ====
> Sep 24 23:03:39 node1 corosync[4770]: [TOTEM ] Retransmit List: 14e
> Sep 24 23:03:39 node1 corosync[4770]: [TOTEM ] Retransmit List: 14e
> Sep 24 23:03:49 node1 corosync[4770]: [TOTEM ] Retransmit List: 158
> Sep 24 23:03:49 node1 corosync[4770]: [TOTEM ] Retransmit List: 15a
> Sep 24 23:03:49 node1 corosync[4770]: [TOTEM ] Retransmit List: 15a
> Sep 24 23:03:59 node1 corosync[4770]: [TOTEM ] Retransmit List: 161
> Sep 24 23:03:59 node1 corosync[4770]: [TOTEM ] Retransmit List: 161
> Sep 24 23:03:59 node1 corosync[4770]: [TOTEM ] Retransmit List: 161
> Sep 24 23:03:59 node1 corosync[4770]: [TOTEM ] Retransmit List: 161 163
> Sep 24 23:03:59 node1 corosync[4770]: [TOTEM ] Retransmit List: 163
> Sep 24 23:04:19 node1 corosync[4770]: [TOTEM ] Retransmit List: 177
> Sep 24 23:04:19 node1 corosync[4770]: [TOTEM ] Retransmit List: 177
> Sep 24 23:04:19 node1 corosync[4770]: [TOTEM ] Retransmit List: 179
> Sep 24 23:04:19 node1 corosync[4770]: [TOTEM ] Retransmit List: 179
> Sep 24 23:04:29 node1 corosync[4770]: [TOTEM ] Retransmit List: 181
> Sep 24 23:04:29 node1 corosync[4770]: [TOTEM ] Retransmit List: 181
> Sep 24 23:04:29 node1 corosync[4770]: [TOTEM ] Retransmit List: 181
> Sep 24 23:04:29 node1 corosync[4770]: [TOTEM ] Retransmit List: 183
> Sep 24 23:04:29 node1 corosync[4770]: [TOTEM ] Retransmit List: 183
> Sep 24 23:04:39 node1 corosync[4770]: [TOTEM ] Retransmit List: 18c
> Sep 24 23:04:39 node1 corosync[4770]: [TOTEM ] Retransmit List: 18c
> Sep 24 23:04:39 node1 corosync[4770]: [TOTEM ] Retransmit List: 18c 18e
> Sep 24 23:04:39 node1 corosync[4770]: [TOTEM ] Retransmit List: 18e
> Sep 24 23:07:20 node1 corosync[4770]: [TOTEM ] Retransmit List: 23c
> Sep 24 23:07:20 node1 corosync[4770]: [TOTEM ] Retransmit List: 23c
> Sep 24 23:07:20 node1 corosync[4770]: [TOTEM ] Retransmit List: 23c
> Sep 24 23:07:20 node1 corosync[4770]: [TOTEM ] Retransmit List: 23e
> Sep 24 23:07:20 node1 corosync[4770]: [TOTEM ] Retransmit List: 23e
> Sep 24 23:07:30 node1 corosync[4770]: [TOTEM ] Retransmit List: 247
> Sep 24 23:07:30 node1 corosync[4770]: [TOTEM ] Retransmit List: 247
> Sep 24 23:07:30 node1 corosync[4770]: [TOTEM ] Retransmit List: 249
> Sep 24 23:07:30 node1 corosync[4770]: [TOTEM ] Retransmit List: 24b
> Sep 24 23:07:30 node1 corosync[4770]: [TOTEM ] Retransmit List: 24b
> Sep 24 23:07:40 node1 corosync[4770]: [TOTEM ] Retransmit List: 252
> Sep 24 23:07:40 node1 corosync[4770]: [TOTEM ] Retransmit List: 252
> Sep 24 23:07:40 node1 corosync[4770]: [TOTEM ] Retransmit List: 254
> Sep 24 23:07:40 node1 corosync[4770]: [TOTEM ] Retransmit List: 254
> ====
>
> Certainly *looks* like a network problem, but I can't see what's
> wrong... Any ideas?
>
> Thanks!
>
More information about the Users mailing list