[ClusterLabs] Odd clvmd error - clvmd: Unable to create DLM lockspace for CLVM: Address already in use

Fri Sep 25 13:31:21 UTC 2015

On 25/09/15 03:44 AM, Christine Caulfield wrote:
> On 25/09/15 00:09, Digimer wrote:
>> I had a RHEL 6.7, cman + rgmanager cluster that I've built many times
>> before. Oddly, I just hit this error:
>>
>> ====
>> [root@node2 ~]# /etc/init.d/clvmd start
>> Starting clvmd: clvmd could not connect to cluster manager
>> Consult syslog for more information
>> ====
>>
>> syslog:
>> ====
>> Sep 24 23:00:30 node2 kernel: dlm: Using SCTP for communications
>> Sep 24 23:00:30 node2 clvmd: Unable to create DLM lockspace for CLVM:
>> Address already in use
>> Sep 24 23:00:30 node2 kernel: dlm: Can't bind to port 21064 addr number 1
> 
> This seems to be the key to it. I can't imagine what else would be using
> port 21064 (apart from DLM using TCP as well as SCTP but I don't think
> that's possible!)
> 
> Have a look in netstat and see what else is using that port.
> 
> It could be that the socket was in use and is taking a while to shut
> down so it might go away on its own too.
> 
> Chrissie

netstat and lsof showed nothing using it. According to our installer's
logs, it had manually started drbd + clvmd + gfs2 just fine. It then
asked rgmanager to start, and rgmanager tried to tear down the already
running storage services before (re)starting them. It looks like the
(drbd) UpToDate node was told to stop before the Inconsistent node, and
that stop failed because the Inconsistent node still needed the UpToDate
node (gfs2 was still mounted, so clvmd held the drbd device open). I'm
not clear on what happened next, but things went sideways.
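
For reference, a clean teardown should run in the reverse of the start
order, so the filesystem is unmounted before clvmd stops and clvmd stops
before drbd. A minimal sketch, assuming the three init scripts named
above live under /etc/init.d; the helper names are made up for
illustration, and by default it only prints what it would run:

```shell
#!/bin/sh
# reverse_words: print the arguments in reverse order, space separated.
reverse_words() {
    out=""
    for w in "$@"; do
        out="$w${out:+ $out}"
    done
    printf '%s\n' "$out"
}

# stop_in_reverse: stop services in the reverse of their start order.
# With DRY_RUN=1 (the default here) the commands are only printed.
stop_in_reverse() {
    for svc in $(reverse_words "$@"); do
        if [ "${DRY_RUN:-1}" = "1" ]; then
            echo "/etc/init.d/$svc stop"
        else
            "/etc/init.d/$svc" stop || return 1
        fi
    done
}
```

Calling `stop_in_reverse drbd clvmd gfs2` lists the stop commands for
gfs2, clvmd and drbd, in that order. Setting DRY_RUN=0 would actually
run them and bail out at the first failure.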

So I manually stopped everything, including clvmd/cman, and it still
threw that error. Eventually I rebooted both nodes and it went back to
working.

Odd.
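
If this comes up again, one more place to look before rebooting is
/proc/net. As far as I understand it, DLM opens its sockets inside the
kernel, so lsof (which walks per-process file descriptors) may not be
expected to show them. A rough sketch, assuming the usual
/proc/net/tcp layout (hex local port after the colon in column 2) and
that /proc/net/sctp/eps has an LPORT column when the sctp module is
loaded; the function names are hypothetical:

```shell
#!/bin/sh
# port_in_proc_tcp FILE PORT
# Exit 0 if FILE (in /proc/net/tcp format) lists a socket whose
# local port is PORT (given in decimal).
port_in_proc_tcp() {
    hex=$(printf '%04X' "$2")
    awk -v want="$hex" '
        NR > 1 { split($2, a, ":"); if (toupper(a[2]) == want) found = 1 }
        END { exit(found ? 0 : 1) }
    ' "$1"
}

# port_in_sctp_eps FILE PORT
# Same idea for /proc/net/sctp/eps, where the port is decimal.
# The column is located by its header name rather than by position.
port_in_sctp_eps() {
    awk -v want="$2" '
        NR == 1 { for (i = 1; i <= NF; i++) if ($i == "LPORT") col = i; next }
        col && $col == want { found = 1 }
        END { exit(found ? 0 : 1) }
    ' "$1"
}
```

For example, `port_in_sctp_eps /proc/net/sctp/eps 21064 && echo "still
bound"` would report a leftover SCTP endpoint on the DLM port, and
`port_in_proc_tcp /proc/net/tcp 21064` does the same for TCP.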

>> Sep 24 23:00:30 node2 kernel: dlm: cannot start dlm lowcomms -98
>> ====
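
The -98 in that last kernel line is a negated Linux errno. 98 is
EADDRINUSE, which matches the "Address already in use" text from clvmd.
A small sketch for decoding such lines; the function names are made up
for illustration and the table covers only a handful of socket-related
values:

```shell
#!/bin/sh
# errno_name: map a few socket-related Linux errno values to names.
# Accepts the value with or without a leading minus sign.
errno_name() {
    case "${1#-}" in
        98)  echo EADDRINUSE ;;
        99)  echo EADDRNOTAVAIL ;;
        107) echo ENOTCONN ;;
        110) echo ETIMEDOUT ;;
        111) echo ECONNREFUSED ;;
        *)   echo "unknown($1)" ;;
    esac
}

# lowcomms_errno: read log lines on stdin and print the trailing
# negative number from any "cannot start dlm lowcomms -NN" line.
lowcomms_errno() {
    sed -n 's/.*cannot start dlm lowcomms \(-[0-9][0-9]*\)$/\1/p'
}
```

Something like `errno_name "$(grep lowcomms /var/log/messages | tail -n 1
| lowcomms_errno)"` would then name the most recent failure.
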
>>
>> There are no iptables rules:
>>
>> ====
>> [root@node2 ~]# iptables-save
>> ====
>>
>> And there are no DLM lockspaces, either:
>>
>> ====
>> [root@node2 ~]# dlm_tool ls
>> [root@node2 ~]#
>> ====
>>
>> I tried withdrawing the node from the cluster entirely, then started cman
>> alone and tried to start clvmd; same issue.
>>
>> Pinging between the two nodes seems OK:
>>
>> ====
>> [root@node1 ~]# uname -n
>> node1.ccrs.bcn
>> [root@node1 ~]# ping -c 2 node1.ccrs.bcn
>> PING node1.bcn (10.20.10.1) 56(84) bytes of data.
>> 64 bytes from node1.bcn (10.20.10.1): icmp_seq=1 ttl=64 time=0.015 ms
>> 64 bytes from node1.bcn (10.20.10.1): icmp_seq=2 ttl=64 time=0.017 ms
>>
>> --- node1.bcn ping statistics ---
>> 2 packets transmitted, 2 received, 0% packet loss, time 1000ms
>> rtt min/avg/max/mdev = 0.015/0.016/0.017/0.001 ms
>> ====
>> [root@node2 ~]# uname -n
>> node2.ccrs.bcn
>> [root@node2 ~]# ping -c 2 node1.ccrs.bcn
>> PING node1.bcn (10.20.10.1) 56(84) bytes of data.
>> 64 bytes from node1.bcn (10.20.10.1): icmp_seq=1 ttl=64 time=0.079 ms
>> 64 bytes from node1.bcn (10.20.10.1): icmp_seq=2 ttl=64 time=0.076 ms
>>
>> --- node1.bcn ping statistics ---
>> 2 packets transmitted, 2 received, 0% packet loss, time 999ms
>> rtt min/avg/max/mdev = 0.076/0.077/0.079/0.008 ms
>> ====
>>
>> I have RRP configured and pings work on the second network, too:
>>
>> ====
>> [root@node1 ~]# corosync-objctl | grep ring -A 5
>> totem.interface.ringnumber=0
>> totem.interface.bindnetaddr=10.20.10.1
>> totem.interface.mcastaddr=239.192.100.163
>> totem.interface.mcastport=5405
>> totem.interface.member.memberaddr=node1.ccrs.bcn
>> totem.interface.member.memberaddr=node2.ccrs.bcn
>> totem.interface.ringnumber=1
>> totem.interface.bindnetaddr=10.10.10.1
>> totem.interface.mcastaddr=239.192.100.164
>> totem.interface.mcastport=5405
>> totem.interface.member.memberaddr=node1.sn
>> totem.interface.member.memberaddr=node2.sn
>>
>> [root@node1 ~]# ping -c 2 node2.sn
>> PING node2.sn (10.10.10.2) 56(84) bytes of data.
>> 64 bytes from node2.sn (10.10.10.2): icmp_seq=1 ttl=64 time=0.111 ms
>> 64 bytes from node2.sn (10.10.10.2): icmp_seq=2 ttl=64 time=0.120 ms
>>
>> --- node2.sn ping statistics ---
>> 2 packets transmitted, 2 received, 0% packet loss, time 999ms
>> rtt min/avg/max/mdev = 0.111/0.115/0.120/0.011 ms
>> ====
>> [root@node2 ~]# ping -c 2 node1.sn
>> PING node1.sn (10.10.10.1) 56(84) bytes of data.
>> 64 bytes from node1.sn (10.10.10.1): icmp_seq=1 ttl=64 time=0.079 ms
>> 64 bytes from node1.sn (10.10.10.1): icmp_seq=2 ttl=64 time=0.171 ms
>>
>> --- node1.sn ping statistics ---
>> 2 packets transmitted, 2 received, 0% packet loss, time 1000ms
>> rtt min/avg/max/mdev = 0.079/0.125/0.171/0.046 ms
>> ====
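
The same check can be scripted so that every member address on both
rings gets tested from each node. A minimal sketch, assuming the
corosync-objctl output format shown above; the function names are
hypothetical. Note that a successful ping only shows ICMP reachability,
not that totem traffic is getting through:

```shell
#!/bin/sh
# ring_members: read corosync-objctl output on stdin and print each
# totem.interface.member.memberaddr value, one per line.
ring_members() {
    sed -n 's/^totem\.interface\.member\.memberaddr=//p'
}

# ping_members: read host names on stdin, ping each one once, and
# report the result. Returns non-zero if any host failed.
ping_members() {
    rc=0
    while read -r host; do
        if ping -c 1 -W 2 "$host" >/dev/null 2>&1; then
            echo "ok   $host"
        else
            echo "FAIL $host"
            rc=1
        fi
    done
    return $rc
}
```

Usage would be `corosync-objctl | ring_members | ping_members`, run on
each node in turn.
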
>>
>> Here is the cluster.conf:
>>
>> ====
>> [root@node1 ~]# cat /etc/cluster/cluster.conf
>> <?xml version="1.0"?>
>> <cluster name="ccrs" config_version="1">
>> 	<cman expected_votes="1" two_node="1" transport="udpu" />
>> 	<clusternodes>
>> 		<clusternode name="node1.ccrs.bcn" nodeid="1">
>> 			<altname name="node1.sn" />
>> 			<fence>
>> 				<method name="ipmi">
>> 					<device name="ipmi_n01" ipaddr="10.250.199.15" login="admin" passwd="secret" delay="15" action="reboot" />
>> 				</method>
>> 				<method name="pdu">
>> 					<device name="pdu01" port="1" action="reboot" />
>> 					<device name="pdu02" port="1" action="reboot" />
>> 				</method>
>> 			</fence>
>> 		</clusternode>
>> 		<clusternode name="node2.ccrs.bcn" nodeid="2">
>> 			<altname name="node2.sn" />
>> 			<fence>
>> 				<method name="ipmi">
>> 					<device name="ipmi_n02" ipaddr="10.250.199.17" login="admin" passwd="secret" action="reboot" />
>> 				</method>
>> 				<method name="pdu">
>> 					<device name="pdu01" port="2" action="reboot" />
>> 					<device name="pdu02" port="2" action="reboot" />
>> 				</method>
>> 			</fence>
>> 		</clusternode>
>> 	</clusternodes>
>> 	<fencedevices>
>> 		<fencedevice name="ipmi_n01" agent="fence_ipmilan" />
>> 		<fencedevice name="ipmi_n02" agent="fence_ipmilan" />
>> 		<fencedevice name="pdu01" agent="fence_raritan_snmp" ipaddr="pdu1A" />
>> 		<fencedevice name="pdu02" agent="fence_raritan_snmp" ipaddr="pdu1B" />
>> 		<fencedevice name="pdu03" agent="fence_raritan_snmp" ipaddr="pdu2A" />
>> 		<fencedevice name="pdu04" agent="fence_raritan_snmp" ipaddr="pdu2B" />
>> 	</fencedevices>
>> 	<fence_daemon post_join_delay="30" />
>> 	<totem rrp_mode="passive" secauth="off"/>
>> 	<rm log_level="5">
>> 		<resources>
>> 			<script file="/etc/init.d/drbd" name="drbd"/>
>> 			<script file="/etc/init.d/wait-for-drbd" name="wait-for-drbd"/>
>> 			<script file="/etc/init.d/clvmd" name="clvmd"/>
>> 			<clusterfs device="/dev/node1_vg0/shared" force_unmount="1" fstype="gfs2" mountpoint="/shared" name="sharedfs" />
>> 			<script file="/etc/init.d/libvirtd" name="libvirtd"/>
>> 		</resources>
>> 		<failoverdomains>
>> 			<failoverdomain name="only_n01" nofailback="1" ordered="0" restricted="1">
>> 				<failoverdomainnode name="node1.ccrs.bcn"/>
>> 			</failoverdomain>
>> 			<failoverdomain name="only_n02" nofailback="1" ordered="0" restricted="1">
>> 				<failoverdomainnode name="node2.ccrs.bcn"/>
>> 			</failoverdomain>
>> 			<failoverdomain name="primary_n01" nofailback="1" ordered="1" restricted="1">
>> 				<failoverdomainnode name="node1.ccrs.bcn" priority="1"/>
>> 				<failoverdomainnode name="node2.ccrs.bcn" priority="2"/>
>> 			</failoverdomain>
>> 			<failoverdomain name="primary_n02" nofailback="1" ordered="1" restricted="1">
>> 				<failoverdomainnode name="node1.ccrs.bcn" priority="2"/>
>> 				<failoverdomainnode name="node2.ccrs.bcn" priority="1"/>
>> 			</failoverdomain>
>> 		</failoverdomains>
>> 		<service name="storage_n01" autostart="1" domain="only_n01" exclusive="0" recovery="restart">
>> 			<script ref="drbd">
>> 				<script ref="wait-for-drbd">
>> 					<script ref="clvmd">
>> 						<clusterfs ref="sharedfs"/>
>> 					</script>
>> 				</script>
>> 			</script>
>> 		</service>
>> 		<service name="storage_n02" autostart="1" domain="only_n02" exclusive="0" recovery="restart">
>> 			<script ref="drbd">
>> 				<script ref="wait-for-drbd">
>> 					<script ref="clvmd">
>> 						<clusterfs ref="sharedfs"/>
>> 					</script>
>> 				</script>
>> 			</script>
>> 		</service>
>> 		<service name="libvirtd_n01" autostart="1" domain="only_n01" exclusive="0" recovery="restart">
>> 			<script ref="libvirtd"/>
>> 		</service>
>> 		<service name="libvirtd_n02" autostart="1" domain="only_n02" exclusive="0" recovery="restart">
>> 			<script ref="libvirtd"/>
>> 		</service>
>> 	</rm>
>> </cluster>
>> ====
>>
>> Nothing special there at all.
>>
>> While writing this email though, I saw this on the other node:
>>
>> ====
>> Sep 24 23:03:39 node1 corosync[4770]:   [TOTEM ] Retransmit List: 14e
>> Sep 24 23:03:39 node1 corosync[4770]:   [TOTEM ] Retransmit List: 14e
>> Sep 24 23:03:49 node1 corosync[4770]:   [TOTEM ] Retransmit List: 158
>> Sep 24 23:03:49 node1 corosync[4770]:   [TOTEM ] Retransmit List: 15a
>> Sep 24 23:03:49 node1 corosync[4770]:   [TOTEM ] Retransmit List: 15a
>> Sep 24 23:03:59 node1 corosync[4770]:   [TOTEM ] Retransmit List: 161
>> Sep 24 23:03:59 node1 corosync[4770]:   [TOTEM ] Retransmit List: 161
>> Sep 24 23:03:59 node1 corosync[4770]:   [TOTEM ] Retransmit List: 161
>> Sep 24 23:03:59 node1 corosync[4770]:   [TOTEM ] Retransmit List: 161 163
>> Sep 24 23:03:59 node1 corosync[4770]:   [TOTEM ] Retransmit List: 163
>> Sep 24 23:04:19 node1 corosync[4770]:   [TOTEM ] Retransmit List: 177
>> Sep 24 23:04:19 node1 corosync[4770]:   [TOTEM ] Retransmit List: 177
>> Sep 24 23:04:19 node1 corosync[4770]:   [TOTEM ] Retransmit List: 179
>> Sep 24 23:04:19 node1 corosync[4770]:   [TOTEM ] Retransmit List: 179
>> Sep 24 23:04:29 node1 corosync[4770]:   [TOTEM ] Retransmit List: 181
>> Sep 24 23:04:29 node1 corosync[4770]:   [TOTEM ] Retransmit List: 181
>> Sep 24 23:04:29 node1 corosync[4770]:   [TOTEM ] Retransmit List: 181
>> Sep 24 23:04:29 node1 corosync[4770]:   [TOTEM ] Retransmit List: 183
>> Sep 24 23:04:29 node1 corosync[4770]:   [TOTEM ] Retransmit List: 183
>> Sep 24 23:04:39 node1 corosync[4770]:   [TOTEM ] Retransmit List: 18c
>> Sep 24 23:04:39 node1 corosync[4770]:   [TOTEM ] Retransmit List: 18c
>> Sep 24 23:04:39 node1 corosync[4770]:   [TOTEM ] Retransmit List: 18c 18e
>> Sep 24 23:04:39 node1 corosync[4770]:   [TOTEM ] Retransmit List: 18e
>> Sep 24 23:07:20 node1 corosync[4770]:   [TOTEM ] Retransmit List: 23c
>> Sep 24 23:07:20 node1 corosync[4770]:   [TOTEM ] Retransmit List: 23c
>> Sep 24 23:07:20 node1 corosync[4770]:   [TOTEM ] Retransmit List: 23c
>> Sep 24 23:07:20 node1 corosync[4770]:   [TOTEM ] Retransmit List: 23e
>> Sep 24 23:07:20 node1 corosync[4770]:   [TOTEM ] Retransmit List: 23e
>> Sep 24 23:07:30 node1 corosync[4770]:   [TOTEM ] Retransmit List: 247
>> Sep 24 23:07:30 node1 corosync[4770]:   [TOTEM ] Retransmit List: 247
>> Sep 24 23:07:30 node1 corosync[4770]:   [TOTEM ] Retransmit List: 249
>> Sep 24 23:07:30 node1 corosync[4770]:   [TOTEM ] Retransmit List: 24b
>> Sep 24 23:07:30 node1 corosync[4770]:   [TOTEM ] Retransmit List: 24b
>> Sep 24 23:07:40 node1 corosync[4770]:   [TOTEM ] Retransmit List: 252
>> Sep 24 23:07:40 node1 corosync[4770]:   [TOTEM ] Retransmit List: 252
>> Sep 24 23:07:40 node1 corosync[4770]:   [TOTEM ] Retransmit List: 254
>> Sep 24 23:07:40 node1 corosync[4770]:   [TOTEM ] Retransmit List: 254
>> ====
>>
>> Certainly *looks* like a network problem, but I can't see what's
>> wrong... Any ideas?
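
One way to judge whether the retransmits are a steady trickle or come
in bursts is to count them per minute. A minimal sketch, assuming
standard syslog lines where the third field is the HH:MM:SS timestamp;
the function name is hypothetical:

```shell
#!/bin/sh
# retransmits_per_minute: read syslog lines on stdin and print
# "HH:MM count" for each minute that has "Retransmit List" entries,
# in the order the minutes first appear.
retransmits_per_minute() {
    awk '/\[TOTEM \] Retransmit List:/ {
             minute = substr($3, 1, 5)   # $3 is HH:MM:SS
             if (!(minute in n)) order[++k] = minute
             n[minute]++
         }
         END { for (i = 1; i <= k; i++) print order[i], n[order[i]] }'
}
```

Run as `retransmits_per_minute < /var/log/messages` on each node and
compare the two outputs.
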
>>
>> Thanks!
>>

-- 
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without
access to education?