[ClusterLabs] crmd error: Cannot route message to unknown node

Ken Gaillot kgaillot at redhat.com
Thu Apr 7 21:07:04 UTC 2016


On 04/07/2016 03:22 PM, Ferenc Wágner wrote:
> Hi,
> 
> On a freshly rebooted cluster node (after crm_mon reports it as
> 'online'), I get the following:
> 
> wferi at vhbl08:~$ sudo crm_resource -r vm-cedar --cleanup
> Cleaning up vm-cedar on vhbl03, removing fail-count-vm-cedar
> Cleaning up vm-cedar on vhbl04, removing fail-count-vm-cedar
> Cleaning up vm-cedar on vhbl05, removing fail-count-vm-cedar
> Cleaning up vm-cedar on vhbl06, removing fail-count-vm-cedar
> Cleaning up vm-cedar on vhbl07, removing fail-count-vm-cedar
> Cleaning up vm-cedar on vhbl08, removing fail-count-vm-cedar
> Waiting for 6 replies from the CRMd..No messages received in 60 seconds.. aborting
> 
> Meanwhile, this is written into syslog (I can also provide info level
> logs if necessary):
> 
> 22:03:02 vhbl08 crmd[8990]:    error: Cannot route message to unknown node vhbl03
> 22:03:02 vhbl08 crmd[8990]:    error: Cannot route message to unknown node vhbl04
> 22:03:02 vhbl08 crmd[8990]:    error: Cannot route message to unknown node vhbl06
> 22:03:02 vhbl08 crmd[8990]:    error: Cannot route message to unknown node vhbl07

This message can only occur when the node name is not present in this
node's peer cache.

I'm guessing that since you don't have node names in corosync.conf, the cache
entries only have node IDs at this point. I don't know offhand when
Pacemaker would figure out the name/ID association, but I bet it would be
possible to ensure it by running some command beforehand, maybe crm_node -l?
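Alternatively, naming the nodes explicitly in corosync.conf's nodelist should avoid relying on runtime discovery altogether. A sketch (untested; the nodeid values come from your CIB output below, the address and remaining node blocks are illustrative):

```
nodelist {
    node {
        nodeid: 167773705
        name: vhbl03          # explicit name, so peers never see a bare ID
        ring0_addr: vhbl03    # hostname or IP; adjust to your ring config
    }
    # ... one node {} block per cluster member (vhbl04..vhbl08) ...
}
```

With names present in the nodelist, each node's peer cache gets the name together with the ID as soon as the peer joins.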

> 22:03:04 vhbl08 crmd[8990]:   notice: Operation vm-cedar_monitor_0: not running (node=vhbl08, call=626, rc=7, cib-update=169, confirmed=true)
> 
> For background:
> 
> wferi at vhbl08:~$ sudo cibadmin --scope=nodes -Q
> <nodes>
>   <node id="167773707" uname="vhbl05">
>     <utilization id="nodes-167773707-utilization">
>       <nvpair id="nodes-167773707-utilization-memoryMiB" name="memoryMiB" value="124928"/>
>     </utilization>
>     <instance_attributes id="nodes-167773707"/>
>   </node>
>   <node id="167773708" uname="vhbl06">
>     <utilization id="nodes-167773708-utilization">
>       <nvpair id="nodes-167773708-utilization-memoryMiB" name="memoryMiB" value="124928"/>
>     </utilization>
>     <instance_attributes id="nodes-167773708"/>
>   </node>
>   <node id="167773706" uname="vhbl04">
>     <utilization id="nodes-167773706-utilization">
>       <nvpair id="nodes-167773706-utilization-memoryMiB" name="memoryMiB" value="124928"/>
>     </utilization>
>     <instance_attributes id="nodes-167773706"/>
>   </node>
>   <node id="167773705" uname="vhbl03">
>     <utilization id="nodes-167773705-utilization">
>       <nvpair id="nodes-167773705-utilization-memoryMiB" name="memoryMiB" value="124928"/>
>     </utilization>
>     <instance_attributes id="nodes-167773705"/>
>   </node>
>   <node id="167773709" uname="vhbl07">
>     <utilization id="nodes-167773709-utilization">
>       <nvpair id="nodes-167773709-utilization-memoryMiB" name="memoryMiB" value="124928"/>
>     </utilization>
>     <instance_attributes id="nodes-167773709"/>
>   </node>
>   <node id="167773710" uname="vhbl08">
>     <utilization id="nodes-167773710-utilization">
>       <nvpair id="nodes-167773710-utilization-memoryMiB" name="memoryMiB" value="124928"/>
>     </utilization>
>   </node>
> </nodes>
> 
> Why does this happen?  I've got no node names in corosync.conf, but
> Pacemaker defaults to uname -n all right.
> 
