[ClusterLabs] stonith - no route to host

Andrew Beekhof andrew at beekhof.net
Mon Jun 15 23:59:43 EDT 2015


> On 16 Jun 2015, at 12:00 am, Oscar Salvador <osalvador.vilardaga at gmail.com> wrote:
> 
> Hi,
> 
> I've configured fencing with libvirt, but I'm having a problem with stonith due to the error "no route to host".

That message is a bit wonky.
What it really means is that there were no devices advertising the ability to fence that node.

In this case, pacemaker wants to fence "server02" but hostlist is set to server02.fqdn.
Drop the .fqdn and it should work.
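
For example, the first fencing primitive from the config below would become
something like this (a sketch; the second one changes the same way):

  primitive p_fence_server01 stonith:external/libvirt \
          params hostlist=server01 hypervisor_uri="qemu+tls://virtnode01:16514/system"

If the guest really is defined in libvirt as server01.fqdn, the external/libvirt
agent's hostlist should also accept a node:domain mapping such as
hostlist="server01:server01.fqdn" (assuming a cluster-glue version that supports
that syntax), so the pacemaker node name and the virsh domain name don't have to
match.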

> 
> Config:
> 
> node 1053402612: server01
> node 1053402613: server02 \
>         attributes standby=off
> primitive IP-rsc_nginx IPaddr2 \
>         params ip=xx.xx.xx.xx nic=eth0 cidr_netmask=xx.xx.xy.xx \
>         meta migration-threshold=2 \
>         op monitor interval=20 timeout=60 on-fail=restart
> primitive Nginx-rsc nginx \
>         meta migration-threshold=2 \
>         op monitor interval=20 timeout=60 on-fail=restart
> primitive p_fence_server01 stonith:external/libvirt \
>         params hostlist=server01.fqdn hypervisor_uri="qemu+tls://virtnode01:16514/system"
> primitive p_fence_server02 stonith:external/libvirt \
>         params hostlist=server02.fqdn hypervisor_uri="qemu+tls://virtnode02:16514/system"
> location l_fence_server01 p_fence_server01 -inf: server01
> location l_fence_server02 p_fence_server02 -inf: server02
> colocation lb-loc inf: IP-rsc_nginx Nginx-rsc
> order lb-ord inf: IP-rsc_nginx Nginx-rsc
> property cib-bootstrap-options: \
>         stonith-enabled=true \
>         no-quorum-policy=ignore \
>         default-resource-stickiness=100 \
>         last-lrm-refresh=1434360625 \
>         dc-version=1.1.12-561c4cf \
>         cluster-infrastructure=corosync
> 
> 
> As you see, in hostlist I'm using host+fqdn, since that's the name you see when running "virsh list".
> Also, the nodes can ping each other using just "server0x"; you don't need the full domain.
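
The mismatch to check here is between the pacemaker node name and the hostlist
value, not DNS. A quick way to compare the two, as a sketch (the hypervisor URI
is the one from the config above):

  crm_node -n                                        # the name pacemaker uses for this node
  virsh -c qemu+tls://virtnode01:16514/system list   # the names libvirt uses for the guests

stonithd will only pick a device for a node whose pacemaker name appears in the
list that device advertises.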
> 
> I was testing stonith by killing corosync on server02, and I got these errors in the logs:
> 
> 
> Jun 15 14:44:45 [1301] server01   stonithd:    debug: stonith_action_async_done:         Child process 18649 performing action 'reboot' exited with rc 1
> Jun 15 14:44:45 [1301] server01   stonithd:     info: update_remaining_timeout:  Attempted to execute agent fence_legacy (reboot) the maximum number of times (2) allowed
> Jun 15 14:44:45 [1301] server01   stonithd:    debug: st_child_done:     Operation 'reboot' on 'p_fence_server02' completed with rc=1 (0 remaining)
> Jun 15 14:44:45 [1301] server01   stonithd:    error: log_operation:     Operation 'reboot' [18649] (call 13 from crmd.1305) for host 'server02' with device 'p_fence_server02' returned: -201 (Generic Pacemaker error) 
> Jun 15 14:44:45 [1301] server01   stonithd:  warning: log_operation:     p_fence_server02:18649 [ Performing: stonith -t external/libvirt -T reset server02 ]
> Jun 15 14:44:45 [1301] server01   stonithd:  warning: log_operation:     p_fence_server02:18649 [ failed: server02 5 ]
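
The "Performing:" line above shows the exact legacy command stonithd ran, so the
agent can be exercised by hand with the same parameters, e.g. (a sketch using the
values from the config):

  stonith -t external/libvirt hostlist=server02.fqdn \
          hypervisor_uri="qemu+tls://virtnode02:16514/system" -T reset server02

which should fail the same way, because "server02" is not in that hostlist.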
> 
> 
> Jun 15 14:44:49 [1301] server01   stonithd:    debug: stonith_command:   Processing st_notify reply 0 from server01 (               0)            
> Jun 15 14:44:49 [1301] server01   stonithd:    debug: process_remote_stonith_exec:       Marking call to reboot for server02 on behalf of crmd.1305 at 4281c4bb-9922-4a4d-97f3-706f7d34ec1c.test-lb0: No route to host (-113) 
> Jun 15 14:44:49 [1301] server01   stonithd:  warning: get_xpath_object:  No match for //@st_delegate in /st-reply
> Jun 15 14:44:49 [1301] server01   stonithd:    error: remote_op_done:    Operation reboot of server02 by server01 for crmd.1305 at server01.4281c4bb: No route to host
> Jun 15 14:44:49 [1301] server01   stonithd:    debug: stonith_command:   Processed st_notify reply from server01: OK (0)
> Jun 15 14:44:49 [1305] server01       crmd:   notice: tengine_stonith_callback:  Stonith operation 13/14:26:0:9234dba0-9b0d-4047-b4df-d05f9430f101: No route to host (-113) 
> Jun 15 14:44:49 [1305] server01       crmd:   notice: tengine_stonith_callback:  Stonith operation 13 for server02 failed (No route to host): aborting transition.
> Jun 15 14:44:49 [1305] server01       crmd:     info: abort_transition_graph:    Transition aborted: Stonith failed (source=tengine_stonith_callback:697, 0)
> Jun 15 14:44:49 [1305] server01       crmd:   notice: tengine_stonith_notify:    Peer server02 was not terminated (reboot) by server01 for server01: No route to host (ref=4281c4bb-9922-4a4d-97f3-706f7d34ec1c) by client crmd.1305
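
The -113 above is EHOSTUNREACH; stonithd reuses that errno when no registered
device claims the target, which is where the misleading "No route to host" text
comes from. As a sanity check, listing the registered devices:

  stonith_admin -L

should show both p_fence_server01 and p_fence_server02.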
> 
> 
> I tried it manually, like this:
> 
> stonith_admin -V -F server02
> 
> I got the same error, but if I try with the fqdn, like:
> 
> stonith_admin -V -F server02.fqdn
> 
> Then it works. I don't know why pacemaker can't resolve the host without the fqdn:
> 
> root at server01 ~# host server02
> server02.fqdn has address xx.xx.xx.xx
> root at server01 ~# host server01
> server01.fqdn has address xx.xx.xx.xy
> 
> root at server02 ~# host server02
> server02.fqdn has address xx.xx.xx.xx
> root at server02 ~# host server01
> server01.fqdn has address xx.xx.xx.xy
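
This isn't really a name-resolution problem: stonith_admin just asks stonithd
which devices claim the name you pass, and only the .fqdn form matches the
configured hostlist, so DNS never enters into it. A sketch of how to see the
matching directly:

  stonith_admin -l server02        # finds no capable device with the current config
  stonith_admin -l server02.fqdn   # should return p_fence_server02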
> 
> 
> 
> Does anybody have an idea about this?
> 
> Thank you very much
> Oscar Salvador
> 
> _______________________________________________
> Users mailing list: Users at clusterlabs.org
> http://clusterlabs.org/mailman/listinfo/users
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
