[ClusterLabs] stonith - no route to host

Tue Jun 16 13:30:29 EDT 2015

On 16/06/15 04:18 AM, Oscar Salvador wrote:
> 
> 
> 2015-06-16 5:59 GMT+02:00 Andrew Beekhof <andrew at beekhof.net>:
> 
> 
>     > On 16 Jun 2015, at 12:00 am, Oscar Salvador <osalvador.vilardaga at gmail.com> wrote:
>     >
>     > Hi,
>     >
>     > I've configured fencing with libvirt, but I'm having a problem
>     > with stonith: it fails with the error "no route to host".
> 
>     That message is a bit wonky.
>     What it really means is that there were no devices that advertise
>     the ability to fence that node.
> 
>     In this case, pacemaker wants to fence “server”, but hostlist is set
>     to server.fqdn.
>     Drop the .fqdn and it should work.
> 
> 
> Getting rid of the .fqdn was not an option, sorry, but I was able to fix
> it another way with the help of digimer.
> I used fence_virsh, from fence-agents.
> 
> First, I configured it like this:
> 
> primitive fence_server01 stonith:fence_virsh \
>         params ipaddr=virtnode01 port=server01.fqdn action=reboot login=root passwd=passwd delay=15 \
>         op monitor interval=60s
> primitive fence_server02 stonith:fence_virsh \
>         params ipaddr=virtnode02 port=server02.fqdn action=reboot login=root passwd=passwd delay=15 \
>         op monitor interval=60s
> 
> But when I tried to fence a node, I received these errors:
> 
> Jun 16 09:37:59 [1298] server01    pengine:  warning: pe_fence_node:     Node server02 will be fenced because p_fence_server01 is thought to be active there
> Jun 16 09:37:59 [1299] server01       crmd:   notice: te_fence_node:     Executing reboot fencing operation (12) on server02 (timeout=60000)
> Jun 16 09:37:59 [1295] server01   stonithd:   notice: handle_request:    Client crmd.1299.d339ea94 wants to fence (reboot) 'server02' with device '(any)'
> Jun 16 09:37:59 [1295] server01   stonithd:   notice: initiate_remote_stonith_op:        Initiating remote operation reboot for server02: 19fdb8e0-2611-45a7-b44d-b58fa0e99cab (0)
> Jun 16 09:37:59 [1297] server01      attrd:     info: attrd_cib_callback:        Update 12 for probe_complete: OK (0)
> Jun 16 09:37:59 [1297] server01      attrd:     info: attrd_cib_callback:        Update 12 for probe_complete[server01]=true: OK (0)
> Jun 16 09:37:59 [1295] server01   stonithd:   notice: can_fence_host_with_device:        p_fence_server02 can not fence (reboot) server02: dynamic-list
> Jun 16 09:37:59 [1295] server01   stonithd:     info: process_remote_stonith_query:      All queries have arrived, continuing (1, 1, 1, 19fdb8e0-2611-45a7-b44d-b58fa0e99cab)
> Jun 16 09:37:59 [1295] server01   stonithd:   notice: stonith_choose_peer:       Couldn't find anyone to fence server02 with <any>
> Jun 16 09:37:59 [1295] server01   stonithd:     info: call_remote_stonith:       Total remote op timeout set to 60 for fencing of node server02 for crmd.1299.19fdb8e0
> Jun 16 09:37:59 [1295] server01   stonithd:     info: call_remote_stonith:       None of the 1 peers have devices capable of terminating server02 for crmd.1299 (0)
> Jun 16 09:37:59 [1295] server01   stonithd:  warning: get_xpath_object:  No match for //@st_delegate in /st-reply
> Jun 16 09:37:59 [1295] server01   stonithd:    error: remote_op_done:    Operation reboot of server02 by server01 for crmd.1299@server01.19fdb8e0: No such device
> Jun 16 09:37:59 [1299] server01       crmd:   notice: tengine_stonith_callback:  Stonith operation 3/12:1:0:a989fb7b-1af1-4bac-992b-eef416e25775: No such device (-19)
> Jun 16 09:37:59 [1299] server01       crmd:   notice: tengine_stonith_callback:  Stonith operation 3 for server02 failed (No such device): aborting transition.
> Jun 16 09:37:59 [1299] server01       crmd:   notice: abort_transition_graph:    Transition aborted: Stonith failed (source=tengine_stonith_callback:697, 0)
> Jun 16 09:37:59 [1299] server01       crmd:   notice: tengine_stonith_notify:    Peer server02 was not terminated (reboot) by server01 for server01: No such device (ref=19fdb8e0-2611-45a7-b44d-b58fa0e99cab) by client crmd.1299
> 
> 
> 
> So, I had to add the *pcmk_host_list* parameter, like this:
> 
> primitive fence_server01 stonith:fence_virsh \
>         params ipaddr=virtnode01 port=server01.fqdn action=reboot login=root passwd=passwd delay=15 pcmk_host_list=server01 \
>         op monitor interval=60s
> primitive fence_server02 stonith:fence_virsh \
>         params ipaddr=virtnode02 port=server02.fqdn action=reboot login=root passwd=passwd delay=15 pcmk_host_list=server02 \
>         op monitor interval=60s
> 
> Could you explain why this is needed? I hope this doesn't sound rude;
> I just don't understand why.
> 
> Thank you very much
> Oscar Salvador

Don't use 'delay="15"' on both nodes! It's meant to give one node a
head-start over the other to help avoid a 'dual fence'. The node that
has the delay will live while the node without a delay will die in a
case where communications fail and both nodes try to fence each other at
the same time.

Say you have 'delay="15"' on 'server01'. Both start to fence; 'server01'
looks up how to fence 'server02', sees no delay and immediately fences.
Meanwhile, 'server02' looks up how to fence 'server01', sees a delay and
pauses. If 'server01' was really dead, 'server02' would proceed with the
fence action after 15 seconds. However, if 'server01' is alive,
'server02' will die long before its pause expires.
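
As a rough sketch only, assuming the pcmk_host_list version of the
configuration quoted above and assuming 'server01' is the node you want
to survive, the delay would stay only on the device that fences
'server01' (untested; adjust the names to your cluster):

    # Sketch, not a tested config: delay kept on one device only.
    # Assumption: server01 is the node that should win a fence race.
    primitive fence_server01 stonith:fence_virsh \
            params ipaddr=virtnode01 port=server01.fqdn action=reboot login=root passwd=passwd delay=15 pcmk_host_list=server01 \
            op monitor interval=60s
    primitive fence_server02 stonith:fence_virsh \
            params ipaddr=virtnode02 port=server02.fqdn action=reboot login=root passwd=passwd pcmk_host_list=server02 \
            op monitor interval=60s

With something like this, a fence request against 'server01' waits 15
seconds, while a request against 'server02' runs immediately.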

-- 
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without
access to education?