[ClusterLabs] stonith - no route to host
Oscar Salvador
osalvador.vilardaga at gmail.com
Tue Jun 16 18:10:52 UTC 2015
2015-06-16 19:30 GMT+02:00 Digimer <lists at alteeve.ca>:
> On 16/06/15 04:18 AM, Oscar Salvador wrote:
> >
> >
> > 2015-06-16 5:59 GMT+02:00 Andrew Beekhof <andrew at beekhof.net>:
> >
> >
> > > On 16 Jun 2015, at 12:00 am, Oscar Salvador <osalvador.vilardaga at gmail.com> wrote:
> > >
> > > Hi,
> > >
> > > I've configured fencing with libvirt, but I'm having a
> > > problem with stonith, due to the error "no route to host".
> >
> > That message is a bit wonky.
> > What it really means is that there were no devices that advertise
> > the ability to fence that node.
> >
> > In this case, pacemaker wants to fence "server" but hostlist is set
> > to server.fqdn.
> > Drop the .fqdn and it should work.
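> >
> > For reference, a minimal sketch of what that looks like (assuming
> > the stonith:external/libvirt agent implied by the hostlist parameter
> > above; the names and hypervisor URI are illustrative):
> >
> > primitive fence_libvirt stonith:external/libvirt \
> >         params hostlist="server" \
> >         hypervisor_uri="qemu+ssh://virtnode01/system" \
> >         op monitor interval=60s
> >
> > The point being that the hostlist entry has to match the node name
> > pacemaker uses, not the libvirt domain name.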
> >
> >
> > Getting rid of the fqdn was not an option, sorry, but I was able to
> > fix it another way with the help of Digimer.
> > I've used fence_virsh, from the fence-agents package.
> >
> > First of all, I configured it this way:
> >
> > primitive fence_server01 stonith:fence_virsh \
> >         params ipaddr=virtnode01 port=server01.fqdn action=reboot \
> >         login=root passwd=passwd delay=15 \
> >         op monitor interval=60s
> > primitive fence_server02 stonith:fence_virsh \
> >         params ipaddr=virtnode02 port=server02.fqdn action=reboot \
> >         login=root passwd=passwd delay=15 \
> >         op monitor interval=60s
> >
> > But when I tried to fence a node, I received these errors:
> >
> > Jun 16 09:37:59 [1298] server01 pengine: warning: pe_fence_node:
> >     Node server02 will be fenced because p_fence_server01 is thought
> >     to be active there
> > Jun 16 09:37:59 [1299] server01 crmd: notice: te_fence_node:
> >     Executing reboot fencing operation (12) on server02 (timeout=60000)
> > Jun 16 09:37:59 [1295] server01 stonithd: notice: handle_request:
> >     Client crmd.1299.d339ea94 wants to fence (reboot) 'server02'
> >     with device '(any)'
> > Jun 16 09:37:59 [1295] server01 stonithd: notice: initiate_remote_stonith_op:
> >     Initiating remote operation reboot for server02:
> >     19fdb8e0-2611-45a7-b44d-b58fa0e99cab (0)
> > Jun 16 09:37:59 [1297] server01 attrd: info: attrd_cib_callback:
> >     Update 12 for probe_complete: OK (0)
> > Jun 16 09:37:59 [1297] server01 attrd: info: attrd_cib_callback:
> >     Update 12 for probe_complete[server01]=true: OK (0)
> > Jun 16 09:37:59 [1295] server01 stonithd: notice: can_fence_host_with_device:
> >     p_fence_server02 can not fence (reboot) server02: dynamic-list
> > Jun 16 09:37:59 [1295] server01 stonithd: info: process_remote_stonith_query:
> >     All queries have arrived, continuing (1, 1, 1,
> >     19fdb8e0-2611-45a7-b44d-b58fa0e99cab)
> > Jun 16 09:37:59 [1295] server01 stonithd: notice: stonith_choose_peer:
> >     Couldn't find anyone to fence server02 with <any>
> > Jun 16 09:37:59 [1295] server01 stonithd: info: call_remote_stonith:
> >     Total remote op timeout set to 60 for fencing of node server02
> >     for crmd.1299.19fdb8e0
> > Jun 16 09:37:59 [1295] server01 stonithd: info: call_remote_stonith:
> >     None of the 1 peers have devices capable of terminating server02
> >     for crmd.1299 (0)
> > Jun 16 09:37:59 [1295] server01 stonithd: warning: get_xpath_object:
> >     No match for //@st_delegate in /st-reply
> > Jun 16 09:37:59 [1295] server01 stonithd: error: remote_op_done:
> >     Operation reboot of server02 by server01 for
> >     crmd.1299@server01.19fdb8e0: No such device
> > Jun 16 09:37:59 [1299] server01 crmd: notice: tengine_stonith_callback:
> >     Stonith operation 3/12:1:0:a989fb7b-1af1-4bac-992b-eef416e25775:
> >     No such device (-19)
> > Jun 16 09:37:59 [1299] server01 crmd: notice: tengine_stonith_callback:
> >     Stonith operation 3 for server02 failed (No such device):
> >     aborting transition.
> > Jun 16 09:37:59 [1299] server01 crmd: notice: abort_transition_graph:
> >     Transition aborted: Stonith failed (source=tengine_stonith_callback:697, 0)
> > Jun 16 09:37:59 [1299] server01 crmd: notice: tengine_stonith_notify:
> >     Peer server02 was not terminated (reboot) by server01 for server01:
> >     No such device (ref=19fdb8e0-2611-45a7-b44d-b58fa0e99cab)
> >     by client crmd.1299
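> >
> > (The can_fence_host_with_device line says stonithd consulted the
> > agent's dynamic host list. That list can be queried by hand, if I'm
> > using the agent right; a sketch with the same illustrative
> > credentials:
> >
> > fence_virsh -a virtnode02 -l root -p passwd -o list
> >
> > which should print the libvirt domain names, i.e. server02.fqdn,
> > rather than the cluster node name server02.)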
> >
> >
> > So, I had to add the pcmk_host_list parameter, like this:
> >
> > primitive fence_server01 stonith:fence_virsh \
> >         params ipaddr=virtnode01 port=server01.fqdn action=reboot \
> >         login=root passwd=passwd delay=15 pcmk_host_list=server01 \
> >         op monitor interval=60s
> > primitive fence_server02 stonith:fence_virsh \
> >         params ipaddr=virtnode02 port=server02.fqdn action=reboot \
> >         login=root passwd=passwd delay=15 pcmk_host_list=server02 \
> >         op monitor interval=60s
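> >
> > (A quick check after a change like this, if I read the tool right, is
> >
> > stonith_admin -l server02
> >
> > which asks stonithd to list the devices able to fence server02;
> > fence_server02 should now show up there.)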
> >
> > Could you explain why? I hope this doesn't sound rough; it's just
> > that I don't understand why.
> >
> > Thank you very much
> > Oscar Salvador
>
> Don't use 'delay="15"' on both nodes! It's meant to give one node a
> head-start over the other to help avoid a 'dual fence'. The node that
> has the delay will live while the node without a delay will die in a
> case where communication fails and both nodes try to fence the other
> at the same time.
>
> Say you have 'delay="15"' on 'server01'; both start to fence. server01
> looks up how to fence server02, sees no delay and immediately fences.
> Meanwhile, 'server02' looks up how to fence 'server01', sees a delay
> and pauses. If server01 were really dead, server02 would proceed with
> the fence action after 15 seconds. However, if server01 is alive,
> server02 will die long before its pause expires.
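>
> Concretely (a sketch reusing the primitives above): keep delay=15 only
> on the device that fences the node you want to win the duel, and drop
> it from the other one:
>
> primitive fence_server01 stonith:fence_virsh \
>         params ipaddr=virtnode01 port=server01.fqdn action=reboot \
>         login=root passwd=passwd delay=15 pcmk_host_list=server01 \
>         op monitor interval=60s
> primitive fence_server02 stonith:fence_virsh \
>         params ipaddr=virtnode02 port=server02.fqdn action=reboot \
>         login=root passwd=passwd pcmk_host_list=server02 \
>         op monitor interval=60s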
>
>
Hey Digimer, I know; in my actual config I have only one "delay"
specified, for exactly this purpose. It was probably a copy/paste error.
Thanks anyway ;)
Oscar Salvador