[ClusterLabs] clear pending fence operation

Tue Jul 7 08:03:26 UTC 2015

On Tue, Jul 7, 2015 at 10:41 AM,  <philipp.achmueller at arz.at> wrote:
> hi,
>
> is there any way to clear/remove pending stonith operation on cluster node?
>
> after some internal testing i got following status:
>
> Jul  4 12:18:02 XXX crmd[1673]:   notice: te_fence_node: Executing reboot
> fencing operation (179) on XXX (timeout=60000)
> Jul  4 12:18:02 XXX stonith-ng[1668]:   notice: handle_request: Client
> crmd.1673.1867d504 wants to fence (reboot) 'XXX' with device '(any)'
> Jul  4 12:18:02 XXX stonith-ng[1668]:   notice: initiate_remote_stonith_op:
> Initiating remote operation reboot for XXX:
> 3453b93d-a13a-4513-b05b-b79ad85ff992 (0)
> Jul  4 12:18:03 XXX stonith-ng[1668]:    error: remote_op_done: Operation
> reboot of XXX by <no-one> for crmd.1673 at XXX.3453b93d: Generic Pacemaker
> error
> Jul  4 12:18:03 XXX crmd[1673]:   notice: tengine_stonith_callback: Stonith
> operation 2/179:23875:0:134436dd-4df8-44a2-bf4a-ec6276883edd: Generic
> Pacemaker error (-201)
> Jul  4 12:18:03 XXX crmd[1673]:   notice: tengine_stonith_callback: Stonith
> operation 2 forXXX failed (Generic Pacemaker error): aborting transition.
> Jul  4 12:18:03 XXX crmd[1673]:   notice: abort_transition_graph: Transition
> aborted: Stonith failed (source=tengine_stonith_callback:697, 0)
> Jul  4 12:18:03 XXX crmd[1673]:   notice: tengine_stonith_notify: Peer XXX
> was not terminated (reboot) by <anyone> for XXX: Generic Pacemaker error
> (ref=3453b93d-a13a-4513-b05b-b79ad85ff992) by client crmd.1673
>
> so, node XXX is still online, i want to get cluster back to stable
>

If you are using a sufficiently recent Pacemaker, you can use
"stonith_admin --confirm <node>". This tells the cluster that the node is
safely down, so be sure to actually stop all resources on the victim node
first.

On older Pacemaker versions, "crm node clearstate <node>" from crmsh does
the same.
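
For reference, here is a rough sketch of how the two commands above might be
wrapped so they can be previewed before running them for real. The helper
names (run, clear_pending_fence), the DRY_RUN variable, the "recent"/"older"
switch and the node name node1 are all made up for illustration. Only
"stonith_admin --confirm" and "crm node clearstate" come from the advice
above. With DRY_RUN=1, which is the default here, nothing is executed.

```shell
#!/bin/sh
# Sketch only: DRY_RUN, run and clear_pending_fence are hypothetical helpers.
# With DRY_RUN=1 (the default) commands are printed instead of executed.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

# $1: "recent" to use stonith_admin, "older" to use crmsh
# $2: name of the node whose pending fence operation should be cleared
clear_pending_fence() {
    case "$1" in
        recent) run stonith_admin --confirm "$2" ;;
        older)  run crm node clearstate "$2" ;;
        *)      echo "usage: clear_pending_fence recent|older NODE" >&2
                return 1 ;;
    esac
}

# node1 is a placeholder; stop all resources on that node before doing this.
clear_pending_fence recent node1
```

Set DRY_RUN=0 only once you are sure the node really has no resources
running, since the cluster will treat it as safely down afterwards.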