[ClusterLabs] Re: How to clean up failed fencing action?
Klaus Wenninger
kwenning at redhat.com
Mon Aug 5 09:23:00 EDT 2019
On 8/5/19 3:00 PM, Ulrich Windl wrote:
>>>> Andrei Borzenkov <arvidjaar at gmail.com> wrote on 03.08.2019 at 18:17 in
> message <35a226a8-115b-4dc0-f505-dbd78cdd748b at gmail.com>:
>> I'm using sbd watchdog and stonith-watchdog-timeout without explicit
>> stonith agents (shared-nothing cluster). How can I clean up a failed
>> fencing action?
>>
>> Current DC: ha1 (version
>> 2.0.1+20190408.1b68da8e8-1.3-2.0.1+20190408.1b68da8e8) - partition with
>> quorum
>> Last updated: Sat Aug 3 19:10:12 2019
>> Last change: Sat Aug 3 19:04:56 2019 by hacluster via crmd on ha1
>>
>> 2 nodes configured
>> 7 resources configured
>>
>> Online: [ ha1 ha2 ]
>>
>> Active resources:
>>
>> A (ocf::heartbeat:Dummy): Started ha1
>> B (ocf::heartbeat:Dummy): Started ha1
>> C (ocf::heartbeat:Dummy): Started ha1
>> D (ocf::heartbeat:Dummy): Started ha1
>> E (ocf::heartbeat:Dummy): Started ha1
>> F (ocf::heartbeat:Dummy): Started ha1
>>
>> Failed Fencing Actions:
>> * reboot of ha2 failed: delegate=, client=pacemaker-controld.1910,
>> origin=ha1, last-failed='Sat Aug 3 18:54:13 2019'
>>
>> crm_resource requires a resource, which does not exist here.
> I'd say a manual reboot of ha2 should clean up the situation ;-)
> But why did fencing fail?
Nope; at least with reasonably current pacemaker versions (both 1.1.x and
2.x.x), the fencing history is inherited from the pre-existing nodes when a
node joins the cluster.
Thus rebooting a single node won't purge the history.
The low-level command for handling the fencing history is stonith_admin:
 -H, --history=value   Show last successful fencing operation for named node
                       (or '*' for all nodes). Optional: --timeout, --cleanup,
                       --quiet (show only the operation's epoch timestamp),
                       --verbose (show all recorded and pending operations),
                       --broadcast (update history from all nodes available).
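Going by that option description, cleaning up the failed-fencing entry would
be something along these lines (exact invocation may vary with your
stonith_admin version):

  # clean up the recorded fencing actions for ha2 only
  stonith_admin --history ha2 --cleanup

  # or wipe the fencing history for all nodes
  stonith_admin --history '*' --cleanup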
Regarding high-level tooling, there is e.g. 'pcs stonith cleanup ...'.
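With newer pcs releases the fencing history has its own subcommand, so the
cleanup might rather look like the following (check 'pcs stonith --help' for
the exact spelling in your version):

  pcs stonith history cleanup ha2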
Just to be on the safe side: you are using qdevice for quorum?
(A 2-node cluster with watchdog-fencing isn't going to work without a source
of real quorum, for obvious reasons.)
I'm just wondering how watchdog-fencing can go wrong at all: it basically
just waits stonith-watchdog-timeout seconds for the unseen node to have
committed suicide.
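For reference, stonith-watchdog-timeout is just a cluster property. It is
typically set to a value comfortably larger than SBD's own watchdog timeout,
e.g. with crm_attribute (the value here is only an example):

  # example value; must exceed the watchdog timeout configured for sbd
  crm_attribute --type crm_config --name stonith-watchdog-timeout --update 10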
Klaus