[ClusterLabs] Re: How to clean up failed fencing action?
Klaus Wenninger
kwenning at redhat.com
Mon Aug 5 09:23:00 EDT 2019
On 8/5/19 3:00 PM, Ulrich Windl wrote:
>>>> Andrei Borzenkov <arvidjaar at gmail.com> wrote on 03.08.2019 at 18:17 in
> message <35a226a8-115b-4dc0-f505-dbd78cdd748b at gmail.com>:
>> I'm using sbd watchdog and stonith-watchdog-timeout without explicit
>> stonith agents (shared-nothing cluster). How can I clean up a failed
>> fencing action?
>>
>> Current DC: ha1 (version
>> 2.0.1+20190408.1b68da8e8-1.3-2.0.1+20190408.1b68da8e8) - partition with
>> quorum
>> Last updated: Sat Aug 3 19:10:12 2019
>> Last change: Sat Aug 3 19:04:56 2019 by hacluster via crmd on ha1
>>
>> 2 nodes configured
>> 7 resources configured
>>
>> Online: [ ha1 ha2 ]
>>
>> Active resources:
>>
>> A (ocf::heartbeat:Dummy): Started ha1
>> B (ocf::heartbeat:Dummy): Started ha1
>> C (ocf::heartbeat:Dummy): Started ha1
>> D (ocf::heartbeat:Dummy): Started ha1
>> E (ocf::heartbeat:Dummy): Started ha1
>> F (ocf::heartbeat:Dummy): Started ha1
>>
>> Failed Fencing Actions:
>> * reboot of ha2 failed: delegate=, client=pacemaker-controld.1910,
>> origin=ha1, last-failed='Sat Aug 3 18:54:13 2019'
>>
>> crm_resource requires a resource, which does not exist here.
> I'd say a manual reboot of ha2 should clean up the situation ;-)
> But why did fencing fail?
Nope; at least with reasonably current pacemaker versions (both 1.1.x and
2.x.x), the fencing history is inherited from the pre-existing nodes when a
node joins the cluster.
Thus rebooting a single node won't purge the history.
The low-level command for handling the fencing history is stonith_admin:
 -H, --history=value   Show last successful fencing operation for named node
                       (or '*' for all nodes). Optional: --timeout, --cleanup,
                       --quiet (show only the operation's epoch timestamp),
                       --verbose (show all recorded and pending operations),
                       --broadcast (update history from all nodes available).
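Going by that option description, cleaning up the failed-fencing entry would
be something along these lines (exact invocation may vary with your
stonith_admin version):

  # clean up the recorded fencing actions for ha2 only
  stonith_admin --history ha2 --cleanup

  # or wipe the fencing history for all nodes
  stonith_admin --history '*' --cleanup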
Regarding high-level tooling, there is e.g. 'pcs stonith cleanup ...'.
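With newer pcs releases the fencing history has its own subcommand, so the
cleanup might rather look like the following (check 'pcs stonith --help' for
the exact spelling in your version):

  pcs stonith history cleanup ha2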
Just to be on the safe side: you are using qdevice for quorum?
(A 2-node cluster with watchdog-fencing isn't going to work without a source
of real quorum, for obvious reasons.)
I'm just wondering how watchdog-fencing can go wrong at all: it basically
just waits stonith-watchdog-timeout seconds for the unseen node to have
committed suicide.
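For reference, stonith-watchdog-timeout is just a cluster property. It is
typically set to a value comfortably larger than SBD's own watchdog timeout,
e.g. with crm_attribute (the value here is only an example):

  # example value; must exceed the watchdog timeout configured for sbd
  crm_attribute --type crm_config --name stonith-watchdog-timeout --update 10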
Klaus