[ClusterLabs] Antw: Re: Antw: RE: Antw: [EXT] Re: "Error: unable to fence '001db02a'" but It got fenced anyway

Ulrich Windl Ulrich.Windl at rz.uni-regensburg.de
Fri Mar 5 02:14:49 EST 2021


>>> Digimer <lists at alteeve.ca> wrote on 04.03.2021 at 06:35 in message
<ce63f17f-b07e-5dee-cdad-8f3feaa857ff at alteeve.ca>:
> On 2021-03-03 1:56 a.m., Ulrich Windl wrote:
>>>>> Eric Robinson <eric.robinson at psmnv.com> wrote on 02.03.2021 at 19:26 in message
>> <SA2PR03MB58847E37845FC6C92BC3007EFA999 at SA2PR03MB5884.namprd03.prod.outlook.com>
>> 
>>>>  -----Original Message-----
>>>> From: Users <users-bounces at clusterlabs.org> On Behalf Of Digimer
>>>> Sent: Monday, March 1, 2021 11:02 AM
>>>> To: Cluster Labs - All topics related to open-source clustering welcomed
>>>> <users at clusterlabs.org>; Ulrich Windl <Ulrich.Windl at rz.uni-regensburg.de>
>>>> Subject: Re: [ClusterLabs] Antw: [EXT] Re: "Error: unable to fence '001db02a'"
>> ...
>>>>>> Cloud fencing usually requires a higher timeout (20s reported here).
>>>>>>
>>>>>> Microsoft seems to suggest the following setup:
>>>>>>
>>>>>> # pcs property set stonith-timeout=900
>>>>>
>>>>> But doesn't that mean the other node waits 15 minutes after stonith
>>>>> until it performs the first post-stonith action?
>>>>
>>>> No, it means that if there is no reply by then, the fence has failed. If
>>>> the fence happens sooner, and the caller is told this, recovery begins
>>>> very shortly after.
>> 
>> How would the fencing be confirmed? I don't know.
> 
> It's part of the FenceAgentAPI. The cluster invokes the fence agent,
> passes in variable=value pairs on STDIN, and waits for the agent to
> exit. It reads the agent's exit code and uses that to determine success
> or failure.
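
For illustration, the agent side of the contract described above might look
roughly like the Python sketch below. It is not taken from any real fence
agent. The option names `action` and `port`, the `power_off` callback and
`fake_power_off` are placeholders assumed for this example; only the general
shape (variable=value pairs read from STDIN, result reported through the exit
code) comes from the description above.

```python
import sys


def parse_options(lines):
    """Parse variable=value lines, skipping blank lines and '#' comments."""
    options = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep:  # ignore lines that contain no '=' at all
            options[key.strip()] = value.strip()
    return options


def run_agent(options, power_off):
    """Return the exit code: 0 only if the power-off was confirmed."""
    if options.get("action", "off") != "off":
        return 1  # this sketch only implements the 'off' action
    target = options.get("port")  # assumed name for the victim's identifier
    if not target:
        return 1
    try:
        confirmed = power_off(target)
    except Exception:
        return 1  # any error counts as "fence not confirmed"
    return 0 if confirmed else 1


def fake_power_off(target):
    """Hypothetical stand-in for the device-specific power-off call."""
    return True


def main(stream=sys.stdin):
    # A real agent would end with: sys.exit(main())
    return run_agent(parse_options(stream), fake_power_off)
```

The point of the sketch is that the agent must return 0 only when it has
positive confirmation; every doubtful outcome maps to a non-zero exit code.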

But an agent acting remotely cannot be sure the remote end was really killed,
especially when the network connection seems dead.
I see that in the IPMI case you have a separate connection allowing
out-of-band signaling, but in the general case that would not be possible.

> 
> So if the fence agent is invoked and 5 seconds later, it exits with the
> "success" RC, the cluster knows the peer is gone and that it can now
> safely begin recovery.
> 
> 
> -- 
> Digimer
> Papers and Projects: https://alteeve.com/w/ 
> "I am, somehow, less interested in the weight and convolutions of
> Einstein’s brain than in the near certainty that people of equal talent
> have lived and died in cotton fields and sweatshops." - Stephen Jay Gould




