[ClusterLabs] Antw: RE: Antw: [EXT] Re: "Error: unable to fence '001db02a'" but It got fenced anyway
Digimer
lists at alteeve.ca
Thu Mar 4 00:35:30 EST 2021
On 2021-03-03 1:56 a.m., Ulrich Windl wrote:
>>>> Eric Robinson <eric.robinson at psmnv.com> wrote on 02.03.2021 at 19:26 in
> message
> <SA2PR03MB58847E37845FC6C92BC3007EFA999 at SA2PR03MB5884.namprd03.prod.outlook.com>
>
>>> -----Original Message-----
>>> From: Users <users-bounces at clusterlabs.org> On Behalf Of Digimer
>>> Sent: Monday, March 1, 2021 11:02 AM
>>> To: Cluster Labs - All topics related to open-source clustering welcomed
>>> <users at clusterlabs.org>; Ulrich Windl <Ulrich.Windl at rz.uni-regensburg.de>
>>> Subject: Re: [ClusterLabs] Antw: [EXT] Re: "Error: unable to fence '001db02a'"
> ...
>>>>> Cloud fencing usually requires a higher timeout (20s reported here).
>>>>>
>>>>> Microsoft seems to suggest the following setup:
>>>>>
>>>>> # pcs property set stonith-timeout=900
>>>>
>>>> But doesn't that mean the other node waits 15 minutes after stonith
>>>> until it performs the first post-stonith action?
>>>
>>> No, it means that if there is no reply by then, the fence has failed. If the
>>> fence happens sooner, and the caller is told this, recovery begins very shortly
>>> after.
>
> How would the fencing be confirmed? I don't know.
It's part of the FenceAgentAPI. The cluster invokes the fence agent,
passes in variable=value pairs on STDIN, and waits for the agent to
exit. It reads the agent's exit code and uses that to determine success
or failure.
So if the fence agent is invoked and, 5 seconds later, it exits with the
"success" RC (exit code 0), the cluster knows the peer is gone and that it
can now safely begin recovery. The timeout only matters when the agent
never reports back.
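To make the flow above concrete, here is a rough Python sketch of what a
caller does: start the agent, write variable=value pairs to its STDIN, wait
for it to exit, and treat exit code 0 as success. This is only an
illustration, not how Pacemaker's fencer is actually implemented. The
function name run_fence_agent, the stand-in agents, and the option names
shown (action, port) are assumptions chosen for the example; a real agent
documents its own parameters.

```python
import subprocess
import sys


def run_fence_agent(agent_cmd, options, timeout):
    """Run a fence agent and report whether it signalled success.

    agent_cmd: argv list for the agent (hypothetical; e.g. a fence_* script).
    options:   dict of variable=value pairs passed on STDIN, one per line.
    timeout:   seconds to wait before treating the fence as failed
               (the role stonith-timeout plays for the cluster).
    """
    stdin_payload = "".join(f"{key}={value}\n" for key, value in options.items())
    try:
        result = subprocess.run(
            agent_cmd,
            input=stdin_payload,
            text=True,
            capture_output=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        # No answer within the timeout: the fence is considered failed.
        return False
    # Exit code 0 means the agent confirmed the fence; anything else is failure.
    return result.returncode == 0


# Stand-in "agents" for demonstration only: they read STDIN and exit.
FAKE_AGENT_OK = [sys.executable, "-c", "import sys; sys.stdin.read(); sys.exit(0)"]
FAKE_AGENT_FAIL = [sys.executable, "-c", "import sys; sys.stdin.read(); sys.exit(1)"]

if __name__ == "__main__":
    opts = {"action": "off", "port": "001db02a"}  # illustrative option names
    print(run_fence_agent(FAKE_AGENT_OK, opts, timeout=900))
    print(run_fence_agent(FAKE_AGENT_FAIL, opts, timeout=900))
```

Note that with timeout=900 the call still returns as soon as the agent
exits; the 900 seconds is only an upper bound on how long to wait.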
--
Digimer
Papers and Projects: https://alteeve.com/w/
"I am, somehow, less interested in the weight and convolutions of
Einstein’s brain than in the near certainty that people of equal talent
have lived and died in cotton fields and sweatshops." - Stephen Jay Gould