[ClusterLabs] Antw: Re: Antw: RE: Antw: [EXT] Re: "Error: unable to fence '001db02a'" but It got fenced anyway

Fri Mar 5 11:06:30 EST 2021

On 3/5/21 8:14 AM, Ulrich Windl wrote:
>>>> Digimer <lists at alteeve.ca> wrote on 04.03.2021 at 06:35 in message
> <ce63f17f-b07e-5dee-cdad-8f3feaa857ff at alteeve.ca>:
>> On 2021-03-03 1:56 a.m., Ulrich Windl wrote:
>>>>>> Eric Robinson <eric.robinson at psmnv.com> wrote on 02.03.2021 at 19:26 in message
>>> <SA2PR03MB58847E37845FC6C92BC3007EFA999 at SA2PR03MB5884.namprd03.prod.outlook.com>:
>>>>>  -----Original Message-----
>>>>> From: Users <users-bounces at clusterlabs.org> On Behalf Of Digimer
>>>>> Sent: Monday, March 1, 2021 11:02 AM
>>>>> To: Cluster Labs - All topics related to open-source clustering welcomed
>>>>> <users at clusterlabs.org>; Ulrich Windl
> <Ulrich.Windl at rz.uni-regensburg.de>
>>>>> Subject: Re: [ClusterLabs] Antw: [EXT] Re: "Error: unable to fence '001db02a'"
>>> ...
>>>>>>> Cloud fencing usually requires a higher timeout (20s reported here).
>>>>>>>
>>>>>>> Microsoft seems to suggest the following setup:
>>>>>>>
>>>>>>> # pcs property set stonith-timeout=900
>>>>>> But doesn't that mean the other node waits 15 minutes after stonith
>>>>>> until it performs the first post-stonith action?
>>>>> No, it means that if there is no reply by then, the fence has failed.
>>>>> If the fence happens sooner, and the caller is told this, recovery
>>>>> begins very shortly after.
>>> How would the fencing be confirmed? I don't know.
>> It's part of the FenceAgentAPI. The cluster invokes the fence agent,
>> passes in variable=value pairs on STDIN, and waits for the agent to
>> exit. It reads the agent's exit code and uses that to determine success
>> or failure.
> But the agent "acting remote" cannot be sure the "remote end" was killed,
> specifically when the network connection seems dead.
> I see that in the IPMI case you have a separate connection allowing
> "out-of-band signaling", but in the general case that would not be possible.
Fence agents are expected to be implemented so that a successful return
from a fence action implies that the result was verified on the "remote end".
If you don't have such an "out-of-band signaling" channel and still want
resources to be rescheduled when the network drops somewhere, the only
options left are SBD, if you want to stay with a single cluster, or booth,
if you can imagine going with multiple clusters. (SBD here means watchdog
fencing; with poison-pill/shared-disk you would be using the communication
with the shared disk as this kind of "out-of-band signaling".)
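
To illustrate what "verification" means on the agent side, here is a minimal
sketch in Python. It is not taken from any real fence agent: power_off and
is_powered_on are hypothetical callables standing in for whatever the
out-of-band channel offers (IPMI, a cloud API, ...), and the 20-second default
is only an assumption. The point is that the agent reports success only after
it has seen the target powered off.

```python
import time


def fence_off(power_off, is_powered_on, verify_timeout_s=20.0, poll_s=0.5,
              sleep=time.sleep, clock=time.monotonic):
    """Request power-off, then poll until the target is confirmed off.

    power_off and is_powered_on are hypothetical callables for the
    out-of-band channel. Returns 0 on verified success, 1 otherwise,
    mirroring the exit code an agent would hand back to the cluster.
    """
    power_off()
    deadline = clock() + verify_timeout_s
    while True:
        if not is_powered_on():
            return 0  # verified: the target is really off
        if clock() >= deadline:
            return 1  # could not verify in time, so report failure
        sleep(poll_s)
```

If the status query itself cannot reach the device, the sketch never sees
"off" and returns failure, which is the safe answer.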
>> So if the fence agent is invoked and 5 seconds later, it exits with the
>> "success" RC, the cluster knows the peer is gone and that it can now
>> safely begin recovery.
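
The caller's side of what Digimer describes can be sketched roughly like this
in Python. This is an illustration only, not Pacemaker's actual code:
run_fence_agent, FAKE_AGENT and the option names used below are made up for the
example, and timeout_s merely plays the role that stonith-timeout plays in the
discussion above (an upper bound, not a fixed wait).

```python
import subprocess
import sys

# Hypothetical stand-in for a real agent: exits 0 only if asked for "off".
FAKE_AGENT = [
    sys.executable, "-c",
    "import sys; data = sys.stdin.read(); "
    "sys.exit(0 if 'action=off' in data else 1)",
]


def run_fence_agent(agent_cmd, options, timeout_s):
    """Run an agent, feed it key=value lines on STDIN, judge by exit code.

    Returns True only if the agent exits with 0 before timeout_s elapses.
    """
    stdin_data = "".join(f"{key}={value}\n" for key, value in options.items())
    try:
        proc = subprocess.run(agent_cmd, input=stdin_data, text=True,
                              capture_output=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return False  # no answer within the limit: the fence has failed
    return proc.returncode == 0  # an early success ends the wait at once


if __name__ == "__main__":
    ok = run_fence_agent(FAKE_AGENT, {"action": "off", "port": "001db02a"}, 900)
    print("fenced" if ok else "fence failed")
```

With a limit of 900 the call still returns as soon as the agent exits; the
full 15 minutes only pass if the agent never answers.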
>>
>>
>> -- 
>> Digimer
>> Papers and Projects: https://alteeve.com/w/ 
>> "I am, somehow, less interested in the weight and convolutions of
>> Einstein’s brain than in the near certainty that people of equal talent
>> have lived and died in cotton fields and sweatshops." - Stephen Jay Gould