[ClusterLabs] Antw: RE: Antw: [EXT] Re: "Error: unable to fence '001db02a'" but It got fenced anyway

Eric Robinson eric.robinson at psmnv.com
Wed Mar 3 18:53:18 EST 2021


> -----Original Message-----
> From: Users <users-bounces at clusterlabs.org> On Behalf Of Ulrich Windl
> Sent: Wednesday, March 3, 2021 12:57 AM
> To: users at clusterlabs.org
> Subject: [ClusterLabs] Antw: RE: Antw: [EXT] Re: "Error: unable to fence
> '001db02a'" but It got fenced anyway
>
> >>> Eric Robinson <eric.robinson at psmnv.com> wrote on 02.03.2021 at
> >>> 19:26 in message
> <SA2PR03MB58847E37845FC6C92BC3007EFA999 at SA2PR03MB5884.namprd03.prod.outlook.com>
>
> >>  -----Original Message-----
> >> From: Users <users-bounces at clusterlabs.org> On Behalf Of Digimer
> >> Sent: Monday, March 1, 2021 11:02 AM
> >> To: Cluster Labs - All topics related to open-source clustering
> >> welcomed <users at clusterlabs.org>; Ulrich Windl
> >> <Ulrich.Windl at rz.uni-regensburg.de>
> >> Subject: Re: [ClusterLabs] Antw: [EXT] Re: "Error: unable to fence
> >> '001db02a'"
> ...
> >> >> Cloud fencing usually requires a higher timeout (20s reported here).
> >> >>
> >> >> Microsoft seems to suggest the following setup:
> >> >>
> >> >> # pcs property set stonith-timeout=900
> >> >
> >> > But doesn't that mean the other node waits 15 minutes after stonith
> >> > until it performs the first post-stonith action?
> >>
> >> No, it means that if there is no reply by then, the fence has failed.
> >> If the fence happens sooner, and the caller is told this, recovery
> >> begins very shortly after.
>
> How would the fencing be confirmed? I don't know.
>
>
> >>
> >
> > Interesting. Since users often report application failure within 1-3
> > minutes and my engineers begin investigating immediately, a technician
> > could end up connecting to a cluster node after the stonith command was
> > called, and could conceivably bring a failed node back up manually, only
> > to have Azure finally get around to shooting it in the head. I don't
> > suppose there's a way to abort/cancel a STONITH operation that is in
> > progress?
>
> I think you have to decide: Let the cluster handle the problem, or let the
> admin handle the problem, but preferably not both.
> I also think you cannot cancel a STONITH; you can only confirm it.
>
> Regards,
> Ulrich
>

Standing by and letting the cluster handle the problem is a hard pill to swallow when a technician could resolve things and bring services back up sooner, but I get your point.
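
For anyone else who read stonith-timeout=900 as "wait 15 minutes", here is a rough Python sketch of the behaviour Digimer describes: the timeout is only an upper bound on how long the caller waits for a fence result. This is not Pacemaker code. The function names, the queue used to deliver the result, and the simulated agent are all made up for illustration.

```python
import queue
import threading
import time


def wait_for_fence(result_queue, timeout_s):
    """Wait for a fence result, but never longer than timeout_s.

    Returns (outcome, elapsed_seconds). A result that arrives early is
    returned right away; the timeout only matters if no reply comes.
    """
    start = time.monotonic()
    try:
        succeeded = result_queue.get(timeout=timeout_s)
        outcome = "fenced" if succeeded else "failed"
    except queue.Empty:
        # No reply within the timeout: treat the fence as failed.
        outcome = "timed out"
    return outcome, time.monotonic() - start


def simulated_fence_agent(result_queue, delay_s):
    """Hypothetical fence agent that reports success after delay_s seconds."""
    time.sleep(delay_s)
    result_queue.put(True)


if __name__ == "__main__":
    results = queue.Queue()
    agent = threading.Thread(
        target=simulated_fence_agent, args=(results, 0.2), daemon=True
    )
    agent.start()
    # Generous timeout, but the caller returns as soon as the agent replies.
    outcome, elapsed = wait_for_fence(results, timeout_s=5.0)
    print(outcome, f"after {elapsed:.1f}s")
```

With a slow cloud API the reply may take a while, which is the case for a high timeout, but a quick confirmation is acted on immediately instead of after the full period.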


