[ClusterLabs] Antw: [EXT] Re: "Error: unable to fence '001db02a'" but It got fenced anyway

Eric Robinson eric.robinson at psmnv.com
Tue Mar 2 13:26:03 EST 2021


> -----Original Message-----
> From: Users <users-bounces at clusterlabs.org> On Behalf Of Digimer
> Sent: Monday, March 1, 2021 11:02 AM
> To: Cluster Labs - All topics related to open-source clustering welcomed
> <users at clusterlabs.org>; Ulrich Windl <Ulrich.Windl at rz.uni-regensburg.de>
> Subject: Re: [ClusterLabs] Antw: [EXT] Re: "Error: unable to fence '001db02a'"
> but It got fenced anyway
>
> On 2021-03-01 2:50 a.m., Ulrich Windl wrote:
> > Valentin Vidic <vvidic at valentin-vidic.from.hr> wrote on 28.02.2021 at 16:59
> > in message <20210228155921.GM29617 at valentin-vidic.from.hr>:
> >> On Sun, Feb 28, 2021 at 03:34:20PM +0000, Eric Robinson wrote:
> >>> 001db02b rebooted. After it came back up, I tried it in the other
> >>> direction.
> >>>
> >>> On node 001db02b, the command...
> >>>
> >>> # pcs stonith fence 001db02a
> >>>
> >>> ...produced output...
> >>>
> >>> Error: unable to fence '001db02a'.
> >>>
> >>> However, node 001db02a did get restarted!
> >>>
> >>> We also saw this error...
> >>>
> >>> Failed Actions:
> >>> * stonith-001db02ab_start_0 on 001db02a 'unknown error' (1): call=70,
> >>>     status=Timed Out, exitreason='',
> >>>     last-rc-change='Sun Feb 28 10:11:10 2021', queued=0ms, exec=20014ms
> >>>
> >>> When that happens, does Pacemaker take over the other node's
> >>> resources, or not?
> >>
> >> Cloud fencing usually requires a higher timeout (20s reported here).
> >>
> >> Microsoft seems to suggest the following setup:
> >>
> >> # pcs property set stonith-timeout=900
> >
> > But doesn't that mean the other node waits 15 minutes after stonith
> > until it performs the first post-stonith action?
>
> No, it means that if there is no reply by then, the fence has failed. If the
> fence happens sooner, and the caller is told this, recovery begins very shortly
> after.
>
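To convince myself of the semantics Digimer describes, here is a rough shell analogy. It is not a Pacemaker command: `timeout` (from GNU coreutils) stands in for the fencing deadline, `sleep` stands in for the cloud API call, and `run_with_deadline` is a made-up helper name. The deadline is an upper bound, so an action that finishes early returns right away, and only one that gives no reply in time is reported as failed.

```shell
#!/bin/sh
# Analogy only: the deadline is an upper bound, not a fixed wait.
# `timeout` plays the role of stonith-timeout; `sleep` plays the fence action.

run_with_deadline() {
    deadline=$1   # maximum seconds to wait for a reply
    duration=$2   # seconds the simulated fence action actually takes
    if timeout "$deadline" sleep "$duration"; then
        echo "confirmed"   # reply arrived early: caller proceeds immediately
    else
        echo "failed"      # no reply before the deadline: treated as a failure
    fi
}

fast=$(run_with_deadline 5 1)   # returns after about 1 second, not 5
slow=$(run_with_deadline 1 3)   # cut off after about 1 second
echo "fast=$fast slow=$slow"
```

If the analogy holds, a 900-second stonith-timeout only matters when the fence agent never answers; a fence confirmed after 30 seconds would let recovery start at that point.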

Interesting. Since users often report application failures within 1-3 minutes, and engineers may begin investigating immediately, a technician could end up connecting to a cluster node after the stonith command was called and could conceivably bring a failed node back up manually, only to have Azure finally get around to shooting it in the head. I don't suppose there's a way to abort or cancel a STONITH operation that is in progress?

> >> # pcs stonith create rsc_st_azure fence_azure_arm username="login ID"
> >>   password="password" resourceGroup="resource group" tenantId="tenant ID"
> >>   subscriptionId="subscription id"
> >>   pcmk_host_map="prod-cl1-0:prod-cl1-0-vm-name;prod-cl1-1:prod-cl1-1-vm-name"
> >>   power_timeout=240 pcmk_reboot_timeout=900 pcmk_monitor_timeout=120
> >>   pcmk_monitor_retries=4 pcmk_action_limit=3
> >>   op monitor interval=3600
> >>
> >>
> >> https://docs.microsoft.com/en-us/azure/virtual-machines/workloads/sap/high-availability-guide-rhel-pacemaker
> >>
> >> --
> >> Valentin
> >> _______________________________________________
> >> Manage your subscription:
> >> https://lists.clusterlabs.org/mailman/listinfo/users
> >>
> >> ClusterLabs home: https://www.clusterlabs.org/
> >
> >
> >
> >
>
>
> --
> Digimer
> Papers and Projects: https://alteeve.com/w/ "I am, somehow, less
> interested in the weight and convolutions of Einstein’s brain than in the near
> certainty that people of equal talent have lived and died in cotton fields and
> sweatshops." - Stephen Jay Gould