[ClusterLabs] OCF_TIMEOUT - Does it recover by itself?

Strahil Nikolov hunter86_bg at yahoo.com
Wed Apr 27 18:26:19 EDT 2022


You can use a meta attribute to expire failures. The attribute name is 'failure-timeout'. I have used it for my fencing devices, as during the night the network was quite busy.
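For example, a minimal sketch of setting it on a single fence device (the resource name and the 300-second value here are only illustrative):

pcs resource meta fence-server01 failure-timeout=300

Once the resource has run without failing for that long, Pacemaker expires the recorded failure on its own (evaluated at the cluster-recheck-interval), so it drops out of the failed-actions list without a manual cleanup.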

Best Regards,
Strahil Nikolov
 
On Tue, Apr 26, 2022 at 23:54, Hayden, Robert via Users <users at clusterlabs.org> wrote:

Robert Hayden | Lead Technology Architect | Cerner Corporation | 816.201.4068 | rhayden at cerner.com | www.cerner.com


> -----Original Message-----
> From: Users <users-bounces at clusterlabs.org> On Behalf Of Ken Gaillot
> Sent: Tuesday, April 26, 2022 2:25 PM
> To: Cluster Labs - All topics related to open-source clustering welcomed
> <users at clusterlabs.org>
> Subject: Re: [ClusterLabs] OCF_TIMEOUT - Does it recover by itself?
>
> On Tue, 2022-04-26 at 15:20 -0300, Salatiel Filho wrote:
> > I have a question about OCF_TIMEOUT. Sometimes my cluster shows me
> > this on pcs status:
> > Failed Resource Actions:
> >  * fence-server02_monitor_60000 on server01 'OCF_TIMEOUT' (198):
> > call=419, status='Timed Out', exitreason='',
> > last-rc-change='2022-04-26 14:47:32 -03:00', queued=0ms, exec=20004ms
> >
> > I can see in the same pcs status output that the fence device is
> > started, so does that mean it failed some moment in the past and now
> > it is OK? Or do I have to do something to recover it?
>
> Correct, the status shows failures that have happened in the past. The
> cluster tries to recover failed resources automatically according to
> whatever policy has been configured (the default being to stop and
> start the resource).
>
> Since the resource is shown as active, there's nothing you have to do.
> You can investigate the timeout (for example look at the system logs
> around that timestamp to see if anything else unusual was reported),
> and you can clear the failure from the status display with
> "crm_resource --cleanup" (or "pcs resource cleanup").
>
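For example, the cleanup can be scoped to a single resource instead of wiping the whole failure history; the resource name below is taken from the status output quoted further down:

pcs resource cleanup fence-server02
crm_resource --cleanup --resource fence-server02

(The two commands are equivalent; pcs is essentially a wrapper around the lower-level crm_resource tool here.)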

FYI - I have had some issues with "pcs resource cleanup" on past events, where it decided to
restart my already recovered and running resources, throwing me into another
short outage.  I have also seen past-but-recovered failures cause problems during later
events where nodes were coming out of maintenance mode (times when the cluster
reviews the state of resources and sees a past failure, but does not recognize that it was already
recovered).  This was mainly on RHEL/OL 7 clusters.

Since people don't like to see failures in the "pcs status" output, I have moved
to using the following to automatically clear resource failures after one week:

pcs resource defaults failure-timeout=604800

That gives people a chance to investigate a past failure, but then it falls off the cluster's radar.
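The value is just one week expressed in seconds (7 days * 24 hours * 3600 seconds = 604800). A quick way to confirm the default took effect is to list the configured resource defaults:

pcs resource defaults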

> >
> > # pcs status
> > Cluster name: cluster1
> > Cluster Summary:
> >  * Stack: corosync
> >  * Current DC: server02 (version 2.1.0-8.el8-7c3f660707) - partition
> > with quorum
> >  * Last updated: Tue Apr 26 14:52:56 2022
> >  * Last change:  Tue Apr 26 14:37:22 2022 by hacluster via crmd on
> > server01
> >  * 2 nodes configured
> >  * 11 resource instances configured
> >
> > Node List:
> >  * Online: [ server01 server02 ]
> >
> > Full List of Resources:
> >  * fence-server01    (stonith:fence_vmware_rest):    Started
> > server02
> >  * fence-server02    (stonith:fence_vmware_rest):    Started
> > server01
> > ...
> >
> > Is "pcs resource cleanup" the right way to remove those messages ?
> >
> >
> >
> >
> > Atenciosamente/Kind regards,
> > Salatiel
> --
> Ken Gaillot <kgaillot at redhat.com>
>
> _______________________________________________
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users
>
> ClusterLabs home: https://www.clusterlabs.org/


_______________________________________________
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/
  

