[ClusterLabs] Antw: [EXT] Re: OCF_TIMEOUT ‑ Does it recover by itself?

Ken Gaillot kgaillot at redhat.com
Wed Apr 27 09:58:38 EDT 2022


On Wed, 2022-04-27 at 08:49 +0200, Ulrich Windl wrote:
> > > > Ken Gaillot <kgaillot at redhat.com> schrieb am 26.04.2022 um
> > > > 21:24 in
> Nachricht
> <ebf9500a0af6fab1153d25c8859b80bd287f3e4c.camel at redhat.com>:
> > On Tue, 2022-04-26 at 15:20 -0300, Salatiel Filho wrote:
> > > I have a question about OCF_TIMEOUT. Sometimes my cluster shows
> > > me this in the pcs status output:
> > > Failed Resource Actions:
> > >   * fence-server02_monitor_60000 on server01 'OCF_TIMEOUT' (198): call=419, status='Timed Out', exitreason='', last-rc-change='2022-04-26 14:47:32 -03:00', queued=0ms, exec=20004ms
> > > 
> > > I can see in the same pcs status output that the fence device is
> > > started, so does that mean it failed at some point in the past
> > > and now it is OK? Or do I have to do something to recover it?
> > 
> > Correct, the status shows failures that have happened in the past.
> > The
> 
> However, the "past" was rather recent ;-)
> 
> > cluster tries to recover failed resources automatically according
> > to
> > whatever policy has been configured (the default being to stop and
> > start the resource).
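
(The failure-handling policy is mostly a matter of the monitor's
on-fail setting and resource meta attributes like migration-threshold
and failure-timeout. Purely as an illustration, not something you
need here:

# pcs resource meta fence-server02 migration-threshold=3 failure-timeout=600

would ban the device from a node after three failures there and
forget those failures after ten minutes.)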
> 
> AFAIR the cluster stops monitoring after that, and you have to clean
> up the error first.

Nope, the monitor keeps running. Monitors are only stopped for
maintenance mode or node standby.

Of course if the resource moves to another node, the monitor will stop
on the original node (unless you specifically configure a monitor for
the Stopped role).
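
If you ever do want a Stopped-role monitor, a rough sketch of how it
could be added with pcs (interval and timeout are just placeholders):

# pcs resource op add fence-server02 monitor interval=120s role=Stopped timeout=30s

The interval just has to differ from the existing 60s monitor, since
operations with the same name must have unique intervals.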

> Am I wrong?
> 
> > Since the resource is shown as active, there's nothing you have to
> > do. You can investigate the timeout (for example, look at the
> > system logs around that timestamp to see if anything else unusual
> > was reported), and you can clear the failure from the status
> > display with "crm_resource --cleanup" (or "pcs resource cleanup").
> 
> 20 seconds can be rather short for some monitors on a busy system.
> Maybe you are suffering from "read stalls" (when a lot of dirty
> buffers are being written)?
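
The exec=20004ms above does look like the default 20s operation
timeout being hit. If the fence agent is just occasionally slow (e.g.
the VMware API under load), raising the monitor timeout may already
help; something along these lines, though I'm writing the pcs syntax
from memory:

# pcs stonith update fence-server02 op monitor interval=60s timeout=60s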
> 
> You could use the classic "sa/sar" tools to monitor your system, or
> if you have a specific suspect you might use monit to check it.
> For example, I'm monitoring the /var filesystem here in a VM:
> 
> # monit status fs_var
> Monit 5.29.0 uptime: 6h 20m
> 
> Filesystem 'fs_var'
>   status                       OK
>   monitoring status            Monitored
>   monitoring mode              active
>   on reboot                    start
>   filesystem type              ext3
>   filesystem flags             rw,relatime,data=ordered
>   permission                   755
>   uid                          0
>   gid                          0
>   block size                   4 kB
>   space total                  5.5 GB (of which 10.9% is reserved for root user)
>   space free for non superuser 2.8 GB [51.4%]
>   space free total             3.4 GB [62.3%]
>   inodes total                 786432
>   inodes free                  781794 [99.4%]
>   read bytes                   34.1 B/s [113.3 MB total]
>   disk read operations         0.0 reads/s [4269 reads total]
>   write bytes                  4.2 kB/s [75.5 MB total]
>   disk write operations        1.0 writes/s [15037 writes total]
>   service time                 0.007ms/operation (of which read 0.000ms, write 0.007ms)
>   data collected               Wed, 27 Apr 2022 08:46:17
> 
> (You can trigger alerts if any of those values exceeds some
> threshold)
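
For anyone curious, the check behind that output is declared roughly
like this in monitrc (the thresholds are only placeholders):

check filesystem fs_var with path /var
    if space usage > 90% then alert
    if inode usage > 90% then alert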
> 
> Regards,
> Ulrich
> 
> 
> > > # pcs status
> > > Cluster name: cluster1
> > > Cluster Summary:
> > >   * Stack: corosync
> > >   * Current DC: server02 (version 2.1.0-8.el8-7c3f660707) - partition with quorum
> > >   * Last updated: Tue Apr 26 14:52:56 2022
> > >   * Last change:  Tue Apr 26 14:37:22 2022 by hacluster via crmd on server01
> > >   * 2 nodes configured
> > >   * 11 resource instances configured
> > > 
> > > Node List:
> > >   * Online: [ server01 server02 ]
> > > 
> > > Full List of Resources:
> > >   * fence-server01    (stonith:fence_vmware_rest):     Started server02
> > >   * fence-server02    (stonith:fence_vmware_rest):     Started server01
> > > ...
> > > 
> > > Is "pcs resource cleanup" the right way to remove those messages
> > > ?
> > > 
> > > 
> > > 
> > > 
> > > Atenciosamente/Kind regards,
> > > Salatiel
> > ‑‑ 
> > Ken Gaillot <kgaillot at redhat.com>
-- 
Ken Gaillot <kgaillot at redhat.com>


