[ClusterLabs] Re: [EXT] Re: OCF_TIMEOUT - Does it recover by itself?
Ulrich Windl
Ulrich.Windl at rz.uni-regensburg.de
Wed Apr 27 02:49:53 EDT 2022
>>> Ken Gaillot <kgaillot at redhat.com> wrote on 26.04.2022 at 21:24 in message
<ebf9500a0af6fab1153d25c8859b80bd287f3e4c.camel at redhat.com>:
> On Tue, 2022-04-26 at 15:20 -0300, Salatiel Filho wrote:
>> I have a question about OCF_TIMEOUT. Sometimes my cluster shows me
>> this in pcs status:
>> Failed Resource Actions:
>> * fence-server02_monitor_60000 on server01 'OCF_TIMEOUT' (198):
>> call=419, status='Timed Out', exitreason='',
>> last-rc-change='2022-04-26 14:47:32 -03:00', queued=0ms, exec=20004ms
>>
>> I can see in the same pcs status output that the fence device is
>> started, so does that mean it failed at some point in the past and is
>> now OK? Or do I have to do something to recover it?
>
> Correct, the status shows failures that have happened in the past. The
However, the "past" was rather recent ;-)
> cluster tries to recover failed resources automatically according to
> whatever policy has been configured (the default being to stop and
> start the resource).
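(Side note: that recovery policy is mostly controlled by ordinary Pacemaker
resource meta attributes such as migration-threshold and failure-timeout; a
minimal sketch with pcs, using the fence device from this thread and purely
illustrative values:
# pcs resource meta fence-server02 migration-threshold=3 failure-timeout=300s
With that, the device would be moved away after three failures, and recorded
failures would expire after five minutes.)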
AFAIR the cluster stops monitoring after that, and you have to clean up the
error first.
Am I wrong?
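(For what it's worth, one can at least see whether new failures are still
being recorded, which would indicate the monitor keeps running; a quick check,
using the resource name from this thread:
# pcs resource failcount show fence-server02
# crm_failcount --query --resource fence-server02
)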
>
> Since the resource is shown as active, there's nothing you have to do.
> You can investigate the timeout (for example look at the system logs
> around that timestamp to see if anything else unusual was reported),
> and you can clear the failure from the status display with
> "crm_resource --cleanup" (or "pcs resource cleanup").
20 seconds can be rather short for some monitors on a busy system.
Maybe you suffer from "read stalls" (when a lot of dirty buffers are
written)?
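(If 20 seconds is simply too tight here, the monitor timeout on the fence
device can also be raised; a sketch, assuming a pcs version that still accepts
resource commands for stonith devices, otherwise use the corresponding
"pcs stonith" form:
# pcs resource update fence-server02 op monitor interval=60s timeout=40s
)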
You could use the classic "sa/sar" tools to monitor your system, or, if you
have a specific suspect, you might use monit to check it.
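A few starting points with sysstat, just as a sketch: "sar -u" shows CPU
utilization including %iowait, "sar -d" shows per-device I/O activity, and
"sar -b" shows overall I/O and transfer rates, here each sampled every 60
seconds:
# sar -u 60
# sar -d 60
# sar -b 60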
For example I'm monitoring the /var filesystem here in a VM:
# monit status fs_var
Monit 5.29.0 uptime: 6h 20m
Filesystem 'fs_var'
  status                          OK
  monitoring status               Monitored
  monitoring mode                 active
  on reboot                       start
  filesystem type                 ext3
  filesystem flags                rw,relatime,data=ordered
  permission                      755
  uid                             0
  gid                             0
  block size                      4 kB
  space total                     5.5 GB (of which 10.9% is reserved for root user)
  space free for non superuser    2.8 GB [51.4%]
  space free total                3.4 GB [62.3%]
  inodes total                    786432
  inodes free                     781794 [99.4%]
  read bytes                      34.1 B/s [113.3 MB total]
  disk read operations            0.0 reads/s [4269 reads total]
  write bytes                     4.2 kB/s [75.5 MB total]
  disk write operations           1.0 writes/s [15037 writes total]
  service time                    0.007ms/operation (of which read 0.000ms, write 0.007ms)
  data collected                  Wed, 27 Apr 2022 08:46:17
(You can trigger alerts if any of those values exceeds some threshold)
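The matching monit configuration is a one-liner per threshold; a minimal
sketch for the filesystem check above, with purely illustrative limits:
check filesystem fs_var with path /var
  if space usage > 90% then alert
  if inode usage > 95% then alert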
Regards,
Ulrich
>
>>
>> # pcs status
>> Cluster name: cluster1
>> Cluster Summary:
>> * Stack: corosync
>> * Current DC: server02 (version 2.1.0-8.el8-7c3f660707) - partition
>> with quorum
>> * Last updated: Tue Apr 26 14:52:56 2022
>> * Last change: Tue Apr 26 14:37:22 2022 by hacluster via crmd on
>> server01
>> * 2 nodes configured
>> * 11 resource instances configured
>>
>> Node List:
>> * Online: [ server01 server02 ]
>>
>> Full List of Resources:
>> * fence-server01 (stonith:fence_vmware_rest): Started server02
>> * fence-server02 (stonith:fence_vmware_rest): Started server01
>> ...
>>
>> Is "pcs resource cleanup" the right way to remove those messages ?
>>
>>
>>
>>
>> Atenciosamente/Kind regards,
>> Salatiel
> --
> Ken Gaillot <kgaillot at redhat.com>
>
> _______________________________________________
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users
>
> ClusterLabs home: https://www.clusterlabs.org/