[ClusterLabs] Re: [EXT] Re: OCF_TIMEOUT - Does it recover by itself?
Ulrich Windl
Ulrich.Windl at rz.uni-regensburg.de
Wed Apr 27 02:49:53 EDT 2022
>>> Ken Gaillot <kgaillot at redhat.com> wrote on 26.04.2022 at 21:24 in message
<ebf9500a0af6fab1153d25c8859b80bd287f3e4c.camel at redhat.com>:
> On Tue, 2022-04-26 at 15:20 -0300, Salatiel Filho wrote:
>> I have a question about OCF_TIMEOUT. Sometimes my cluster shows me
>> this in pcs status:
>> Failed Resource Actions:
>> * fence-server02_monitor_60000 on server01 'OCF_TIMEOUT' (198):
>> call=419, status='Timed Out', exitreason='',
>> last-rc-change='2022-04-26 14:47:32 -03:00', queued=0ms, exec=20004ms
>>
>> I can see in the same pcs status output that the fence device is
>> started, so does that mean it failed at some point in the past and is
>> now OK? Or do I have to do something to recover it?
>
> Correct, the status shows failures that have happened in the past. The
However, the "past" was rather recent ;-)
> cluster tries to recover failed resources automatically according to
> whatever policy has been configured (the default being to stop and
> start the resource).
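(Side note: that recovery policy is mostly controlled by ordinary Pacemaker
resource meta attributes such as migration-threshold and failure-timeout; a
minimal sketch with pcs, using the fence device from this thread and purely
illustrative values:
# pcs resource meta fence-server02 migration-threshold=3 failure-timeout=300s
With that, the device would be moved away after three failures, and recorded
failures would expire after five minutes.)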
AFAIR the cluster stops monitoring after that, and you have to clean up the
error first.
Am I wrong?
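(For what it's worth, one can at least see whether new failures are still
being recorded, which would indicate the monitor keeps running; a quick check,
using the resource name from this thread:
# pcs resource failcount show fence-server02
# crm_failcount --query --resource fence-server02
)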
>
> Since the resource is shown as active, there's nothing you have to do.
> You can investigate the timeout (for example look at the system logs
> around that timestamp to see if anything else unusual was reported),
> and you can clear the failure from the status display with
> "crm_resource --cleanup" (or "pcs resource cleanup").
20 seconds can be rather short for some monitors on a busy system.
Maybe you suffer from "read stalls" (when a lot of dirty buffers are
written)?
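(If 20 seconds is simply too tight here, the monitor timeout on the fence
device can also be raised; a sketch, assuming a pcs version that still accepts
resource commands for stonith devices, otherwise use the corresponding
"pcs stonith" form:
# pcs resource update fence-server02 op monitor interval=60s timeout=40s
)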
You could use the classic "sa/sar" tools to monitor your system, or, if you
have a specific suspect, you might use monit to check it.
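A few starting points with sysstat, just as a sketch: "sar -u" shows CPU
utilization including %iowait, "sar -d" shows per-device I/O activity, and
"sar -b" shows overall I/O and transfer rates, here each sampled every 60
seconds:
# sar -u 60
# sar -d 60
# sar -b 60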
For example I'm monitoring the /var filesystem here in a VM:
# monit status fs_var
Monit 5.29.0 uptime: 6h 20m
Filesystem 'fs_var'
  status                          OK
  monitoring status               Monitored
  monitoring mode                 active
  on reboot                       start
  filesystem type                 ext3
  filesystem flags                rw,relatime,data=ordered
  permission                      755
  uid                             0
  gid                             0
  block size                      4 kB
  space total                     5.5 GB (of which 10.9% is reserved for root user)
  space free for non superuser    2.8 GB [51.4%]
  space free total                3.4 GB [62.3%]
  inodes total                    786432
  inodes free                     781794 [99.4%]
  read bytes                      34.1 B/s [113.3 MB total]
  disk read operations            0.0 reads/s [4269 reads total]
  write bytes                     4.2 kB/s [75.5 MB total]
  disk write operations           1.0 writes/s [15037 writes total]
  service time                    0.007ms/operation (of which read 0.000ms, write 0.007ms)
  data collected                  Wed, 27 Apr 2022 08:46:17
(You can trigger alerts if any of those values exceeds some threshold)
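The matching monit configuration is a one-liner per threshold; a minimal
sketch for the filesystem check above, with purely illustrative limits:
check filesystem fs_var with path /var
  if space usage > 90% then alert
  if inode usage > 95% then alert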
Regards,
Ulrich
>
>>
>> # pcs status
>> Cluster name: cluster1
>> Cluster Summary:
>> * Stack: corosync
>> * Current DC: server02 (version 2.1.0-8.el8-7c3f660707) - partition
>> with quorum
>> * Last updated: Tue Apr 26 14:52:56 2022
>> * Last change: Tue Apr 26 14:37:22 2022 by hacluster via crmd on
>> server01
>> * 2 nodes configured
>> * 11 resource instances configured
>>
>> Node List:
>> * Online: [ server01 server02 ]
>>
>> Full List of Resources:
>> * fence-server01 (stonith:fence_vmware_rest): Started server02
>> * fence-server02 (stonith:fence_vmware_rest): Started server01
>> ...
>>
>> Is "pcs resource cleanup" the right way to remove those messages ?
>>
>>
>>
>>
>> Atenciosamente/Kind regards,
>> Salatiel
> --
> Ken Gaillot <kgaillot at redhat.com>
>
> _______________________________________________
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users
>
> ClusterLabs home: https://www.clusterlabs.org/