[ClusterLabs] Antw: Re: [EXT] Problem with DLM

Thu Jul 28 05:28:24 EDT 2022

>>> "Lentes, Bernd" <bernd.lentes at helmholtz-muenchen.de> wrote on 26.07.2022 at
21:36 in message
<1994685463.141245271.1658864186207.JavaMail.zimbra at helmholtz-muenchen.de>:

> 
> ----- On 26 Jul, 2022, at 20:06, Ulrich Windl 
> Ulrich.Windl at rz.uni-regensburg.de wrote:
> 
>> Hi Bernd!
>> 
>> I think the answer may be some time before the timeout was reported; maybe a
>> network issue? Or a very high load. It's hard to say from the logs...
> 
> Yes, I had a high load before:
> Jul 20 00:17:42 [32512] ha-idg-1       crmd:   notice: 
> throttle_check_thresholds:       High CPU load detected: 90.080002
> Jul 20 00:18:12 [32512] ha-idg-1       crmd:   notice: 
> throttle_check_thresholds:       High CPU load detected: 76.169998
> Jul 20 00:18:42 [32512] ha-idg-1       crmd:   notice: 
> throttle_check_thresholds:       High CPU load detected: 85.629997
> Jul 20 00:19:12 [32512] ha-idg-1       crmd:   notice: 
> throttle_check_thresholds:       High CPU load detected: 70.660004
> Jul 20 00:19:42 [32512] ha-idg-1       crmd:   notice: 
> throttle_check_thresholds:       High CPU load detected: 58.340000
> Jul 20 00:20:12 [32512] ha-idg-1       crmd:     info: 
> throttle_check_thresholds:       Moderate CPU load detected: 48.740002
> Jul 20 00:20:12 [32512] ha-idg-1       crmd:     info: 
> throttle_send_command:   New throttle mode: 0010 (was 0100)
> Jul 20 00:20:42 [32512] ha-idg-1       crmd:     info: 
> throttle_check_thresholds:       Moderate CPU load detected: 41.889999
> Jul 20 00:21:12 [32512] ha-idg-1       crmd:     info: 
> throttle_send_command:   New throttle mode: 0001 (was 0010)
> Jul 20 00:21:56 [12204] ha-idg-1       lrmd:  warning: 
> child_timeout_callback:  dlm_monitor_30000 process (PID 11816) timed out
> Jul 20 00:21:56 [12204] ha-idg-1       lrmd:  warning: operation_finished:   
>    dlm_monitor_30000:11816 - timed out after 20000ms
> Jul 20 00:21:56 [32512] ha-idg-1       crmd:    error: process_lrm_event:    
>    Result of monitor operation for dlm on ha-idg-1: Timed Out | call=1255 
> key=dlm_monitor_30000 timeout=20000ms
> Jul 20 00:21:56 [32512] ha-idg-1       crmd:     info: exec_alert_list: 
> Sending resource alert via smtp_alert to informatic.idg at helmholtz-muenchen.de 
> Jul 20 00:21:56 [12204] ha-idg-1       lrmd:     info: 
> process_lrmd_alert_exec: Executing alert smtp_alert for 
> 8f934e90-12f5-4bad-b4f4-55ac933f01c6
> 
> Can that interfere with DLM?

It depends ;-)
If the CPU load is mostly user load, then (also depending on the number of CPUs you have) probably not, but if the load is I/O or system load, it could affect any pacemaker process in a bad way. I think you'll have to analyze your load, and maybe adjust the timeouts.
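
To tell user load from system or I/O load without extra tools, one option is to compare two samples of the aggregate "cpu" line in /proc/stat. This is only a rough sketch: the function name cpu_split is made up for illustration, and it assumes the Linux field order described in proc(5) (user, nice, system, idle, iowait, irq, softirq, steal).

```sh
#!/bin/sh
# cpu_split BEFORE AFTER
# Print the share of user, system and iowait time between two
# "cpu ..." lines taken from /proc/stat.
# (Hypothetical helper, not part of pacemaker or monit.)
cpu_split()
{
    printf '%s\n%s\n' "$1" "$2" | awk '
        # first sample: remember the counters (fields 2..9)
        NR == 1 { for (i = 2; i <= 9; i++) a[i] = $i; next }
        # second sample: compute the deltas and their sum
        {
            total = 0
            for (i = 2; i <= 9; i++) { d[i] = $i - a[i]; total += d[i] }
            if (total <= 0) { print "no ticks elapsed"; exit 1 }
            # usr = user + nice, sys = system + irq + softirq
            printf "usr=%.1f%% sys=%.1f%% iowait=%.1f%%\n",
                100 * (d[2] + d[3]) / total,
                100 * (d[4] + d[7] + d[8]) / total,
                100 * d[6] / total
        }'
}
```

It could be used like this, sampling over five seconds:
    before=$(grep '^cpu ' /proc/stat); sleep 5; after=$(grep '^cpu ' /proc/stat)
    cpu_split "$before" "$after"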

You could use monit to examine your system load (this is just some idle VM):
  status                       OK
  monitoring status            Monitored
  monitoring mode              active
  on reboot                    start
  load average                 [0.00] [0.00] [0.00]
  cpu                          0.2%usr 0.1%sys 0.0%nice 0.0%iowait 0.0%hardirq 0.0%softirq 0.0%steal 0.0%guest 0.0%guestnice
  memory usage                 442.1 MB [22.3%]
  swap usage                   20.5 MB [1.0%]
  uptime                       13d 17h 41m
  boot time                    Thu, 14 Jul 2022 17:40:58
  filedescriptors              1376 [0.7% of 198048 limit]
  data collected               Thu, 28 Jul 2022 11:20:41

You could configure action scripts like this:
    if loadavg (1min) per core > 4 then exec "/var/lib/monit/log-top.sh"
    if loadavg (5min) per core > 2 then exec "/var/lib/monit/log-top.sh"
    if loadavg (15min) per core > 1 then exec "/var/lib/monit/log-top.sh"
    if memory usage > 90% for 2 cycles then exec "/var/lib/monit/log-top.sh"
    if swap usage > 25% for 2 cycles then exec "/var/lib/monit/log-top.sh"
    if swap usage > 50% then exec "/var/lib/monit/log-top.sh"
    if cpu usage (system) > 20% for 3 cycles then exec "/var/lib/monit/log-top.sh"
    if cpu usage (wait) > 80% then exec "/var/lib/monit/log-top.sh"

A possible script could be (this mess created by myself):
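#!/bin/sh
# Append a snapshot of the system state to a log file when monit fires.

# sect TITLE COMMAND...: print a section header, then run COMMAND.
# eval is used so that a pipeline can be passed as a single string.
sect()
{
    echo "--- $1 ---"
    shift
    eval "$@"
}

{
    echo "========== $(/bin/date) =========="
    # monit passes event details in MONIT_* environment variables
    sect 'MONIT env' 'env | grep ^MONIT_'
    sect 'mpstat' /usr/bin/mpstat
    sect 'vmstat' /usr/bin/vmstat
    # batch mode, one iteration, show threads, hide idle tasks
    sect 'top' /usr/bin/top -b -n 1 -Hi
} >> /var/log/monit/top.log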

Regards,
Ulrich

> 
> Bernd