[ClusterLabs] [EXT] Problem with DLM

Tue Jul 26 15:47:41 EDT 2022

On Tue, Jul 26, 2022 at 12:36 PM Lentes, Bernd
<bernd.lentes at helmholtz-muenchen.de> wrote:
>
> ----- On 26 Jul, 2022, at 20:06, Ulrich Windl Ulrich.Windl at rz.uni-regensburg.de wrote:
>
> > Hi Bernd!
> >
> > I think the answer may be some time before the timeout was reported; maybe a
> > network issue? Or a very high load. It's hard to say from the logs...
>
> Yes, I had a high load before:
> Jul 20 00:17:42 [32512] ha-idg-1       crmd:   notice: throttle_check_thresholds:       High CPU load detected: 90.080002
> Jul 20 00:18:12 [32512] ha-idg-1       crmd:   notice: throttle_check_thresholds:       High CPU load detected: 76.169998
> Jul 20 00:18:42 [32512] ha-idg-1       crmd:   notice: throttle_check_thresholds:       High CPU load detected: 85.629997
> Jul 20 00:19:12 [32512] ha-idg-1       crmd:   notice: throttle_check_thresholds:       High CPU load detected: 70.660004
> Jul 20 00:19:42 [32512] ha-idg-1       crmd:   notice: throttle_check_thresholds:       High CPU load detected: 58.340000
> Jul 20 00:20:12 [32512] ha-idg-1       crmd:     info: throttle_check_thresholds:       Moderate CPU load detected: 48.740002
> Jul 20 00:20:12 [32512] ha-idg-1       crmd:     info: throttle_send_command:   New throttle mode: 0010 (was 0100)
> Jul 20 00:20:42 [32512] ha-idg-1       crmd:     info: throttle_check_thresholds:       Moderate CPU load detected: 41.889999
> Jul 20 00:21:12 [32512] ha-idg-1       crmd:     info: throttle_send_command:   New throttle mode: 0001 (was 0010)
> Jul 20 00:21:56 [12204] ha-idg-1       lrmd:  warning: child_timeout_callback:  dlm_monitor_30000 process (PID 11816) timed out
> Jul 20 00:21:56 [12204] ha-idg-1       lrmd:  warning: operation_finished:      dlm_monitor_30000:11816 - timed out after 20000ms
> Jul 20 00:21:56 [32512] ha-idg-1       crmd:    error: process_lrm_event:       Result of monitor operation for dlm on ha-idg-1: Timed Out | call=1255 key=dlm_monitor_30000 timeout=20000ms
> Jul 20 00:21:56 [32512] ha-idg-1       crmd:     info: exec_alert_list: Sending resource alert via smtp_alert to informatic.idg at helmholtz-muenchen.de
> Jul 20 00:21:56 [12204] ha-idg-1       lrmd:     info: process_lrmd_alert_exec: Executing alert smtp_alert for 8f934e90-12f5-4bad-b4f4-55ac933f01c6
>
> Can that interfere with DLM?

High load can potentially interfere with just about any process,
including the monitor operation of the ocf:pacemaker:controld resource
agent (which is what timed out) or any of its child processes. High
load can be caused by storage latency, overworking the system, or
other assorted factors.

And as Ulrich correctly noted, the kernel messages occur after the
monitor timeout. They were probably an expected part of the cluster's
attempt to recover the resource.
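
If it helps to line up the load notices against the monitor timeout, here is a rough Python sketch that pulls both kinds of entries out of a log excerpt and reports how long after the last "High CPU load" notice the timeout was logged. It is only an illustration, not a Pacemaker tool: the function names (summarize, gap_after_last_high) are made up for this example, the regular expressions are written against the exact line layout quoted above, and the year is assumed to be 2022 because the log lines do not carry one.

```python
import re
from datetime import datetime

# Sample lines copied from the log excerpt quoted earlier in this thread.
LOG = """\
Jul 20 00:19:42 [32512] ha-idg-1       crmd:   notice: throttle_check_thresholds:       High CPU load detected: 58.340000
Jul 20 00:20:12 [32512] ha-idg-1       crmd:     info: throttle_check_thresholds:       Moderate CPU load detected: 48.740002
Jul 20 00:21:56 [12204] ha-idg-1       lrmd:  warning: operation_finished:      dlm_monitor_30000:11816 - timed out after 20000ms
"""

# Month names are mapped by hand so parsing does not depend on the locale.
MONTHS = {name: i + 1 for i, name in enumerate(
    "Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec".split())}

TS = r"(\w{3}) +(\d{1,2}) (\d\d):(\d\d):(\d\d)"
# These patterns assume the line layout shown in the excerpt above.
LOAD_RE = re.compile(
    TS + r" .*throttle_check_thresholds:\s+(\w+) CPU load detected: ([\d.]+)")
TIMEOUT_RE = re.compile(
    TS + r" .*operation_finished:\s+(\S+):\d+ - timed out after (\d+)ms")


def _timestamp(match, year):
    """Build a datetime from the first five groups of a match."""
    mon, day, hh, mm, ss = match.groups()[:5]
    return datetime(year, MONTHS[mon], int(day), int(hh), int(mm), int(ss))


def summarize(log_text, year=2022):
    """Return (loads, timeouts) found in log_text.

    loads:    list of (timestamp, level, value)
    timeouts: list of (timestamp, operation, timeout_ms)
    The year is an assumption; the log lines themselves omit it.
    """
    loads, timeouts = [], []
    for line in log_text.splitlines():
        m = LOAD_RE.search(line)
        if m:
            loads.append((_timestamp(m, year), m.group(6), float(m.group(7))))
            continue
        m = TIMEOUT_RE.search(line)
        if m:
            timeouts.append((_timestamp(m, year), m.group(6), int(m.group(7))))
    return loads, timeouts


def gap_after_last_high(loads, timeouts):
    """Seconds from the last 'High' load notice to the first timeout, or None."""
    highs = [ts for ts, level, _ in loads if level == "High"]
    if not highs or not timeouts:
        return None
    return (timeouts[0][0] - max(highs)).total_seconds()


if __name__ == "__main__":
    loads, timeouts = summarize(LOG)
    print(gap_after_last_high(loads, timeouts))
```

Because the patterns use search() rather than match(), the same sketch should also work on lines that still carry a leading "> " quote prefix. Kernel messages could be added as a third pattern to check where they fall relative to the timeout.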

>
> Bernd

-- 
Regards,

Reid Wahl (He/Him)
Senior Software Engineer, Red Hat
RHEL High Availability - Pacemaker