[ClusterLabs] Antw: Re: Antw: [EXT] VIP monitor Timed Out

Strahil Nikolov hunter86_bg at yahoo.com
Tue Jul 20 11:47:10 EDT 2021


I think Ulrich meant the "dirty" buffers, like the ones described at https://www.suse.com/support/kb/doc/?id=000017857


In my experience, you should set the background dirty tunable as low as possible (say 500-600 MB) and set the other tunable to at least double that.

Keep in mind that you can use either dirty_ratio or dirty_bytes, and either dirty_background_ratio or dirty_background_bytes, but never both members of a pair.
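
As a rough sketch of the advice above (not a recommendation for any particular system), the following Python snippet renders a sysctl.d-style fragment that sets vm.dirty_background_bytes to a chosen size and vm.dirty_bytes to a multiple of it. The function name render_dirty_sysctl, the 512 MiB default, the factor of 2 and the file name in the comment are assumptions made for illustration; only the two vm.* tunables come from the kernel. Using the *_bytes variants means the matching *_ratio tunables are not used.

```python
MIB = 1024 * 1024


def render_dirty_sysctl(background_mib: int = 512, factor: int = 2) -> str:
    """Return sysctl.d-style lines for the byte-based dirty page limits.

    background_mib: size at which background writeback should start.
    factor: how much larger the blocking limit (vm.dirty_bytes) should be.
    """
    if background_mib <= 0 or factor < 1:
        raise ValueError("background_mib must be positive and factor at least 1")
    background = background_mib * MIB
    hard_limit = background * factor
    lines = [
        "# Hypothetical fragment, e.g. /etc/sysctl.d/90-dirty.conf",
        f"vm.dirty_background_bytes = {background}",
        f"vm.dirty_bytes = {hard_limit}",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Print the fragment; review it before installing it anywhere.
    print(render_dirty_sysctl(), end="")
```

The output would still have to be written to a file and loaded (for example with "sysctl --system") by an administrator; the sketch only does the arithmetic.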


Best Regards,
Strahil Nikolov




On Tuesday, 20 July 2021, 18:04:36 GMT+3, PASERO Florent <florent.pasero at externe.bnpparibas.com> wrote:





Thanks, Ulrich!

Could you explain how to tune the kernel to limit the amount of dirty buffers?

Br,
Florent 


Classification : Internal

-----Original Message-----
From: Users <users-bounces at clusterlabs.org> On Behalf Of Ulrich Windl
Sent: Tuesday, 20 July 2021 12:02
To: users at clusterlabs.org
Subject: [ClusterLabs] Antw: Re: Antw: [EXT] VIP monitor Timed Out

Hi!

In the commands traced, no command (that is, the monitor) took more than 3 seconds, so either that is *not* the timeout, or pacemaker got significantly delayed.
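
One way to check how long a traced monitor run took is to compare the first and last timestamps in the trace file. The sketch below assumes the line format seen in the traces in this thread ("+ HH:MM:SS: function:line: command", optionally prefixed by ">" quoting); the function name trace_elapsed_seconds is made up for illustration, and the timestamps only have one-second resolution.

```python
import re
from typing import Iterable, Optional

# Matches xtrace lines such as "+ 15:01:27: main:69: ...",
# optionally preceded by mail quoting (">").
_TIMESTAMP = re.compile(r"^(?:>\s*)*\++\s+(\d{2}):(\d{2}):(\d{2}):")


def trace_elapsed_seconds(lines: Iterable[str]) -> Optional[int]:
    """Return seconds between the first and last timestamped trace line.

    Returns None when no line carries a timestamp. A run that crosses
    midnight is handled by taking the difference modulo one day.
    """
    first = last = None
    for line in lines:
        match = _TIMESTAMP.match(line)
        if not match:
            continue  # environment dump or wrapped continuation line
        hours, minutes, seconds = (int(group) for group in match.groups())
        stamp = hours * 3600 + minutes * 60 + seconds
        if first is None:
            first = stamp
        last = stamp
    if first is None:
        return None
    return (last - first) % 86400
```

Calling it with open(path) for each file under the trace directory gives a quick list of run times to compare against the configured timeout.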
One reason I could imagine is a "read stall". For example, you could trigger one if you rapidly fill the block cache with dirty blocks (waiting to be written), so that a read request has to wait for buffers to be written out and thus become available. If you are writing like mad, available read buffers might be scarce.
Fortunately you can tune the kernel to limit the amount of dirty buffers.
I'm not saying that is your problem, but the trace looks OK.
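
To see whether dirty buffers pile up around the time of a timeout, one could sample the Dirty and Writeback counters from /proc/meminfo. The following is a minimal sketch under that assumption; the function names parse_dirty_kib and read_dirty_kib are hypothetical, and the sampling interval and any logging around it are left to the reader.

```python
from typing import Dict


def parse_dirty_kib(meminfo_text: str) -> Dict[str, int]:
    """Extract the Dirty and Writeback counters (in kB) from meminfo text."""
    wanted = {"Dirty", "Writeback"}
    found: Dict[str, int] = {}
    for line in meminfo_text.splitlines():
        name, sep, rest = line.partition(":")
        # Exact name match, so "WritebackTmp" is not picked up by accident.
        if sep and name in wanted:
            found[name] = int(rest.split()[0])
    return found


def read_dirty_kib(path: str = "/proc/meminfo") -> Dict[str, int]:
    """Read the counters from the running kernel (Linux only)."""
    with open(path, encoding="ascii") as handle:
        return parse_dirty_kib(handle.read())
```

Logging read_dirty_kib() together with a timestamp every few seconds, and comparing that log with the time of the next 'Timed Out' entry, would show whether the two coincide.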

Regards,
Ulrich

>>> PASERO Florent <florent.pasero at externe.bnpparibas.com> wrote on 20.07.2021 at 11:51 in message
<PR0P264MB2139D47D0ACF66A81F3DCD37B4E29 at PR0P264MB2139.FRAP264.PROD.OUTLOOK.COM>:

> Hi,
> 
> Once or twice a week, we have a 'Timed out' on our VIP.
> 
> The latest one:
> Cluster Summary:
>  * Stack: corosync
>  * Current DC: server07 (version 2.0.5-9.el8_4.1-ba59be7122) - partition with quorum
>  * Last updated: Tue Jul 20 11:39:22 2021
>  * Last change:  Mon Jul  5 09:42:14 2021 by hacluster via cibadmin on server06
>  * 2 nodes configured
>  * 2 resource instances configured
> 
> Node List:
>  * Online: [ server06 server07 ]
> 
> Active Resources:
>  * Resource Group: zbx_prod_Web_Core:
>    * VIP      (ocf::heartbeat:IPaddr2):        Started server07
>    * ZabbixServer      (systemd:zabbix-server):        Started server07
> 
> Failed Resource Actions:
>  * VIP_monitor_10000 on server07 'error' (1): call=123, status='Timed Out', exitreason='', last-rc-change='2021-07-19 15:02:27 +02:00', queued=0ms, exec=0ms
> 
> Any ideas? There is nothing very revealing in the following log files.
> 
> Here are the monitor trace files from just before and just after the timeout.
> 
> VIP.monitor.2021-07-19.15:01:27 :
> +++ 15:01:27: ocf_start_trace:999: echo
> +++ 15:01:27: ocf_start_trace:999: printenv
> +++ 15:01:27: ocf_start_trace:999: sort
> ++ 15:01:27: ocf_start_trace:999: env='
> HA_LOGFACILITY=daemon
> HA_LOGFILE=/var/log/pacemaker/pacemaker.log
> HA_cluster_type=corosync
> HA_debug=0
> HA_logfacility=daemon
> HA_logfile=/var/log/pacemaker/pacemaker.log
> HA_mcp=true
> HA_quorum_type=corosync
> INVOCATION_ID=5cd03e610fbf4a9bb3ffe2b30e1fb5d4
> JOURNAL_STREAM=9:4433035
> LC_ALL=C
> OCF_EXIT_REASON_PREFIX=ocf-exit-reason:
> OCF_RA_VERSION_MAJOR=1
> OCF_RA_VERSION_MINOR=0
> OCF_RESKEY_CRM_meta_interval=10000
> OCF_RESKEY_CRM_meta_name=monitor
> OCF_RESKEY_CRM_meta_on_node=server07
> OCF_RESKEY_CRM_meta_on_node_uuid=2
> OCF_RESKEY_CRM_meta_timeout=20000
> OCF_RESKEY_crm_feature_set=3.7.1
> OCF_RESKEY_ip=10.0.0.67
> OCF_RESKEY_monitor_retries=10
> OCF_RESKEY_trace_file=/apps/Zabbix_Log/Core
> OCF_RESKEY_trace_ra=1
> OCF_RESOURCE_INSTANCE=VIP
> OCF_RESOURCE_PROVIDER=heartbeat
> OCF_RESOURCE_TYPE=IPaddr2
> OCF_ROOT=/usr/lib/ocf
> PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/sbin:/usr/bin:/usr/ucb
> PCMK_cluster_type=corosync
> PCMK_debug=0
> PCMK_logfacility=daemon
> PCMK_logfile=/var/log/pacemaker/pacemaker.log
> PCMK_mcp=true
> PCMK_quorum_type=corosync
> PCMK_service=pacemaker-execd
> PCMK_watchdog=false
> PWD=/var/lib/pacemaker/cores
> SHLVL=1
> VALGRIND_OPTS=--leak-check=full --trace-children=no --vgdb=no --num-callers=25 --log-file=/var/lib/pacemaker/valgrind-%p --suppressions=/usr/share/pacemaker/tests/valgrind-pcmk.suppressions --gen-suppressions=all
> _=/usr/bin/printenv
> __OCF_TRC_DEST=/var/lib/heartbeat/trace_ra/IPaddr2/VIP.monitor.2021-07-19.15:01:27
> __OCF_TRC_MANAGE=1'
> ++ 15:01:27: source:1053: ocf_is_true ''
> ++ 15:01:27: ocf_is_true:103: case "$1" in
> ++ 15:01:27: ocf_is_true:103: case "$1" in
> ++ 15:01:27: ocf_is_true:105: false
> + 15:01:27: main:69: . /usr/lib/ocf/lib/heartbeat/findif.sh
> + 15:01:27: main:72: OCF_RESKEY_lvs_support_default=false
> + 15:01:27: main:73: OCF_RESKEY_lvs_ipv6_addrlabel_default=false
> + 15:01:27: main:74: OCF_RESKEY_lvs_ipv6_addrlabel_value_default=99
> + 15:01:27: main:75: OCF_RESKEY_clusterip_hash_default=sourceip-sourceport
> + 15:01:27: main:76: OCF_RESKEY_unique_clone_address_default=false
> + 15:01:27: main:77: OCF_RESKEY_arp_interval_default=200
> + 15:01:27: main:78: OCF_RESKEY_arp_count_default=5
> + 15:01:27: main:79: OCF_RESKEY_arp_count_refresh_default=0
> + 15:01:27: main:80: OCF_RESKEY_arp_bg_default=true
> + 15:01:27: main:81: OCF_RESKEY_run_arping_default=false
> + 15:01:27: main:82: OCF_RESKEY_noprefixroute_default=false
> + 15:01:27: main:83: OCF_RESKEY_preferred_lft_default=forever
> + 15:01:27: main:84: OCF_RESKEY_monitor_retries=1
> + 15:01:27: main:86: : false
> + 15:01:27: main:87: : false
> + 15:01:27: main:88: : 99
> + 15:01:27: main:89: : sourceip-sourceport
> + 15:01:27: main:90: : false
> + 15:01:27: main:91: : 200
> + 15:01:27: main:92: : 5
> + 15:01:27: main:93: : 0
> + 15:01:27: main:94: : true
> + 15:01:27: main:95: : false
> + 15:01:27: main:96: : false
> + 15:01:27: main:97: : forever
> + 15:01:27: main:98: : 1
> + 15:01:27: main:101: SENDARP=/usr/libexec/heartbeat/send_arp
> + 15:01:27: main:102: SENDUA=/usr/libexec/heartbeat/send_ua
> + 15:01:27: main:103: FINDIF=findif
> + 15:01:27: main:104: VLDIR=/run/resource-agents
> + 15:01:27: main:105: SENDARPPIDDIR=/run/resource-agents
> + 15:01:27: main:106: CIP_lockfile=/run/resource-agents/IPaddr2-CIP-10.0.0.67
> + 15:01:27: main:108: IPADDR2_CIP_IPTABLES=iptables
> + 15:01:27: main:1261: ocf_is_true false
> + 15:01:27: ocf_is_true:103: case "$1" in
> + 15:01:27: ocf_is_true:105: false
> + 15:01:27: main:1268: case $__OCF_ACTION in
> + 15:01:27: main:1276: ip_validate
> + 15:01:27: ip_validate:1184: check_binary ip
> + 15:01:27: check_binary:57: have_binary ip
> + 15:01:27: have_binary:69: '[' '' = 1 ']'
> ++ 15:01:27: have_binary:72: echo ip
> ++ 15:01:27: have_binary:72: sed -e 's/ -.*//'
> + 15:01:27: have_binary:72: local bin=ip
> ++ 15:01:27: have_binary:73: which ip
> + 15:01:27: have_binary:73: test -x /usr/sbin/ip
> + 15:01:27: ip_validate:1185: IP_CIP=
> + 15:01:27: ip_validate:1187: ip_init
> + 15:01:27: ip_init:423: local rc
> ++ 15:01:27: ip_init:425: uname -s
> + 15:01:27: ip_init:425: '[' XLinux '!=' XLinux ']'
> + 15:01:27: ip_init:430: '[' X10.0.0.67 = X ']'
> + 15:01:27: ip_init:436: case $__OCF_ACTION in
> + 15:01:27: ip_init:438: true
> + 15:01:27: ip_init:441: : 'YAY!'
> + 15:01:27: ip_init:447: BASEIP=10.0.0.67
> + 15:01:27: ip_init:448: BRDCAST=
> + 15:01:27: ip_init:449: NIC=
> + 15:01:27: ip_init:453: '[' '!' -z '' -a -z '' ']'
> + 15:01:27: ip_init:458: NETMASK=
> + 15:01:27: ip_init:459: IFLABEL=
> + 15:01:27: ip_init:460: IF_MAC=
> + 15:01:27: ip_init:462: IP_INC_GLOBAL=1
> ++ 15:01:27: ip_init:463: expr 0 + 1
> + 15:01:27: ip_init:463: IP_INC_NO=1
> + 15:01:27: ip_init:465: ocf_is_true false
> + 15:01:27: ocf_is_true:103: case "$1" in
> + 15:01:27: ocf_is_true:105: false
> + 15:01:27: ip_init:470: ocf_is_decimal 1
> + 15:01:27: ocf_is_decimal:94: case "$1" in
> + 15:01:27: ocf_is_decimal:98: true
> + 15:01:27: ip_init:470: '[' 1 -gt 0 ']'
> + 15:01:27: ip_init:471: :
> + 15:01:27: ip_init:477: echo 10.0.0.67
> + 15:01:27: ip_init:477: grep -qs :
> + 15:01:27: ip_init:478: '[' 1 -ne 0 ']'
> + 15:01:27: ip_init:479: FAMILY=inet
> + 15:01:27: ip_init:480: ocf_is_true false
> + 15:01:27: ocf_is_true:103: case "$1" in
> + 15:01:27: ocf_is_true:105: false
> + 15:01:27: ip_init:507: case $NIC in
> + 15:01:27: ip_init:507: case $NIC in
> ++ 15:01:27: ip_init:518: findif
> ++ 15:01:27: findif:197: local match=10.0.0.67
> ++ 15:01:27: findif:198: local family
> ++ 15:01:27: findif:199: local scope
> ++ 15:01:27: findif:200: local nic=
> ++ 15:01:27: findif:201: local netmask=
> ++ 15:01:27: findif:202: local brdcast=
> ++ 15:01:27: findif:204: echo 10.0.0.67
> ++ 15:01:27: findif:204: grep -qs :
> ++ 15:01:27: findif:205: '[' 1 = 0 ']'
> ++ 15:01:27: findif:208: family=inet
> ++ 15:01:27: findif:209: scope='scope link'
> ++ 15:01:27: findif:211: findif_check_params inet
> ++ 15:01:27: findif_check_params:123: local family=inet
> ++ 15:01:27: findif_check_params:124: local match=10.0.0.67
> ++ 15:01:27: findif_check_params:125: local nic=
> ++ 15:01:27: findif_check_params:127: netmask=
> ++ 15:01:27: findif_check_params:128: local brdcast=
> ++ 15:01:27: findif_check_params:129: local errmsg
> ++ 15:01:27: findif_check_params:131: maybe_convert_dotted_quad_to_cidr
> ++ 15:01:27: maybe_convert_dotted_quad_to_cidr:55: case $netmask in
> ++ 15:01:27: maybe_convert_dotted_quad_to_cidr:55: case $netmask in
> ++ 15:01:27: maybe_convert_dotted_quad_to_cidr:68: return
> ++ 15:01:27: findif_check_params:135: case $__OCF_ACTION in
> ++ 15:01:27: findif_check_params:135: case $__OCF_ACTION in
> ++ 15:01:27: findif_check_params:137: return 0
> ++ 15:01:27: findif:213: '[' -n '' ']'
> ++ 15:01:27: findif:216: '[' -n '' ']'
> +++ 15:01:27: findif:220: ip -o -f inet route list match 10.0.0.67 scope link
> +++ 15:01:27: findif:220: awk 'BEGIN{best=0} /\// { mask=$1; sub(".*/", "", mask); if( int(mask)>=best ) { best=int(mask); best_ln=$0; } } END{print best_ln}'
> ++ 15:01:27: findif:220: set -- 10.0.0.0/24 dev team0 proto kernel src 10.0.0.66 metric 350
> ++ 15:01:27: findif:222: '[' 9 = 0 ']'
> ++ 15:01:27: findif:229: '[' -z '' -o -z '' ']'
> ++ 15:01:27: findif:230: '[' 9 = 0 ']'
> ++ 15:01:27: findif:234: case $1 in
> ++ 15:01:27: findif:234: case $1 in
> ++ 15:01:27: findif:235: : OK
> ++ 15:01:27: findif:243: '[' -z '' ']'
> ++ 15:01:27: findif:243: nic=team0
> ++ 15:01:27: findif:244: '[' -z '' ']'
> ++ 15:01:27: findif:244: netmask=24
> ++ 15:01:27: findif:245: '[' inet = inet ']'
> ++ 15:01:27: findif:246: '[' -z '' ']'
> ++ 15:01:27: findif:247: '[' -n 10.0.0.66 ']'
> +++ 15:01:27: findif:248: ip -o -f inet addr show
> +++ 15:01:27: findif:248: grep 10.0.0.66
> ++ 15:01:27: findif:248: set -- 5: team0 inet 10.0.0.66/24 brd 10.0.0.255 scope global noprefixroute 'team0\' valid_lft forever preferred_lft forever
> ++ 15:01:27: findif:249: '[' brd = brd ']'
> ++ 15:01:27: findif:249: brdcast=10.0.0.255
> ++ 15:01:27: findif:258: echo 'team0 netmask 24 broadcast 10.0.0.255'
> ++ 15:01:27: findif:259: return 0
> + 15:01:27: ip_init:507: case $NIC in
> + 15:01:27: ip_init:518: NICINFO='team0 netmask 24 broadcast 10.0.0.255'
> + 15:01:27: ip_init:519: rc=0
> + 15:01:27: ip_init:521: '[' 0 -eq 0 ']'
> ++ 15:01:27: ip_init:523: echo 'team0 netmask 24 broadcast 10.0.0.255'
> ++ 15:01:27: ip_init:523: sed -e 's/netmask\ //;s/broadcast\ //'
> + 15:01:27: ip_init:523: NICINFO='team0 24 10.0.0.255'
> ++ 15:01:27: ip_init:524: echo 'team0 24 10.0.0.255'
> ++ 15:01:27: ip_init:524: cut '-d ' -f1
> + 15:01:27: ip_init:524: NIC=team0
> ++ 15:01:27: ip_init:525: echo 'team0 24 10.0.0.255'
> ++ 15:01:27: ip_init:525: cut '-d ' -f2
> + 15:01:27: ip_init:525: NETMASK=24
> ++ 15:01:27: ip_init:526: echo 'team0 24 10.0.0.255'
> ++ 15:01:27: ip_init:526: cut '-d ' -f3
> + 15:01:27: ip_init:526: BRDCAST=10.0.0.255
> + 15:01:27: ip_init:541: SENDARPPIDFILE=/run/resource-agents/send_arp-10.0.0.67
> + 15:01:27: ip_init:543: '[' -n '' ']'
> + 15:01:27: ip_init:551: '[' 1 -gt 1 ']'
> + 15:01:27: ip_validate:1189: set_send_arp_program
> + 15:01:27: set_send_arp_program:1149: ARP_SENDER=send_arp
> + 15:01:27: set_send_arp_program:1150: '[' -n '' ']'
> + 15:01:27: set_send_arp_program:1171: is_infiniband
> + 15:01:27: is_infiniband:767: grep link/infiniband
> + 15:01:27: is_infiniband:767: ip link show team0
> + 15:01:27: ip_validate:1191: '[' -n '' ']'
> + 15:01:27: ip_validate:1202: ocf_is_true false
> + 15:01:27: ocf_is_true:103: case "$1" in
> + 15:01:27: ocf_is_true:105: false
> + 15:01:27: ip_validate:1208: ocf_is_decimal 200
> + 15:01:27: ocf_is_decimal:94: case "$1" in
> + 15:01:27: ocf_is_decimal:98: true
> + 15:01:27: ip_validate:1208: '[' 200 -gt 0 ']'
> + 15:01:27: ip_validate:1209: :
> + 15:01:27: ip_validate:1215: ocf_is_decimal 5
> + 15:01:27: ocf_is_decimal:94: case "$1" in
> + 15:01:27: ocf_is_decimal:98: true
> + 15:01:28: ip_validate:1215: '[' 5 -gt 0 ']'
> + 15:01:28: ip_validate:1216: :
> + 15:01:28: ip_validate:1222: '[' -z forever ']'
> + 15:01:28: ip_validate:1227: '[' -n '' ']'
> + 15:01:28: main:1278: case $__OCF_ACTION in
> + 15:01:28: main:1292: ip_monitor
> ++ 15:01:28: ip_monitor:1131: ip_served
> ++ 15:01:28: ip_served:925: '[' -z team0 ']'
> +++ 15:01:28: ip_served:930: find_interface 10.0.0.67 24
> +++ 15:01:28: find_interface:579: local ipaddr=10.0.0.67
> +++ 15:01:28: find_interface:580: local netmask=24
> +++ 15:01:28: find_interface:581: local iface=
> ++++ 15:01:28: find_interface:586: seq 1 1
> +++ 15:01:28: find_interface:586: for i in $(seq 1 $OCF_RESKEY_monitor_retries)
> ++++ 15:01:28: find_interface:590: ip -o -f inet addr show
> ++++ 15:01:28: find_interface:590: cut -d ' ' -f2
> ++++ 15:01:28: find_interface:590: grep '\ 10.0.0.67/24'
> ++++ 15:01:28: find_interface:590: grep -v '^ipsec[0-9][0-9]*$'
> +++ 15:01:28: find_interface:590: iface=team0
> +++ 15:01:28: find_interface:592: '[' -n team0 ']'
> +++ 15:01:28: find_interface:593: break
> +++ 15:01:28: find_interface:601: echo team0
> +++ 15:01:28: find_interface:602: return 0
> ++ 15:01:28: ip_served:930: cur_nic=team0
> ++ 15:01:28: ip_served:932: '[' -z team0 ']'
> ++ 15:01:28: ip_served:937: '[' -z '' ']'
> ++ 15:01:28: ip_served:938: for i in $cur_nic
> ++ 15:01:28: ip_served:940: '[' team0 = team0 ']'
> ++ 15:01:28: ip_served:941: echo ok
> ++ 15:01:28: ip_served:942: return 0
> + 15:01:28: ip_monitor:1131: local ip_status=ok
> + 15:01:28: ip_monitor:1132: case $ip_status in
> + 15:01:28: ip_monitor:1134: run_arp_sender refresh
> + 15:01:28: run_arp_sender:844: '[' xrefresh = xrefresh ']'
> + 15:01:28: run_arp_sender:845: ARP_COUNT=0
> + 15:01:28: run_arp_sender:846: LOGLEVEL=debug
> + 15:01:28: run_arp_sender:851: '[' 0 -eq 0 ']'
> + 15:01:28: run_arp_sender:852: return
> + 15:01:28: ip_monitor:1135: return 0
> 
> VIP.monitor.2021-07-19.15:03:14 :
> +++ 15:03:14: ocf_start_trace:999: echo
> +++ 15:03:14: ocf_start_trace:999: printenv
> +++ 15:03:14: ocf_start_trace:999: sort
> ++ 15:03:14: ocf_start_trace:999: env='
> HA_LOGFACILITY=daemon
> HA_LOGFILE=/var/log/pacemaker/pacemaker.log
> HA_cluster_type=corosync
> HA_debug=0
> HA_logfacility=daemon
> HA_logfile=/var/log/pacemaker/pacemaker.log
> HA_mcp=true
> HA_quorum_type=corosync
> INVOCATION_ID=5cd03e610fbf4a9bb3ffe2b30e1fb5d4
> JOURNAL_STREAM=9:4433035
> LC_ALL=C
> OCF_EXIT_REASON_PREFIX=ocf-exit-reason:
> OCF_RA_VERSION_MAJOR=1
> OCF_RA_VERSION_MINOR=0
> OCF_RESKEY_CRM_meta_interval=10000
> OCF_RESKEY_CRM_meta_name=monitor
> OCF_RESKEY_CRM_meta_on_node=server07
> OCF_RESKEY_CRM_meta_on_node_uuid=2
> OCF_RESKEY_CRM_meta_timeout=20000
> OCF_RESKEY_crm_feature_set=3.7.1
> OCF_RESKEY_ip=10.0.0.67
> OCF_RESKEY_monitor_retries=10
> OCF_RESKEY_trace_file=/apps/Zabbix_Log/Core
> OCF_RESKEY_trace_ra=1
> OCF_RESOURCE_INSTANCE=VIP
> OCF_RESOURCE_PROVIDER=heartbeat
> OCF_RESOURCE_TYPE=IPaddr2
> OCF_ROOT=/usr/lib/ocf
> PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/sbin:/usr/bin:/usr/ucb
> PCMK_cluster_type=corosync
> PCMK_debug=0
> PCMK_logfacility=daemon
> PCMK_logfile=/var/log/pacemaker/pacemaker.log
> PCMK_mcp=true
> PCMK_quorum_type=corosync
> PCMK_service=pacemaker-execd
> PCMK_watchdog=false
> PWD=/var/lib/pacemaker/cores
> SHLVL=1
> VALGRIND_OPTS=--leak-check=full --trace-children=no --vgdb=no --num-callers=25 --log-file=/var/lib/pacemaker/valgrind-%p --suppressions=/usr/share/pacemaker/tests/valgrind-pcmk.suppressions --gen-suppressions=all
> _=/usr/bin/printenv
> __OCF_TRC_DEST=/var/lib/heartbeat/trace_ra/IPaddr2/VIP.monitor.2021-07-19.15:03:14
> __OCF_TRC_MANAGE=1'
> ++ 15:03:14: source:1053: ocf_is_true ''
> ++ 15:03:14: ocf_is_true:103: case "$1" in
> ++ 15:03:14: ocf_is_true:103: case "$1" in
> ++ 15:03:14: ocf_is_true:105: false
> + 15:03:14: main:69: . /usr/lib/ocf/lib/heartbeat/findif.sh
> + 15:03:14: main:72: OCF_RESKEY_lvs_support_default=false
> + 15:03:14: main:73: OCF_RESKEY_lvs_ipv6_addrlabel_default=false
> + 15:03:14: main:74: OCF_RESKEY_lvs_ipv6_addrlabel_value_default=99
> + 15:03:14: main:75: OCF_RESKEY_clusterip_hash_default=sourceip-sourceport
> + 15:03:14: main:76: OCF_RESKEY_unique_clone_address_default=false
> + 15:03:14: main:77: OCF_RESKEY_arp_interval_default=200
> + 15:03:14: main:78: OCF_RESKEY_arp_count_default=5
> + 15:03:14: main:79: OCF_RESKEY_arp_count_refresh_default=0
> + 15:03:14: main:80: OCF_RESKEY_arp_bg_default=true
> + 15:03:14: main:81: OCF_RESKEY_run_arping_default=false
> + 15:03:14: main:82: OCF_RESKEY_noprefixroute_default=false
> + 15:03:14: main:83: OCF_RESKEY_preferred_lft_default=forever
> + 15:03:14: main:84: OCF_RESKEY_monitor_retries=1
> + 15:03:14: main:86: : false
> + 15:03:14: main:87: : false
> + 15:03:14: main:88: : 99
> + 15:03:14: main:89: : sourceip-sourceport
> + 15:03:14: main:90: : false
> + 15:03:14: main:91: : 200
> + 15:03:14: main:92: : 5
> + 15:03:14: main:93: : 0
> + 15:03:14: main:94: : true
> + 15:03:14: main:95: : false
> + 15:03:14: main:96: : false
> + 15:03:14: main:97: : forever
> + 15:03:14: main:98: : 1
> + 15:03:14: main:101: SENDARP=/usr/libexec/heartbeat/send_arp
> + 15:03:14: main:102: SENDUA=/usr/libexec/heartbeat/send_ua
> + 15:03:14: main:103: FINDIF=findif
> + 15:03:14: main:104: VLDIR=/run/resource-agents
> + 15:03:14: main:105: SENDARPPIDDIR=/run/resource-agents
> + 15:03:14: main:106: CIP_lockfile=/run/resource-agents/IPaddr2-CIP-10.0.0.67
> + 15:03:14: main:108: IPADDR2_CIP_IPTABLES=iptables
> + 15:03:14: main:1261: ocf_is_true false
> + 15:03:14: ocf_is_true:103: case "$1" in
> + 15:03:14: ocf_is_true:105: false
> + 15:03:14: main:1268: case $__OCF_ACTION in
> + 15:03:14: main:1276: ip_validate
> + 15:03:14: ip_validate:1184: check_binary ip
> + 15:03:14: check_binary:57: have_binary ip
> + 15:03:14: have_binary:69: '[' '' = 1 ']'
> ++ 15:03:14: have_binary:72: echo ip
> ++ 15:03:14: have_binary:72: sed -e 's/ -.*//'
> + 15:03:14: have_binary:72: local bin=ip
> ++ 15:03:14: have_binary:73: which ip
> + 15:03:14: have_binary:73: test -x /usr/sbin/ip
> + 15:03:14: ip_validate:1185: IP_CIP=
> + 15:03:14: ip_validate:1187: ip_init
> + 15:03:14: ip_init:423: local rc
> ++ 15:03:14: ip_init:425: uname -s
> + 15:03:14: ip_init:425: '[' XLinux '!=' XLinux ']'
> + 15:03:14: ip_init:430: '[' X10.0.0.67 = X ']'
> + 15:03:14: ip_init:436: case $__OCF_ACTION in
> + 15:03:14: ip_init:438: true
> + 15:03:14: ip_init:441: : 'YAY!'
> + 15:03:14: ip_init:447: BASEIP=10.0.0.67
> + 15:03:14: ip_init:448: BRDCAST=
> + 15:03:14: ip_init:449: NIC=
> + 15:03:14: ip_init:453: '[' '!' -z '' -a -z '' ']'
> + 15:03:14: ip_init:458: NETMASK=
> + 15:03:14: ip_init:459: IFLABEL=
> + 15:03:14: ip_init:460: IF_MAC=
> + 15:03:14: ip_init:462: IP_INC_GLOBAL=1
> ++ 15:03:14: ip_init:463: expr 0 + 1
> + 15:03:14: ip_init:463: IP_INC_NO=1
> + 15:03:14: ip_init:465: ocf_is_true false
> + 15:03:14: ocf_is_true:103: case "$1" in
> + 15:03:14: ocf_is_true:105: false
> + 15:03:14: ip_init:470: ocf_is_decimal 1
> + 15:03:14: ocf_is_decimal:94: case "$1" in
> + 15:03:14: ocf_is_decimal:98: true
> + 15:03:14: ip_init:470: '[' 1 -gt 0 ']'
> + 15:03:14: ip_init:471: :
> + 15:03:14: ip_init:477: echo 10.0.0.67
> + 15:03:14: ip_init:477: grep -qs :
> + 15:03:14: ip_init:478: '[' 1 -ne 0 ']'
> + 15:03:14: ip_init:479: FAMILY=inet
> + 15:03:14: ip_init:480: ocf_is_true false
> + 15:03:14: ocf_is_true:103: case "$1" in
> + 15:03:14: ocf_is_true:105: false
> + 15:03:14: ip_init:507: case $NIC in
> + 15:03:14: ip_init:507: case $NIC in
> ++ 15:03:14: ip_init:518: findif
> ++ 15:03:14: findif:197: local match=10.0.0.67
> ++ 15:03:14: findif:198: local family
> ++ 15:03:14: findif:199: local scope
> ++ 15:03:14: findif:200: local nic=
> ++ 15:03:14: findif:201: local netmask=
> ++ 15:03:14: findif:202: local brdcast=
> ++ 15:03:14: findif:204: echo 10.0.0.67
> ++ 15:03:14: findif:204: grep -qs :
> ++ 15:03:14: findif:205: '[' 1 = 0 ']'
> ++ 15:03:14: findif:208: family=inet
> ++ 15:03:14: findif:209: scope='scope link'
> ++ 15:03:14: findif:211: findif_check_params inet
> ++ 15:03:14: findif_check_params:123: local family=inet
> ++ 15:03:14: findif_check_params:124: local match=10.0.0.67
> ++ 15:03:14: findif_check_params:125: local nic=
> ++ 15:03:14: findif_check_params:127: netmask=
> ++ 15:03:14: findif_check_params:128: local brdcast=
> ++ 15:03:14: findif_check_params:129: local errmsg
> ++ 15:03:14: findif_check_params:131: maybe_convert_dotted_quad_to_cidr
> ++ 15:03:14: maybe_convert_dotted_quad_to_cidr:55: case $netmask in
> ++ 15:03:14: maybe_convert_dotted_quad_to_cidr:55: case $netmask in
> ++ 15:03:14: maybe_convert_dotted_quad_to_cidr:68: return
> ++ 15:03:14: findif_check_params:135: case $__OCF_ACTION in
> ++ 15:03:14: findif_check_params:135: case $__OCF_ACTION in
> ++ 15:03:14: findif_check_params:137: return 0
> ++ 15:03:14: findif:213: '[' -n '' ']'
> ++ 15:03:14: findif:216: '[' -n '' ']'
> +++ 15:03:14: findif:220: ip -o -f inet route list match 10.0.0.67 scope link
> +++ 15:03:14: findif:220: awk 'BEGIN{best=0} /\// { mask=$1; sub(".*/", "", mask); if( int(mask)>=best ) { best=int(mask); best_ln=$0; } } END{print best_ln}'
> ++ 15:03:14: findif:220: set -- 10.0.0.0/24 dev team0 proto kernel src 10.0.0.66 metric 350
> ++ 15:03:14: findif:222: '[' 9 = 0 ']'
> ++ 15:03:14: findif:229: '[' -z '' -o -z '' ']'
> ++ 15:03:14: findif:230: '[' 9 = 0 ']'
> ++ 15:03:14: findif:234: case $1 in
> ++ 15:03:14: findif:234: case $1 in
> ++ 15:03:14: findif:235: : OK
> ++ 15:03:14: findif:243: '[' -z '' ']'
> ++ 15:03:14: findif:243: nic=team0
> ++ 15:03:14: findif:244: '[' -z '' ']'
> ++ 15:03:14: findif:244: netmask=24
> ++ 15:03:14: findif:245: '[' inet = inet ']'
> ++ 15:03:14: findif:246: '[' -z '' ']'
> ++ 15:03:14: findif:247: '[' -n 10.0.0.66 ']'
> +++ 15:03:14: findif:248: ip -o -f inet addr show
> +++ 15:03:14: findif:248: grep 10.0.0.66
> ++ 15:03:14: findif:248: set -- 5: team0 inet 10.0.0.66/24 brd 10.0.0.255 scope global noprefixroute 'team0\' valid_lft forever preferred_lft forever
> ++ 15:03:14: findif:249: '[' brd = brd ']'
> ++ 15:03:14: findif:249: brdcast=10.0.0.255
> ++ 15:03:14: findif:258: echo 'team0 netmask 24 broadcast 10.0.0.255'
> ++ 15:03:14: findif:259: return 0
> + 15:03:14: ip_init:507: case $NIC in
> + 15:03:14: ip_init:518: NICINFO='team0 netmask 24 broadcast 10.0.0.255'
> + 15:03:14: ip_init:519: rc=0
> + 15:03:14: ip_init:521: '[' 0 -eq 0 ']'
> ++ 15:03:14: ip_init:523: echo 'team0 netmask 24 broadcast 10.0.0.255'
> ++ 15:03:14: ip_init:523: sed -e 's/netmask\ //;s/broadcast\ //'
> + 15:03:14: ip_init:523: NICINFO='team0 24 10.0.0.255'
> ++ 15:03:14: ip_init:524: echo 'team0 24 10.0.0.255'
> ++ 15:03:14: ip_init:524: cut '-d ' -f1
> + 15:03:14: ip_init:524: NIC=team0
> ++ 15:03:14: ip_init:525: echo 'team0 24 10.0.0.255'
> ++ 15:03:14: ip_init:525: cut '-d ' -f2
> + 15:03:14: ip_init:525: NETMASK=24
> ++ 15:03:14: ip_init:526: echo 'team0 24 10.0.0.255'
> ++ 15:03:14: ip_init:526: cut '-d ' -f3
> + 15:03:14: ip_init:526: BRDCAST=10.0.0.255
> + 15:03:14: ip_init:541: SENDARPPIDFILE=/run/resource-agents/send_arp-10.0.0.67
> + 15:03:14: ip_init:543: '[' -n '' ']'
> + 15:03:14: ip_init:551: '[' 1 -gt 1 ']'
> + 15:03:14: ip_validate:1189: set_send_arp_program
> + 15:03:14: set_send_arp_program:1149: ARP_SENDER=send_arp
> + 15:03:14: set_send_arp_program:1150: '[' -n '' ']'
> + 15:03:14: set_send_arp_program:1171: is_infiniband
> + 15:03:14: is_infiniband:767: ip link show team0
> + 15:03:14: is_infiniband:767: grep link/infiniband
> + 15:03:14: ip_validate:1191: '[' -n '' ']'
> + 15:03:14: ip_validate:1202: ocf_is_true false
> + 15:03:14: ocf_is_true:103: case "$1" in
> + 15:03:14: ocf_is_true:105: false
> + 15:03:14: ip_validate:1208: ocf_is_decimal 200
> + 15:03:14: ocf_is_decimal:94: case "$1" in
> + 15:03:14: ocf_is_decimal:98: true
> + 15:03:14: ip_validate:1208: '[' 200 -gt 0 ']'
> + 15:03:14: ip_validate:1209: :
> + 15:03:14: ip_validate:1215: ocf_is_decimal 5
> + 15:03:14: ocf_is_decimal:94: case "$1" in
> + 15:03:14: ocf_is_decimal:98: true
> + 15:03:14: ip_validate:1215: '[' 5 -gt 0 ']'
> + 15:03:14: ip_validate:1216: :
> + 15:03:14: ip_validate:1222: '[' -z forever ']'
> + 15:03:14: ip_validate:1227: '[' -n '' ']'
> + 15:03:14: main:1278: case $__OCF_ACTION in
> + 15:03:14: main:1292: ip_monitor
> ++ 15:03:14: ip_monitor:1131: ip_served
> ++ 15:03:14: ip_served:925: '[' -z team0 ']'
> +++ 15:03:14: ip_served:930: find_interface 10.0.0.67 24
> +++ 15:03:14: find_interface:579: local ipaddr=10.0.0.67
> +++ 15:03:14: find_interface:580: local netmask=24
> +++ 15:03:14: find_interface:581: local iface=
> ++++ 15:03:14: find_interface:586: seq 1 1
> +++ 15:03:14: find_interface:586: for i in $(seq 1 $OCF_RESKEY_monitor_retries)
> ++++ 15:03:14: find_interface:590: ip -o -f inet addr show
> ++++ 15:03:14: find_interface:590: grep '\ 10.0.0.67/24'
> ++++ 15:03:14: find_interface:590: cut -d ' ' -f2
> ++++ 15:03:14: find_interface:590: grep -v '^ipsec[0-9][0-9]*$'
> +++ 15:03:14: find_interface:590: iface=team0
> +++ 15:03:14: find_interface:592: '[' -n team0 ']'
> +++ 15:03:14: find_interface:593: break
> +++ 15:03:14: find_interface:601: echo team0
> +++ 15:03:14: find_interface:602: return 0
> ++ 15:03:14: ip_served:930: cur_nic=team0
> ++ 15:03:14: ip_served:932: '[' -z team0 ']'
> ++ 15:03:14: ip_served:937: '[' -z '' ']'
> ++ 15:03:14: ip_served:938: for i in $cur_nic
> ++ 15:03:14: ip_served:940: '[' team0 = team0 ']'
> ++ 15:03:14: ip_served:941: echo ok
> ++ 15:03:14: ip_served:942: return 0
> + 15:03:14: ip_monitor:1131: local ip_status=ok
> + 15:03:14: ip_monitor:1132: case $ip_status in
> + 15:03:14: ip_monitor:1134: run_arp_sender refresh
> + 15:03:14: run_arp_sender:844: '[' xrefresh = xrefresh ']'
> + 15:03:14: run_arp_sender:845: ARP_COUNT=0
> + 15:03:14: run_arp_sender:846: LOGLEVEL=debug
> + 15:03:14: run_arp_sender:851: '[' 0 -eq 0 ']'
> + 15:03:14: run_arp_sender:852: return
> + 15:03:14: ip_monitor:1135: return 0
> 
> Best regards,
> 
> Florent
> 
> From: Users <users-bounces at clusterlabs.org> On Behalf Of Klaus Wenninger
> Sent: Monday, 5 July 2021 09:14
> To: Cluster Labs - All topics related to open-source clustering welcomed <users at clusterlabs.org>
> Subject: Re: [ClusterLabs] Antw: [EXT] VIP monitor Timed Out
> 
> Using DHCP? Maybe a glitch/issue during renewal ... but elaborate monitoring as suggested should show that ...
> 
> On Mon, Jul 5, 2021 at 9:03 AM Ulrich Windl <Ulrich.Windl at rz.uni-regensburg.de> wrote:
> Hi!
> 
> See "ip_served" and "find_interface" (essentially "$IP2UTIL -o -f $FAMILY addr show") in the RA.
> Basically it searches _all_ interfaces for $ipaddr/$netmask to locate the interface, when it could instead examine the interface and look at the address.
> With many interfaces it could make a difference performance-wise, IMHO.
> Maybe do a periodic sampling of how long the corresponding command takes on your setup.
> If it's not a timing issue, the interface may actually be gone temporarily, or the tools could have bugs.
> 
> Regards,
> Ulrich
> 
>>>> PASERO Florent <florent.pasero at externe.bnpparibas.com> wrote on 01.07.2021 at 17:29 in message
> <PR0P264MB21394030D5C5120BB885E95DB4009 at PR0P264MB2139.FRAP264.PROD.OUTLOOK.COM>:
> 
>> Hi,
>>
>> Once or twice a week, we have a 'Timed out' on our VIP:
>> ~$ pcs status
>> Cluster name: zbx_pprod_Web_Core
>> Cluster Summary:
>>  * Stack: corosync
>>  * Current DC: ##### (version 2.0.5-9.el8_4.1-ba59be7122) - partition with quorum
>>  * Last updated: Mon Jun 28 16:32:09 2021
>>  * Last change:  Mon Jun 14 12:42:57 2021 by root via cibadmin on ######
>>  * 2 nodes configured
>>  * 2 resource instances configured
>>
>> Node List:
>>  * Online: [ ##### #####]
>>
>> Full List of Resources:
>>  * Resource Group: zbx_pprod_Web_Core:
>>    * VIP      (ocf::heartbeat:IPaddr2):        Started #####
>>    * ZabbixServer      (systemd:zabbix-server):        Started ######
>>
>> Failed Resource Actions:
>>  * VIP_monitor_5000 on ##### 'error' (1): call=69, status='Timed Out', exitreason='', last-rc-change='2021-06-24 14:41:57 +02:00', queued=0ms, exec=0ms
>>  * VIP_monitor_5000 on ##### 'error' (1): call=11, status='Timed Out', exitreason='', last-rc-change='2021-06-17 14:18:20 +02:00', queued=0ms, exec=0ms
>>
>>
>> We have the same issue on two completely different clusters.
>>
>> We can see in the log:
>> Jun 24 14:41:29 ##### pacemaker-execd    [1442069] (child_timeout_callback)    warning: VIP_monitor_5000 process (PID 2752333) timed out
>> Jun 24 14:41:34 ##### pacemaker-execd    [1442069] (child_timeout_callback)    crit: VIP_monitor_5000 process (PID 2752333) will not die!
>> Jun 24 14:41:57 ##### pacemaker-execd    [1442069] (operation_finished)    warning: VIP_monitor_5000[2752333] timed out after 20000ms
>> Jun 24 14:41:57 ##### pacemaker-controld  [1442072] (process_lrm_event)    error: Result of monitor operation for VIP on #####: Timed Out | call=69 key=VIP_monitor_5000 timeout=20000ms
>> Jun 24 14:41:57 ##### pacemaker-based    [1442067] (cib_process_request)    info: Forwarding cib_modify operation for section status to all (origin=local/crmd/722)
>> Jun 24 14:41:57 ##### pacemaker-based    [1442067] (cib_perform_op)    info: Diff: --- 0.54.443 2
>> Jun 24 14:41:57 ##### pacemaker-based    [1442067] (cib_perform_op)    info: Diff: +++ 0.54.444 (null)
>> Jun 24 14:41:57 ##### pacemaker-based    [1442067] (cib_perform_op)    info: +  /cib:  @num_updates=444
>>
>>
>> Thanks for your help.
>>
> 
> _______________________________________________
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users
> 
> ClusterLabs home: https://www.clusterlabs.org/