[ClusterLabs] connection timed out fence_virsh monitor stonith

Mon Feb 24 12:17:02 EST 2020

On February 24, 2020 4:56:07 PM GMT+02:00, Luke Camilleri <luke.camilleri at zylacomputing.com> wrote:
>Hello users, I would like to ask for assistance with the setup below,
>mainly with the fence monitor timeout:
>
>#pcs --version
>0.9.167
>
>#pacemakerd --version
>Pacemaker 1.1.20-5.el7_7.2
>
>#corosync -v
>Corosync Cluster Engine, version '2.4.3'
>Copyright (c) 2006-2009 Red Hat, Inc.
>
># cat /etc/redhat-release
>CentOS Linux release 7.7.1908 (Core)
>
>I have set up a 2-node Axigen mail server cluster with 1 resource group
>(containing 3 resources) and 2 fence devices (1 for each node).
>
>the hosts file on the nodes is as follows:
>
>#KVM Management nodes
>10.1.4.31 zc-infra-mgmt-node-1
>10.1.4.20 zc-infra-mgmt-node-2
>
>#Service Network
>10.1.4.22        zc-mail-1.domain.com zc-mail-1
>10.1.4.23        zc-mail-2.domain.com zc-mail-2
>
>#High-Availability Network (cross-over link)
>192.168.1.22     zc-mail-1-ha.domain.local zc-mail-1-ha
>192.168.1.23     zc-mail-2-ha.domain.local zc-mail-2-ha
>
>The routable network is 10.1.4.0/24. A VIP (10.1.4.24) is set up as part
>of the HA cluster resources (IPaddr2).
>
>the resources are described as follows:
>
># pcs resource show --full
>
> Group: zc-mail-res-group
>
>Resource: zc-mail-ha-Cfs (class=ocf provider=heartbeat type=Filesystem)
>Attributes: device=10.1.3.11:6789,10.1.3.12:6789,10.1.3.13:6789:/
>directory=/var/clusterfs/data/axigen fstype=ceph
>options=name=email,secretfile=/etc/ceph/ceph.key
>statusfile_prefix=ceph_fs_checks_
>Operations: monitor interval=120s
>(zc-mail-ha-Cfs-monitor-interval-120s)
>    notify interval=0s timeout=120s (zc-mail-ha-Cfs-notify-interval-0s)
>      start interval=0s timeout=120s (zc-mail-ha-Cfs-start-interval-0s)
>        stop interval=0s timeout=120s (zc-mail-ha-Cfs-stop-interval-0s)
>
>Resource: zc-mail-ha-vip (class=ocf provider=heartbeat type=IPaddr2)
>   Attributes: cidr_netmask=24 ip=10.1.4.24
>Operations: monitor interval=120s
>(zc-mail-ha-vip-monitor-interval-120s)
>      start interval=0s timeout=120s (zc-mail-ha-vip-start-interval-0s)
>        stop interval=0s timeout=120s (zc-mail-ha-vip-stop-interval-0s)
>
>Resource: zc-mail-ha-svc (class=lsb type=axigen)
>   Meta Attrs: is-managed=true target-role=Started
>Operations: force-reload interval=0s timeout=60
>(zc-mail-ha-svc-force-reload-interval-0s)
>monitor interval=30s timeout=120s OCF_CHECK_LEVEL=20
>(zc-mail-ha-svc-monitor-interval-30s)
>  restart interval=0s timeout=120s (zc-mail-ha-svc-restart-interval-0s)
>      start interval=0s timeout=120s (zc-mail-ha-svc-start-interval-0s)
>        stop interval=0s timeout=120s (zc-mail-ha-svc-stop-interval-0s)
>
># pcs stonith show --full
>
> Resource: fence_zc-mail-1_virsh (class=stonith type=fence_virsh)
>Attributes: delay=0 identity_file=/home/lcami/.ssh/id_rsa
>ipaddr=zc-infra-mgmt-node-1 login=lcami login_timeout=20
>pcmk_host_check=static-list pcmk_host_list=zc-mail-1-ha
>pcmk_host_map=zc-mail-1-ha:zc-infra-mgmt-node-1 port=Axigen-Mail-1
>sudo=1
>Operations: monitor interval=60s
>(fence_zc-mail-1_virsh-monitor-interval-60s)
>
> Resource: fence_zc-mail-2_virsh (class=stonith type=fence_virsh)
>Attributes: identity_file=/home/lcami/.ssh/id_rsa
>ipaddr=zc-infra-mgmt-node-2 login=lcami login_timeout=20
>pcmk_host_check=static-list pcmk_host_list=zc-mail-2-ha
>pcmk_host_map=zc-mail-2-ha:zc-infra-mgmt-node-2 port=Axigen-Mail-2
>sudo=1
>Operations: monitor interval=60s
>(fence_zc-mail-2_virsh-monitor-interval-60s)
>
>Every couple of days I used to receive the following error:
>
>Feb 16 00:00:24 [2051] zc-mail-2.zylacloud.com stonith-ng:   notice:
>operation_finished: fence_virsh_monitor_1:21995:stderr [ 2020-02-16
>00:00:23,996 ERROR: Connection timed out ]
>Feb 16 00:00:24 [2051] zc-mail-2.zylacloud.com stonith-ng:   notice:
>operation_finished: fence_virsh_monitor_1:21995:stderr [  ]
>Feb 16 00:00:24 [2051] zc-mail-2.zylacloud.com stonith-ng:   notice:
>operation_finished: fence_virsh_monitor_1:21995:stderr [  ]
>Feb 16 00:00:24 [2051] zc-mail-2.zylacloud.com stonith-ng:  warning:
>log_action: fence_virsh[21995] stderr: [ 2020-02-16 00:00:23,996 ERROR:
>Connection timed out ]
>Feb 16 00:00:24 [2051] zc-mail-2.zylacloud.com stonith-ng:  warning:
>log_action: fence_virsh[21995] stderr: [  ]
>Feb 16 00:00:24 [2051] zc-mail-2.zylacloud.com stonith-ng:  warning:
>log_action: fence_virsh[21995] stderr: [  ]
>Feb 16 00:00:24 [2051] zc-mail-2.zylacloud.com stonith-ng:   notice:
>log_operation: Operation 'monitor' [21995] for device
>'fence_zc-mail-1_virsh' returned: -62 (Timer expired)
>Feb 16 00:00:24 [2052] zc-mail-2.zylacloud.com       lrmd:     info:
>log_finished: finished - rsc:fence_zc-mail-1_virsh action:start
>call_id:85  exit-code:1 exec-time:5449ms queue-time:0ms
>
>which I concluded was a problem with the login timeout (which was 5
>seconds).
>
>I have therefore increased this timeout to 20 seconds, but the timeout
>persisted:
>
>Feb 23 00:00:21 [24633] zc-mail-2.zylacloud.com stonith-ng:   notice:
>operation_finished: fence_virsh_monitor_1:20006:stderr [ 2020-02-23
>00:00:21,102 ERROR: Connection timed out ]
>Feb 23 00:00:21 [24633] zc-mail-2.zylacloud.com stonith-ng:   notice:
>operation_finished: fence_virsh_monitor_1:20006:stderr [  ]
>Feb 23 00:00:21 [24633] zc-mail-2.zylacloud.com stonith-ng:   notice:
>operation_finished: fence_virsh_monitor_1:20006:stderr [  ]
>Feb 23 00:00:21 [24633] zc-mail-2.zylacloud.com stonith-ng:  warning:
>log_action: fence_virsh[20006] stderr: [ 2020-02-23 00:00:21,102 ERROR:
>Connection timed out ]
>Feb 23 00:00:21 [24633] zc-mail-2.zylacloud.com stonith-ng:  warning:
>log_action: fence_virsh[20006] stderr: [  ]
>Feb 23 00:00:21 [24633] zc-mail-2.zylacloud.com stonith-ng:  warning:
>log_action: fence_virsh[20006] stderr: [  ]
>Feb 23 00:00:21 [24633] zc-mail-2.zylacloud.com stonith-ng:   notice:
>log_operation: Operation 'monitor' [20006] for device
>'fence_zc-mail-1_virsh' returned: -62 (Timer expired)
>Feb 23 00:00:21 [24637] zc-mail-2.zylacloud.com       crmd:    error:
>process_lrm_event: Result of monitor operation for
>fence_zc-mail-1_virsh on zc-mail-2-ha: Timed Out | call=30
>key=fence_zc-mail-1_virsh_monitor_60000 timeout=20000ms
>
>There is also a constraint, shown below, so that each fencing "agent"
>runs on the node opposite the one it would restart:
>
># pcs constraint show --full
>
>Location Constraints:
>
>  Resource: fence_zc-mail-1_virsh
>Enabled on: zc-mail-2-ha (score:INFINITY) (role: Started)
>(id:cli-prefer-fence_zc-mail-1_virsh)
>
>  Resource: fence_zc-mail-2_virsh
>Enabled on: zc-mail-1-ha (score:INFINITY) (role: Started)
>(id:cli-prefer-fence_zc-mail-2_virsh)
>
>Ordering Constraints:
>
>Colocation Constraints:
>
>Ticket Constraints:

I notice that the issue happens at 00:00 on both days.
Have you checked for a backup or other cron job that is 'overloading' the virtualization host?
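
Both failures quoted above (Feb 16 and Feb 23, 2020) fall on a Sunday at 00:00, which is when cron's @weekly nickname (0 0 * * 0) fires, so a weekly job is one plausible candidate. A minimal sketch for listing crontab entries that start at midnight, assuming standard crontab syntax. The helper name midnight_jobs is hypothetical, and it only catches a literal 0 in the minute and hour fields or the nicknames that run at midnight (it ignores @hourly and ranges or lists such as 0,30):

```shell
# Hypothetical helper: read crontab lines on stdin and print the entries
# that start at 00:00. Comment lines never match, because their first
# field starts with "#".
midnight_jobs() {
    awk '
        # Nicknames that cron runs at 00:00
        $1 ~ /^@(daily|midnight|weekly|monthly|yearly|annually)$/ { print; next }
        # Literal "0 0 ..." (or "00 00 ...") in the minute and hour fields
        $1 ~ /^0+$/ && $2 ~ /^0+$/ { print }
    '
}

# Possible usage on each KVM host (the usual CentOS 7 locations):
#   crontab -l -u root | midnight_jobs
#   cat /etc/crontab /etc/cron.d/* | midnight_jobs
```

Jobs started indirectly (for example through anacron or systemd timers) would not show up here, so an empty result does not rule out a scheduled task.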

Anything in the libvirt logs or in the hosts' /var/log/messages?
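
To line up host-side log entries with the failure time, one option is to filter syslog-style lines to a window around midnight. A rough sketch, assuming the traditional "Mon DD HH:MM:SS host ..." line format used by /var/log/messages. The function name around_midnight and the 23:55 to 00:05 window are arbitrary choices, not anything taken from the cluster configuration:

```shell
# Hypothetical helper: keep only syslog-style lines logged between
# 23:55:00 and 00:05:00. Field 3 is the HH:MM:SS timestamp; because it is
# zero-padded, a plain string comparison orders it correctly.
around_midnight() {
    awk '
        $3 ~ /^[0-9][0-9]:[0-9][0-9]:[0-9][0-9]$/ &&
        ($3 >= "23:55:00" || $3 <= "00:05:00")
    '
}

# Possible usage on each KVM host:
#   around_midnight < /var/log/messages
```

Lines without a timestamp in the third field (such as continuation lines) are dropped, so this is only a first pass before reading the full log around any hits.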

Best Regards,
Strahil Nikolov