[ClusterLabs] Antw: Re: Antw: [EXT] Failed fencing monitor process (fence_vmware_soap) RHEL 8

Fri Jun 19 02:02:19 EDT 2020

>>> Howard <hmoneta at gmail.com> wrote on 18.06.2020 at 19:16 in message
<CAO51vj6EpD__ViGQ3Joqx2hCdgB8uSMne8HkHqg8daoQH8hpHg at mail.gmail.com>:
> Thanks for the replies! I will look at the failure-timeout resource
> attribute and at adjusting the timeout from 20 to 30 seconds. It is funny
> that the 1000000 tries message is symbolic.
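
For readers following along, here is a rough sketch of how the two changes mentioned above (a longer operation timeout and a failure-timeout) might be applied. The values 30s and 600s are only examples, and the helper names run and tune_fence_device are made up for this illustration. The pcs syntax has changed between releases, so treat the "pcs stonith update ... op ..." form as an assumption and check "pcs stonith update --help" on your version. By default the script only prints the commands.

```shell
#!/bin/sh
# Sketch only: DRY_RUN=1 (the default) prints each command instead of running it.
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# tune_fence_device <resource> <operation timeout> <failure-timeout>
# (hypothetical helper, not part of pcs or Pacemaker)
tune_fence_device() {
    rsc=$1
    op_timeout=$2
    failure_timeout=$3
    # Give start and monitor more time than the 20s that expired in the log.
    run pcs stonith update "$rsc" op start timeout="$op_timeout"
    run pcs stonith update "$rsc" op monitor interval=60s timeout="$op_timeout"
    # Let recorded failures expire so the cluster retries on its own.
    run crm_resource --resource "$rsc" --meta \
        --set-parameter failure-timeout --parameter-value "$failure_timeout"
}

tune_fence_device vmfence 30s 600s
```

Running it with DRY_RUN=0 would execute the commands for real, so review the printed output first.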
> 
> It turns out that the VMware host was down temporarily at the time of the
> alerts. I don't know when it came back up, but pcs had already given up
> trying to reestablish the connection.

Out of curiosity: Does that mean your cluster node VM did not run on the VM
host the cluster thought it would? Or was the cluster VM dead as well?

> 
> On Thu, Jun 18, 2020 at 8:25 AM Ken Gaillot <kgaillot at redhat.com> wrote:
> 
>> Note that a failed start of a stonith device will not prevent the
>> cluster from using that device for fencing. It just prevents the
>> cluster from monitoring the device.
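
Since a failed start only stops the cluster from monitoring the device, clearing the recorded failure should be enough to get monitoring going again once the VMware host is reachable. A minimal sketch, assuming the resource is named vmfence as in this thread; run and clear_fence_failures are hypothetical helpers, and the script only prints the commands unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Sketch only: DRY_RUN=1 (the default) prints each command instead of running it.
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# clear_fence_failures <resource> (hypothetical helper)
clear_fence_failures() {
    rsc=$1
    # Show the fail counts first (INFINITY on both nodes in this thread).
    run pcs resource failcount show "$rsc"
    # Forget the failed start/monitor results so the device is started again.
    run pcs resource cleanup "$rsc"
    # Check that the device now shows as Started.
    run pcs status
}

clear_fence_failures vmfence
```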
>>
>> On Thu, 2020-06-18 at 08:20 +0000, Strahil Nikolov wrote:
>> > What about a second fencing mechanism?
>> > You can add a shared (independent) vmdk as an sbd device. The
>> > reconfiguration will require cluster downtime, but this is only
>> > necessary once.
>> > Once two fencing mechanisms are available, you can easily configure
>> > the order in which they are tried.
>> > Best Regards,
>> > Strahil Nikolov
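
To illustrate the ordering part: with two stonith devices configured, fencing levels decide which one is tried first. The sketch below assumes a second device called sbdfence already exists (that name is invented here, it is not in the thread) and that vmfence should be tried first. The helpers run and configure_levels are hypothetical, and nothing is executed unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Sketch only: DRY_RUN=1 (the default) prints each command instead of running it.
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# configure_levels <first device> <second device> <node>...
# (hypothetical helper)
configure_levels() {
    primary=$1
    secondary=$2
    shift 2
    for node in "$@"; do
        # Level 1 is tried first; level 2 only if level 1 fails.
        run pcs stonith level add 1 "$node" "$primary"
        run pcs stonith level add 2 "$node" "$secondary"
    done
}

# "sbdfence" is an assumed name for the sbd-based device.
configure_levels vmfence sbdfence srv1 srv2
```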
>> >
>> > On Thursday, 18 June 2020, 10:29:22 GMT+3, Ulrich Windl <
>> > ulrich.windl at rz.uni-regensburg.de> wrote:
>> >
>> > Hi!
>> >
>> > I can't give much detailed advice, but I think any network service
>> > should have a timeout of at least 30 seconds (you have
>> > timeout=20000ms, i.e. 20 seconds).
>> >
>> > And "after 1000000 failures" is symbolic, not literal: it means the
>> > resource failed too often, so the cluster won't retry.
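
A small illustration of the "symbolic" point: Pacemaker scores saturate at 1000000, which the tools display as INFINITY. That is why the log says "after 1000000 failures" while "pcs resource failcount show" reports INFINITY for the same resource. The function names below are made up for this sketch.

```shell
#!/bin/sh
# Pacemaker's INFINITY score is the number 1000000.
PCMK_INFINITY=1000000

# Render a numeric fail count the way the cluster tools display it.
format_failcount() {
    if [ "$1" -ge "$PCMK_INFINITY" ]; then
        echo INFINITY
    else
        echo "$1"
    fi
}

# Convert a timeout in milliseconds (as shown in the log) to whole seconds.
ms_to_seconds() {
    echo $(( $1 / 1000 ))
}

format_failcount 1000000   # prints INFINITY
format_failcount 3         # prints 3
ms_to_seconds 20000        # prints 20
```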
>> >
>> > Regards,
>> > Ulrich
>> >
>> > >>> Howard <hmoneta at gmail.com> wrote on 17.06.2020 at 21:05 in message
>> > <2817_1592420740_5EEA6983_2817_3_1_CAO51vj6oXjfvhGQz7oOu=Pi+D_cKh5M1gfDL_2tAbKmwmQYeA at mail.gmail.com>:
>> > > Hello, recently I received some really great advice from this community
>> > > regarding changing the token timeout value in corosync. Thank you! Since
>> > > then the cluster has been working perfectly with no errors in the log for
>> > > more than a week.
>> > >
>> > > This morning I logged in to find a stopped stonith device. If I'm reading
>> > > the log right, it looks like it failed 1 million times in ~20 seconds then
>> > > gave up. If you wouldn't mind looking at the logs below, is there some way
>> > > that I can make this more robust so that it can recover? I'll be
>> > > investigating the reason for the timeout but would like to help the system
>> > > recover on its own.
>> > >
>> > > Servers: RHEL 8.2
>> > >
>> > > Cluster name: cluster_pgperf2
>> > > Stack: corosync
>> > > Current DC: srv1 (version 2.0.2-3.el8_1.2-744a30d655) - partition with quorum
>> > > Last updated: Wed Jun 17 11:47:42 2020
>> > > Last change: Tue Jun 16 22:00:29 2020 by root via crm_attribute on srv1
>> > >
>> > > 2 nodes configured
>> > > 4 resources configured
>> > >
>> > > Online: [ srv1 srv2 ]
>> > >
>> > > Full list of resources:
>> > >
>> > >   Clone Set: pgsqld-clone [pgsqld] (promotable)
>> > >       Masters: [ srv1 ]
>> > >       Slaves: [ srv2 ]
>> > >   pgsql-master-ip        (ocf::heartbeat:IPaddr2):      Started srv1
>> > >   vmfence        (stonith:fence_vmware_soap):    Stopped
>> > >
>> > > Failed Resource Actions:
>> > > * vmfence_start_0 on srv2 'OCF_TIMEOUT' (198): call=19, status=Timed Out, exitreason='',
>> > >     last-rc-change='Wed Jun 17 08:34:16 2020', queued=7ms, exec=20184ms
>> > > * vmfence_start_0 on srv1 'OCF_TIMEOUT' (198): call=44, status=Timed Out, exitreason='',
>> > >     last-rc-change='Wed Jun 17 08:33:55 2020', queued=0ms, exec=20008ms
>> > >
>> > > Daemon Status:
>> > >   corosync: active/disabled
>> > >   pacemaker: active/disabled
>> > >   pcsd: active/enabled
>> > >
>> > >   pcs resource config
>> > >   Clone: pgsqld-clone
>> > >   Meta Attrs: notify=true promotable=true
>> > >   Resource: pgsqld (class=ocf provider=heartbeat type=pgsqlms)
>> > >     Attributes: bindir=/usr/bin pgdata=/var/lib/pgsql/data
>> > >     Operations: demote interval=0s timeout=120s (pgsqld-demote-interval-0s)
>> > >                 methods interval=0s timeout=5 (pgsqld-methods-interval-0s)
>> > >                 monitor interval=15s role=Master timeout=60s (pgsqld-monitor-interval-15s)
>> > >                 monitor interval=16s role=Slave timeout=60s (pgsqld-monitor-interval-16s)
>> > >                 notify interval=0s timeout=60s (pgsqld-notify-interval-0s)
>> > >                 promote interval=0s timeout=30s (pgsqld-promote-interval-0s)
>> > >                 reload interval=0s timeout=20 (pgsqld-reload-interval-0s)
>> > >                 start interval=0s timeout=60s (pgsqld-start-interval-0s)
>> > >                 stop interval=0s timeout=60s (pgsqld-stop-interval-0s)
>> > >                 monitor interval=60s timeout=60s (pgsqld-monitor-interval-60s)
>> > >   Resource: pgsql-master-ip (class=ocf provider=heartbeat type=IPaddr2)
>> > >   Attributes: cidr_netmask=24 ip=xxx.xxx.xxx.xxx
>> > >   Operations: monitor interval=10s (pgsql-master-ip-monitor-interval-10s)
>> > >               start interval=0s timeout=20s (pgsql-master-ip-start-interval-0s)
>> > >               stop interval=0s timeout=20s (pgsql-master-ip-stop-interval-0s)
>> > >
>> > > pcs stonith config
>> > >   Resource: vmfence (class=stonith type=fence_vmware_soap)
>> > >   Attributes: ipaddr=xxx.xxx.xxx.xxx login=xxxx\xxxxxxxx passwd_script=xxxxxxxx pcmk_host_map=srv1:xxxxxxxxx;srv2:yyyyyyyyy ssl=1 ssl_insecure=1
>> > >   Operations: monitor interval=60s (vmfence-monitor-interval-60s)
>> > >
>> > > pcs resource failcount show
>> > > Failcounts for resource 'vmfence'
>> > >   srv1: INFINITY
>> > >   srv2: INFINITY
>> > >
>> > > Here are the versions installed:
>> > > [postgres at srv1 cluster]$ rpm -qa|grep "pacemaker\|pcs\|corosync\|fence-agents-vmware-soap\|paf"
>> > > corosync-3.0.2-3.el8_1.1.x86_64
>> > > corosync-qdevice-3.0.0-2.el8.x86_64
>> > > corosync-qnetd-3.0.0-2.el8.x86_64
>> > > corosynclib-3.0.2-3.el8_1.1.x86_64
>> > > fence-agents-vmware-soap-4.2.1-41.el8.noarch
>> > > pacemaker-2.0.2-3.el8_1.2.x86_64
>> > > pacemaker-cli-2.0.2-3.el8_1.2.x86_64
>> > > pacemaker-cluster-libs-2.0.2-3.el8_1.2.x86_64
>> > > pacemaker-libs-2.0.2-3.el8_1.2.x86_64
>> > > pacemaker-schemas-2.0.2-3.el8_1.2.noarch
>> > > pcs-0.10.2-4.el8.x86_64
>> > > resource-agents-paf-2.3.0-1.noarch
>> > >
>> > > Here are the errors and warnings from the pacemaker.log from the first
>> > > warning until it gave up.
>> > >
>> > > /var/log/pacemaker/pacemaker.log:Jun 17 08:33:55 srv1 pacemaker-fenced [26722] (child_timeout_callback) warning: fence_vmware_soap_monitor_1 process (PID 43095) timed out
>> > > /var/log/pacemaker/pacemaker.log:Jun 17 08:33:55 srv1 pacemaker-fenced [26722] (operation_finished) warning: fence_vmware_soap_monitor_1:43095 - timed out after 20000ms
>> > > /var/log/pacemaker/pacemaker.log:Jun 17 08:33:55 srv1 pacemaker-controld [26726] (process_lrm_event) error: Result of monitor operation for vmfence on srv1: Timed Out | call=39 key=vmfence_monitor_60000 timeout=20000ms
>> > > /var/log/pacemaker/pacemaker.log:Jun 17 08:33:55 srv1 pacemaker-schedulerd[26725] (unpack_rsc_op_failure) warning: Processing failed monitor of vmfence on srv1: OCF_TIMEOUT | rc=198
>> > > /var/log/pacemaker/pacemaker.log:Jun 17 08:34:15 srv1 pacemaker-fenced [26722] (child_timeout_callback) warning: fence_vmware_soap_monitor_1 process (PID 43215) timed out
>> > > /var/log/pacemaker/pacemaker.log:Jun 17 08:34:15 srv1 pacemaker-fenced [26722] (operation_finished) warning: fence_vmware_soap_monitor_1:43215 - timed out after 20000ms
>> > > /var/log/pacemaker/pacemaker.log:Jun 17 08:34:15 srv1 pacemaker-controld [26726] (process_lrm_event) error: Result of start operation for vmfence on srv1: Timed Out | call=44 key=vmfence_start_0 timeout=20000ms
>> > > /var/log/pacemaker/pacemaker.log:Jun 17 08:34:15 srv1 pacemaker-controld [26726] (status_from_rc) warning: Action 39 (vmfence_start_0) on srv1 failed (target: 0 vs. rc: 198): Error
>> > > /var/log/pacemaker/pacemaker.log:Jun 17 08:34:15 srv1 pacemaker-schedulerd[26725] (unpack_rsc_op_failure) warning: Processing failed start of vmfence on srv1: OCF_TIMEOUT | rc=198
>> > > /var/log/pacemaker/pacemaker.log:Jun 17 08:34:15 srv1 pacemaker-schedulerd[26725] (unpack_rsc_op_failure) warning: Processing failed start of vmfence on srv1: OCF_TIMEOUT | rc=198
>> > > /var/log/pacemaker/pacemaker.log:Jun 17 08:34:15 srv1 pacemaker-schedulerd[26725] (unpack_rsc_op_failure) warning: Processing failed start of vmfence on srv1: OCF_TIMEOUT | rc=198
>> > > /var/log/pacemaker/pacemaker.log:Jun 17 08:34:15 srv1 pacemaker-schedulerd[26725] (unpack_rsc_op_failure) warning: Processing failed start of vmfence on srv1: OCF_TIMEOUT | rc=198
>> > > /var/log/pacemaker/pacemaker.log:Jun 17 08:34:15 srv1 pacemaker-schedulerd[26725] (check_migration_threshold) warning: Forcing vmfence away from srv1 after 1000000 failures (max=5)
>> > > /var/log/pacemaker/pacemaker.log:Jun 17 08:34:15 srv1 pacemaker-schedulerd[26725] (unpack_rsc_op_failure) warning: Processing failed start of vmfence on srv1: OCF_TIMEOUT | rc=198
>> > > /var/log/pacemaker/pacemaker.log:Jun 17 08:34:15 srv1 pacemaker-schedulerd[26725] (unpack_rsc_op_failure) warning: Processing failed start of vmfence on srv1: OCF_TIMEOUT | rc=198
>> > > /var/log/pacemaker/pacemaker.log:Jun 17 08:34:15 srv1 pacemaker-schedulerd[26725] (check_migration_threshold) warning: Forcing vmfence away from srv1 after 1000000 failures (max=5)
>> > > /var/log/pacemaker/pacemaker.log:Jun 17 08:34:36 srv1 pacemaker-controld [26726] (status_from_rc) warning: Action 38 (vmfence_start_0) on srv2 failed (target: 0 vs. rc: 198): Error
>> > > /var/log/pacemaker/pacemaker.log:Jun 17 08:34:36 srv1 pacemaker-schedulerd[26725] (unpack_rsc_op_failure) warning: Processing failed start of vmfence on srv2: OCF_TIMEOUT | rc=198
>> > > /var/log/pacemaker/pacemaker.log:Jun 17 08:34:36 srv1 pacemaker-schedulerd[26725] (unpack_rsc_op_failure) warning: Processing failed start of vmfence on srv2: OCF_TIMEOUT | rc=198
>> > > /var/log/pacemaker/pacemaker.log:Jun 17 08:34:36 srv1 pacemaker-schedulerd[26725] (unpack_rsc_op_failure) warning: Processing failed start of vmfence on srv1: OCF_TIMEOUT | rc=198
>> > > /var/log/pacemaker/pacemaker.log:Jun 17 08:34:36 srv1 pacemaker-schedulerd[26725] (check_migration_threshold) warning: Forcing vmfence away from srv1 after 1000000 failures (max=5)
>> > > /var/log/pacemaker/pacemaker.log:Jun 17 08:34:36 srv1 pacemaker-schedulerd[26725] (unpack_rsc_op_failure) warning: Processing failed start of vmfence on srv2: OCF_TIMEOUT | rc=198
>> > > /var/log/pacemaker/pacemaker.log:Jun 17 08:34:36 srv1 pacemaker-schedulerd[26725] (unpack_rsc_op_failure) warning: Processing failed start of vmfence on srv2: OCF_TIMEOUT | rc=198
>> > > /var/log/pacemaker/pacemaker.log:Jun 17 08:34:36 srv1 pacemaker-schedulerd[26725] (unpack_rsc_op_failure) warning: Processing failed start of vmfence on srv1: OCF_TIMEOUT | rc=198
>> > > /var/log/pacemaker/pacemaker.log:Jun 17 08:34:36 srv1 pacemaker-schedulerd[26725] (check_migration_threshold) warning: Forcing vmfence away from srv1 after 1000000 failures (max=5)
>> > > /var/log/pacemaker/pacemaker.log:Jun 17 08:34:36 srv1 pacemaker-schedulerd[26725] (check_migration_threshold) warning: Forcing vmfence away from srv2 after 1000000 failures (max=5)
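
The log excerpt above is repetitive, so a small filter can make the sequence easier to see: two agent timeouts at the 20000ms limit, after which the resource is forced away from both nodes. This is only a sketch that matches the message wording shown in this thread; summarize_fence_log is a hypothetical name, and other Pacemaker versions may word these messages differently. The sample input is four lines taken from the excerpt.

```shell
#!/bin/sh
# Summarize fence-agent timeouts and node bans from pacemaker.log lines on stdin.
summarize_fence_log() {
    awk '
        # operation_finished lines end with the limit, e.g. "20000ms".
        /timed out after/ { timeouts++; limit = $NF }
        # check_migration_threshold lines name the node after the word "from".
        /Forcing .* away from/ {
            for (i = 1; i < NF; i++)
                if ($i == "from") banned[$(i + 1)] = 1
        }
        END {
            printf "agent timeouts: %d (limit %s)\n", timeouts, limit
            for (node in banned) printf "banned on: %s\n", node
        }
    '
}

summarize_fence_log <<'EOF'
Jun 17 08:33:55 srv1 pacemaker-fenced [26722] (operation_finished) warning: fence_vmware_soap_monitor_1:43095 - timed out after 20000ms
Jun 17 08:34:15 srv1 pacemaker-fenced [26722] (operation_finished) warning: fence_vmware_soap_monitor_1:43215 - timed out after 20000ms
Jun 17 08:34:15 srv1 pacemaker-schedulerd[26725] (check_migration_threshold) warning: Forcing vmfence away from srv1 after 1000000 failures (max=5)
Jun 17 08:34:36 srv1 pacemaker-schedulerd[26725] (check_migration_threshold) warning: Forcing vmfence away from srv2 after 1000000 failures (max=5)
EOF
```

On a real system the same function could be fed with something like "summarize_fence_log < /var/log/pacemaker/pacemaker.log".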
>> >
>> >
>> >
>> > _______________________________________________
>> > Manage your subscription:
>> > https://lists.clusterlabs.org/mailman/listinfo/users 
>> >
>> > ClusterLabs home: https://www.clusterlabs.org/ 
>> --
>> Ken Gaillot <kgaillot at redhat.com>