[ClusterLabs] Antw: [EXT] Failed fencing monitor process (fence_vmware_soap) RHEL 8
Howard
hmoneta at gmail.com
Thu Jun 18 13:16:14 EDT 2020
Thanks for the replies! I will look at the failure-timeout resource
attribute and at raising the timeout from 20 to 30 seconds (rough
sketch below). Good to know that the "1000000 failures" message is
symbolic rather than literal.
It turns out that the VMware host was down temporarily at the time of
the alerts. I don't know when it came back up, but pcs had already
given up trying to reestablish the connection.
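
Roughly what I have in mind, once I have confirmed the syntax (pcs 0.10
here; the 600s value is just a placeholder, not a recommendation from
the thread):

  # let old failures expire so the cluster retries the stonith device
  # on its own instead of staying at INFINITY forever
  pcs resource meta vmfence failure-timeout=600s
  # clear the current failcounts so vmfence can be started again now
  pcs resource cleanup vmfence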
On Thu, Jun 18, 2020 at 8:25 AM Ken Gaillot <kgaillot at redhat.com> wrote:
> Note that a failed start of a stonith device will not prevent the
> cluster from using that device for fencing. It just prevents the
> cluster from monitoring the device.
>
> On Thu, 2020-06-18 at 08:20 +0000, Strahil Nikolov wrote:
> > What about a second fencing mechanism?
> > You can add a shared (independent) vmdk as an sbd device. The
> > reconfiguration will require cluster downtime, but this is only
> > necessary once.
> > Once two fencing mechanisms are available, you can configure the
> > order easily, as sketched below.
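> >
> > A rough sketch of that, assuming the shared vmdk shows up as /dev/sdX
> > on both nodes (the device name, resource name and exact pcs syntax
> > are examples to adapt, not verified commands):
> >
> >   # initialize the shared disk for sbd and enable sbd cluster-wide
> >   pcs stonith sbd device setup device=/dev/sdX
> >   pcs cluster stop --all
> >   pcs stonith sbd enable device=/dev/sdX
> >   pcs cluster start --all
> >   # register the disk as a second fence device and order the levels
> >   pcs stonith create sbd_fence fence_sbd devices=/dev/sdX
> >   pcs stonith level add 1 srv1 vmfence
> >   pcs stonith level add 2 srv1 sbd_fence
> >   pcs stonith level add 1 srv2 vmfence
> >   pcs stonith level add 2 srv2 sbd_fence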
> > Best Regards,
> > Strahil Nikolov
> >
> > On Thursday, 18 June 2020, 10:29:22 GMT+3, Ulrich Windl <
> > ulrich.windl at rz.uni-regensburg.de> wrote:
> >
> > Hi!
> >
> > I can't give much detailed advice, but I think any network service
> > should have a timeout of at least 30 seconds (you have
> > timeout=20000ms).
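> >
> > With pcs, something like this should raise the monitor timeout
> > (syntax not verified here, adjust to your pcs version); note the
> > start operation, which also timed out at 20000ms, keeps the default
> > unless an explicit start op with a longer timeout is added:
> >
> >   pcs resource update vmfence op monitor interval=60s timeout=30s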
> >
> > And "after 1000000 failures" is symbolic, not literal: It means it
> > failed too often, so I won't retry.
> >
> > Regards,
> > Ulrich
> >
> > > > > Howard <hmoneta at gmail.com> wrote on 17.06.2020 at 21:05 in message
> >
> > <2817_1592420740_5EEA6983_2817_3_1_CAO51vj6oXjfvhGQz7oOu=Pi+D_cKh5M1gfDL_2tAbKmwmQYeA at mail.gmail.com>:
> > > Hello, recently I received some really great advice from this
> > > community
> > > regarding changing the token timeout value in corosync. Thank you!
> > > Since
> > > then the cluster has been working perfectly with no errors in the
> > > log for
> > > more than a week.
> > >
> > > This morning I logged in to find a stopped stonith device. If I'm
> > > reading
> > > the log right, it looks like it failed 1 million times in ~20
> > > seconds then
> > > gave up. If you wouldn't mind looking at the logs below, is there
> > > some way
> > > that I can make this more robust so that it can recover? I'll be
> > > investigating the reason for the timeout but would like to help the
> > > system
> > > recover on its own.
> > >
> > > Servers: RHEL 8.2
> > >
> > > Cluster name: cluster_pgperf2
> > > Stack: corosync
> > > Current DC: srv1 (version 2.0.2-3.el8_1.2-744a30d655) - partition with quorum
> > > Last updated: Wed Jun 17 11:47:42 2020
> > > Last change: Tue Jun 16 22:00:29 2020 by root via crm_attribute on srv1
> > >
> > > 2 nodes configured
> > > 4 resources configured
> > >
> > > Online: [ srv1 srv2 ]
> > >
> > > Full list of resources:
> > >
> > > Clone Set: pgsqld-clone [pgsqld] (promotable)
> > >     Masters: [ srv1 ]
> > >     Slaves: [ srv2 ]
> > > pgsql-master-ip (ocf::heartbeat:IPaddr2): Started srv1
> > > vmfence (stonith:fence_vmware_soap): Stopped
> > >
> > > Failed Resource Actions:
> > > * vmfence_start_0 on srv2 'OCF_TIMEOUT' (198): call=19, status=Timed Out, exitreason='', last-rc-change='Wed Jun 17 08:34:16 2020', queued=7ms, exec=20184ms
> > > * vmfence_start_0 on srv1 'OCF_TIMEOUT' (198): call=44, status=Timed Out, exitreason='', last-rc-change='Wed Jun 17 08:33:55 2020', queued=0ms, exec=20008ms
> > >
> > > Daemon Status:
> > > corosync: active/disabled
> > > pacemaker: active/disabled
> > > pcsd: active/enabled
> > >
> > > pcs resource config
> > > Clone: pgsqld-clone
> > >  Meta Attrs: notify=true promotable=true
> > >  Resource: pgsqld (class=ocf provider=heartbeat type=pgsqlms)
> > >   Attributes: bindir=/usr/bin pgdata=/var/lib/pgsql/data
> > >   Operations: demote interval=0s timeout=120s (pgsqld-demote-interval-0s)
> > >               methods interval=0s timeout=5 (pgsqld-methods-interval-0s)
> > >               monitor interval=15s role=Master timeout=60s (pgsqld-monitor-interval-15s)
> > >               monitor interval=16s role=Slave timeout=60s (pgsqld-monitor-interval-16s)
> > >               notify interval=0s timeout=60s (pgsqld-notify-interval-0s)
> > >               promote interval=0s timeout=30s (pgsqld-promote-interval-0s)
> > >               reload interval=0s timeout=20 (pgsqld-reload-interval-0s)
> > >               start interval=0s timeout=60s (pgsqld-start-interval-0s)
> > >               stop interval=0s timeout=60s (pgsqld-stop-interval-0s)
> > >               monitor interval=60s timeout=60s (pgsqld-monitor-interval-60s)
> > > Resource: pgsql-master-ip (class=ocf provider=heartbeat type=IPaddr2)
> > >  Attributes: cidr_netmask=24 ip=xxx.xxx.xxx.xxx
> > >  Operations: monitor interval=10s (pgsql-master-ip-monitor-interval-10s)
> > >              start interval=0s timeout=20s (pgsql-master-ip-start-interval-0s)
> > >              stop interval=0s timeout=20s (pgsql-master-ip-stop-interval-0s)
> > >
> > > pcs stonith config
> > > Resource: vmfence (class=stonith type=fence_vmware_soap)
> > >  Attributes: ipaddr=xxx.xxx.xxx.xxx login=xxxx\xxxxxxxx passwd_script=xxxxxxxx pcmk_host_map=srv1:xxxxxxxxx;srv2:yyyyyyyyy ssl=1 ssl_insecure=1
> > >  Operations: monitor interval=60s (vmfence-monitor-interval-60s)
> > >
> > > pcs resource failcount show
> > > Failcounts for resource 'vmfence'
> > > srv1: INFINITY
> > > srv2: INFINITY
> > >
> > > Here are the versions installed:
> > > [postgres at srv1 cluster]$ rpm -qa|grep
> > > "pacemaker\|pcs\|corosync\|fence-agents-vmware-soap\|paf"
> > > corosync-3.0.2-3.el8_1.1.x86_64
> > > corosync-qdevice-3.0.0-2.el8.x86_64
> > > corosync-qnetd-3.0.0-2.el8.x86_64
> > > corosynclib-3.0.2-3.el8_1.1.x86_64
> > > fence-agents-vmware-soap-4.2.1-41.el8.noarch
> > > pacemaker-2.0.2-3.el8_1.2.x86_64
> > > pacemaker-cli-2.0.2-3.el8_1.2.x86_64
> > > pacemaker-cluster-libs-2.0.2-3.el8_1.2.x86_64
> > > pacemaker-libs-2.0.2-3.el8_1.2.x86_64
> > > pacemaker-schemas-2.0.2-3.el8_1.2.noarch
> > > pcs-0.10.2-4.el8.x86_64
> > > resource-agents-paf-2.3.0-1.noarch
> > >
> > > Here are the errors and warnings from the pacemaker.log from the
> > > first
> > > warning until it gave up.
> > >
> > > /var/log/pacemaker/pacemaker.log:Jun 17 08:33:55 srv1 pacemaker-fenced [26722] (child_timeout_callback) warning: fence_vmware_soap_monitor_1 process (PID 43095) timed out
> > > /var/log/pacemaker/pacemaker.log:Jun 17 08:33:55 srv1 pacemaker-fenced [26722] (operation_finished) warning: fence_vmware_soap_monitor_1:43095 - timed out after 20000ms
> > > /var/log/pacemaker/pacemaker.log:Jun 17 08:33:55 srv1 pacemaker-controld [26726] (process_lrm_event) error: Result of monitor operation for vmfence on srv1: Timed Out | call=39 key=vmfence_monitor_60000 timeout=20000ms
> > > /var/log/pacemaker/pacemaker.log:Jun 17 08:33:55 srv1 pacemaker-schedulerd[26725] (unpack_rsc_op_failure) warning: Processing failed monitor of vmfence on srv1: OCF_TIMEOUT | rc=198
> > > /var/log/pacemaker/pacemaker.log:Jun 17 08:34:15 srv1 pacemaker-fenced [26722] (child_timeout_callback) warning: fence_vmware_soap_monitor_1 process (PID 43215) timed out
> > > /var/log/pacemaker/pacemaker.log:Jun 17 08:34:15 srv1 pacemaker-fenced [26722] (operation_finished) warning: fence_vmware_soap_monitor_1:43215 - timed out after 20000ms
> > > /var/log/pacemaker/pacemaker.log:Jun 17 08:34:15 srv1 pacemaker-controld [26726] (process_lrm_event) error: Result of start operation for vmfence on srv1: Timed Out | call=44 key=vmfence_start_0 timeout=20000ms
> > > /var/log/pacemaker/pacemaker.log:Jun 17 08:34:15 srv1 pacemaker-controld [26726] (status_from_rc) warning: Action 39 (vmfence_start_0) on srv1 failed (target: 0 vs. rc: 198): Error
> > > /var/log/pacemaker/pacemaker.log:Jun 17 08:34:15 srv1 pacemaker-schedulerd[26725] (unpack_rsc_op_failure) warning: Processing failed start of vmfence on srv1: OCF_TIMEOUT | rc=198
> > > /var/log/pacemaker/pacemaker.log:Jun 17 08:34:15 srv1 pacemaker-schedulerd[26725] (unpack_rsc_op_failure) warning: Processing failed start of vmfence on srv1: OCF_TIMEOUT | rc=198
> > > /var/log/pacemaker/pacemaker.log:Jun 17 08:34:15 srv1 pacemaker-schedulerd[26725] (unpack_rsc_op_failure) warning: Processing failed start of vmfence on srv1: OCF_TIMEOUT | rc=198
> > > /var/log/pacemaker/pacemaker.log:Jun 17 08:34:15 srv1 pacemaker-schedulerd[26725] (unpack_rsc_op_failure) warning: Processing failed start of vmfence on srv1: OCF_TIMEOUT | rc=198
> > > /var/log/pacemaker/pacemaker.log:Jun 17 08:34:15 srv1 pacemaker-schedulerd[26725] (check_migration_threshold) warning: Forcing vmfence away from srv1 after 1000000 failures (max=5)
> > > /var/log/pacemaker/pacemaker.log:Jun 17 08:34:15 srv1 pacemaker-schedulerd[26725] (unpack_rsc_op_failure) warning: Processing failed start of vmfence on srv1: OCF_TIMEOUT | rc=198
> > > /var/log/pacemaker/pacemaker.log:Jun 17 08:34:15 srv1 pacemaker-schedulerd[26725] (unpack_rsc_op_failure) warning: Processing failed start of vmfence on srv1: OCF_TIMEOUT | rc=198
> > > /var/log/pacemaker/pacemaker.log:Jun 17 08:34:15 srv1 pacemaker-schedulerd[26725] (check_migration_threshold) warning: Forcing vmfence away from srv1 after 1000000 failures (max=5)
> > > /var/log/pacemaker/pacemaker.log:Jun 17 08:34:36 srv1 pacemaker-controld [26726] (status_from_rc) warning: Action 38 (vmfence_start_0) on srv2 failed (target: 0 vs. rc: 198): Error
> > > /var/log/pacemaker/pacemaker.log:Jun 17 08:34:36 srv1 pacemaker-schedulerd[26725] (unpack_rsc_op_failure) warning: Processing failed start of vmfence on srv2: OCF_TIMEOUT | rc=198
> > > /var/log/pacemaker/pacemaker.log:Jun 17 08:34:36 srv1 pacemaker-schedulerd[26725] (unpack_rsc_op_failure) warning: Processing failed start of vmfence on srv2: OCF_TIMEOUT | rc=198
> > > /var/log/pacemaker/pacemaker.log:Jun 17 08:34:36 srv1 pacemaker-schedulerd[26725] (unpack_rsc_op_failure) warning: Processing failed start of vmfence on srv1: OCF_TIMEOUT | rc=198
> > > /var/log/pacemaker/pacemaker.log:Jun 17 08:34:36 srv1 pacemaker-schedulerd[26725] (check_migration_threshold) warning: Forcing vmfence away from srv1 after 1000000 failures (max=5)
> > > /var/log/pacemaker/pacemaker.log:Jun 17 08:34:36 srv1 pacemaker-schedulerd[26725] (unpack_rsc_op_failure) warning: Processing failed start of vmfence on srv2: OCF_TIMEOUT | rc=198
> > > /var/log/pacemaker/pacemaker.log:Jun 17 08:34:36 srv1 pacemaker-schedulerd[26725] (unpack_rsc_op_failure) warning: Processing failed start of vmfence on srv2: OCF_TIMEOUT | rc=198
> > > /var/log/pacemaker/pacemaker.log:Jun 17 08:34:36 srv1 pacemaker-schedulerd[26725] (unpack_rsc_op_failure) warning: Processing failed start of vmfence on srv1: OCF_TIMEOUT | rc=198
> > > /var/log/pacemaker/pacemaker.log:Jun 17 08:34:36 srv1 pacemaker-schedulerd[26725] (check_migration_threshold) warning: Forcing vmfence away from srv1 after 1000000 failures (max=5)
> > > /var/log/pacemaker/pacemaker.log:Jun 17 08:34:36 srv1 pacemaker-schedulerd[26725] (check_migration_threshold) warning: Forcing vmfence away from srv2 after 1000000 failures (max=5)
> >
> --
> Ken Gaillot <kgaillot at redhat.com>
>
> _______________________________________________
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users
>
> ClusterLabs home: https://www.clusterlabs.org/
>