<div><div dir="auto">Thanks for the replies! I will look at the <span style="color:rgb(49,49,49);font-size:16px;word-spacing:1px">failure-timeout resource attribute and at adjusting the timeout from 20 to 30 seconds. It is funny that the 1000000 tries message is symbolic. </span></div><div dir="auto"><span style="color:rgb(49,49,49);font-size:16px;word-spacing:1px"><br></span></div><div dir="auto"><span style="color:rgb(49,49,49);font-size:16px;word-spacing:1px">It turns out that the VMware host was down temporarily at the time of the alerts. I don't know when It came back up but pcs had already given up trying to reestablish the connection. </span></div></div><div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Thu, Jun 18, 2020 at 8:25 AM Ken Gaillot <<a href="mailto:kgaillot@redhat.com">kgaillot@redhat.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Note that a failed start of a stonith device will not prevent the<br>
cluster from using that device for fencing. It just prevents the<br>
cluster from monitoring the device.<br>
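<br>
A minimal sketch of the failure-timeout and cleanup approach mentioned above (the resource name comes from this thread, the value is illustrative, and pcs handling of stonith resources should be verified against your pcs version):<br>
<br>
# let old failures expire so the cluster retries the device on its own<br>
pcs resource meta vmfence failure-timeout=300s<br>
# clear the existing INFINITY failcounts immediately<br>
pcs resource cleanup vmfence<br>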
<br>
On Thu, 2020-06-18 at 08:20 +0000, Strahil Nikolov wrote:<br>
> What about a second fencing mechanism?<br>
> You can add a shared (independent) vmdk as an sbd device. The<br>
> reconfiguration will require cluster downtime, but this is only<br>
> necessary once.<br>
> Once two fencing mechanisms are available, you can configure the order<br>
> easily.<br>
> Best Regards,<br>
> Strahil Nikolov<br>
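> <br>
> A minimal sketch of that setup, assuming a shared disk visible to both nodes at /dev/disk/by-id/SHARED_VMDK (a placeholder) and the fence_sbd agent installed; the exact pcs syntax should be verified against your pcs version:<br>
> <br>
> # initialise and enable sbd on the shared disk (requires cluster downtime, as noted above)<br>
> pcs stonith sbd device setup device=/dev/disk/by-id/SHARED_VMDK<br>
> pcs stonith sbd enable device=/dev/disk/by-id/SHARED_VMDK<br>
> # poison-pill fencing via the shared disk<br>
> pcs stonith create sbdfence fence_sbd devices=/dev/disk/by-id/SHARED_VMDK<br>
> # fencing order: try vmfence first, fall back to sbd<br>
> pcs stonith level add 1 srv1 vmfence<br>
> pcs stonith level add 2 srv1 sbdfence<br>
> pcs stonith level add 1 srv2 vmfence<br>
> pcs stonith level add 2 srv2 sbdfence<br>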
> <br>
> On Thursday, June 18, 2020 at 10:29:22 GMT+3, Ulrich Windl <<br>
> <a href="mailto:ulrich.windl@rz.uni-regensburg.de" target="_blank">ulrich.windl@rz.uni-regensburg.de</a>> wrote: <br>
> <br>
> Hi!<br>
> <br>
> I can't give much detailed advice, but I think any network service<br>
> should have a timeout of at least 30 seconds (you have<br>
> timeout=20000ms).<br>
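> <br>
> For example (syntax assumed from pcs resource/stonith update; values are illustrative and should be checked against your pcs version):<br>
> <br>
> # raise the fence device's monitor timeout from 20s to 30s<br>
> pcs resource update vmfence op monitor interval=60s timeout=30s<br>
> # or give the fence agent itself more time via the stonith device attribute<br>
> pcs stonith update vmfence pcmk_monitor_timeout=60s<br>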
> <br>
> And "after 1000000 failures" is symbolic, not literal: It means it<br>
> failed too often, so I won't retry.<br>
> <br>
> Regards,<br>
> Ulrich<br>
> <br>
> > > > Howard <<a href="mailto:hmoneta@gmail.com" target="_blank">hmoneta@gmail.com</a>> wrote on 17.06.2020 at 21:05 in<br>
> > > > message<br>
> <br>
> <2817_1592420740_5EEA6983_2817_3_1_CAO51vj6oXjfvhGQz7oOu=Pi+D_cKh5M1g<br>
> fDL_2tAbKmw<br>
> <a href="mailto:mQYeA@mail.gmail.com" target="_blank">mQYeA@mail.gmail.com</a>>:<br>
> > Hello, recently I received some really great advice from this<br>
> > community<br>
> > regarding changing the token timeout value in corosync. Thank you!<br>
> > Since<br>
> > then the cluster has been working perfectly with no errors in the<br>
> > log for<br>
> > more than a week.<br>
> > <br>
> > This morning I logged in to find a stopped stonith device. If I'm<br>
> > reading<br>
> > the log right, it looks like it failed 1 million times in ~20<br>
> > seconds then<br>
> > gave up. If you wouldn't mind looking at the logs below, is there<br>
> > some way<br>
> > that I can make this more robust so that it can recover? I'll be<br>
> > investigating the reason for the timeout but would like to help the<br>
> > system<br>
> > recover on its own.<br>
> > <br>
> > Servers: RHEL 8.2<br>
> > <br>
> > Cluster name: cluster_pgperf2<br>
> > Stack: corosync<br>
> > Current DC: srv1 (version 2.0.2-3.el8_1.2-744a30d655) - partition<br>
> > with<br>
> > quorum<br>
> > Last updated: Wed Jun 17 11:47:42 2020<br>
> > Last change: Tue Jun 16 22:00:29 2020 by root via crm_attribute on<br>
> > srv1<br>
> > <br>
> > 2 nodes configured<br>
> > 4 resources configured<br>
> > <br>
> > Online: [ srv1 srv2 ]<br>
> > <br>
> > Full list of resources:<br>
> > <br>
> > Clone Set: pgsqld-clone [pgsqld] (promotable)<br>
> > Masters: [ srv1 ]<br>
> > Slaves: [ srv2 ]<br>
> > pgsql-master-ip (ocf::heartbeat:IPaddr2): Started<br>
> > srv1<br>
> > vmfence (stonith:fence_vmware_soap): Stopped<br>
> > <br>
> > Failed Resource Actions:<br>
> > * vmfence_start_0 on srv2 'OCF_TIMEOUT' (198): call=19,<br>
> > status=Timed Out,<br>
> > exitreason='',<br>
> > last-rc-change='Wed Jun 17 08:34:16 2020', queued=7ms,<br>
> > exec=20184ms<br>
> > * vmfence_start_0 on srv1 'OCF_TIMEOUT' (198): call=44,<br>
> > status=Timed Out,<br>
> > exitreason='',<br>
> > last-rc-change='Wed Jun 17 08:33:55 2020', queued=0ms,<br>
> > exec=20008ms<br>
> > <br>
> > Daemon Status:<br>
> > corosync: active/disabled<br>
> > pacemaker: active/disabled<br>
> > pcsd: active/enabled<br>
> > <br>
> > pcs resource config<br>
> > Clone: pgsqld-clone<br>
> > Meta Attrs: notify=true promotable=true<br>
> > Resource: pgsqld (class=ocf provider=heartbeat type=pgsqlms)<br>
> > Attributes: bindir=/usr/bin pgdata=/var/lib/pgsql/data<br>
> > Operations: demote interval=0s timeout=120s (pgsqld-demote-<br>
> > interval-0s)<br>
> > methods interval=0s timeout=5 (pgsqld-methods-<br>
> > interval-0s)<br>
> > monitor interval=15s role=Master timeout=60s<br>
> > (pgsqld-monitor-interval-15s)<br>
> > monitor interval=16s role=Slave timeout=60s<br>
> > (pgsqld-monitor-interval-16s)<br>
> > notify interval=0s timeout=60s (pgsqld-notify-<br>
> > interval-0s)<br>
> > promote interval=0s timeout=30s (pgsqld-promote-<br>
> > interval-0s)<br>
> > reload interval=0s timeout=20 (pgsqld-reload-<br>
> > interval-0s)<br>
> > start interval=0s timeout=60s (pgsqld-start-<br>
> > interval-0s)<br>
> > stop interval=0s timeout=60s (pgsqld-stop-interval-<br>
> > 0s)<br>
> > monitor interval=60s timeout=60s<br>
> > (pgsqld-monitor-interval-60s)<br>
> > Resource: pgsql-master-ip (class=ocf provider=heartbeat<br>
> > type=IPaddr2)<br>
> > Attributes: cidr_netmask=24 ip=xxx.xxx.xxx.xxx<br>
> > Operations: monitor interval=10s (pgsql-master-ip-monitor-<br>
> > interval-10s)<br>
> > start interval=0s timeout=20s<br>
> > (pgsql-master-ip-start-interval-0s)<br>
> > stop interval=0s timeout=20s<br>
> > (pgsql-master-ip-stop-interval-0s)<br>
> > <br>
> > pcs stonith config<br>
> > Resource: vmfence (class=stonith type=fence_vmware_soap)<br>
> > Attributes: ipaddr=xxx.xxx.xxx.xxx login=xxxx\xxxxxxxx<br>
> > passwd_script=xxxxxxxx pcmk_host_map=srv1:xxxxxxxxx;srv2:yyyyyyyyy<br>
> > ssl=1<br>
> > ssl_insecure=1<br>
> > Operations: monitor interval=60s (vmfence-monitor-interval-60s)<br>
> > <br>
> > pcs resource failcount show<br>
> > Failcounts for resource 'vmfence'<br>
> > srv1: INFINITY<br>
> > srv2: INFINITY<br>
> > <br>
> > Here are the versions installed:<br>
> > [postgres@srv1 cluster]$ rpm -qa|grep<br>
> > "pacemaker\|pcs\|corosync\|fence-agents-vmware-soap\|paf"<br>
> > corosync-3.0.2-3.el8_1.1.x86_64<br>
> > corosync-qdevice-3.0.0-2.el8.x86_64<br>
> > corosync-qnetd-3.0.0-2.el8.x86_64<br>
> > corosynclib-3.0.2-3.el8_1.1.x86_64<br>
> > fence-agents-vmware-soap-4.2.1-41.el8.noarch<br>
> > pacemaker-2.0.2-3.el8_1.2.x86_64<br>
> > pacemaker-cli-2.0.2-3.el8_1.2.x86_64<br>
> > pacemaker-cluster-libs-2.0.2-3.el8_1.2.x86_64<br>
> > pacemaker-libs-2.0.2-3.el8_1.2.x86_64<br>
> > pacemaker-schemas-2.0.2-3.el8_1.2.noarch<br>
> > pcs-0.10.2-4.el8.x86_64<br>
> > resource-agents-paf-2.3.0-1.noarch<br>
> > <br>
> > Here are the errors and warnings from the pacemaker.log from the<br>
> > first<br>
> > warning until it gave up.<br>
> > <br>
> > /var/log/pacemaker/pacemaker.log:Jun 17 08:33:55 srv1 pacemaker-<br>
> > fenced<br>
> > [26722] (child_timeout_callback) warning:<br>
> > fence_vmware_soap_monitor_1 process (PID 43095) timed out<br>
> > /var/log/pacemaker/pacemaker.log:Jun 17 08:33:55 srv1 pacemaker-<br>
> > fenced<br>
> > [26722] (operation_finished) warning:<br>
> > fence_vmware_soap_monitor_1:43095 - timed out after 20000ms<br>
> > /var/log/pacemaker/pacemaker.log:Jun 17 08:33:55 srv1 pacemaker-<br>
> > controld<br>
> > [26726] (process_lrm_event) error: Result of monitor<br>
> > operation for<br>
> > vmfence on srv1: Timed Out | call=39 key=vmfence_monitor_60000<br>
> > timeout=20000ms<br>
> > /var/log/pacemaker/pacemaker.log:Jun 17 08:33:55 srv1<br>
> > pacemaker-schedulerd[26725] (unpack_rsc_op_failure) warning:<br>
> > Processing<br>
> > failed monitor of vmfence on srv1: OCF_TIMEOUT | rc=198<br>
> > /var/log/pacemaker/pacemaker.log:Jun 17 08:34:15 srv1 pacemaker-<br>
> > fenced<br>
> > [26722] (child_timeout_callback) warning:<br>
> > fence_vmware_soap_monitor_1 process (PID 43215) timed out<br>
> > /var/log/pacemaker/pacemaker.log:Jun 17 08:34:15 srv1 pacemaker-<br>
> > fenced<br>
> > [26722] (operation_finished) warning:<br>
> > fence_vmware_soap_monitor_1:43215 - timed out after 20000ms<br>
> > /var/log/pacemaker/pacemaker.log:Jun 17 08:34:15 srv1 pacemaker-<br>
> > controld<br>
> > [26726] (process_lrm_event) error: Result of start operation<br>
> > for<br>
> > vmfence on srv1: Timed Out | call=44 key=vmfence_start_0<br>
> > timeout=20000ms<br>
> > /var/log/pacemaker/pacemaker.log:Jun 17 08:34:15 srv1 pacemaker-<br>
> > controld<br>
> > [26726] (status_from_rc) warning: Action 39<br>
> > (vmfence_start_0) on<br>
> > srv1 failed (target: 0 vs. rc: 198): Error<br>
> > /var/log/pacemaker/pacemaker.log:Jun 17 08:34:15 srv1<br>
> > pacemaker-schedulerd[26725] (unpack_rsc_op_failure) warning:<br>
> > Processing<br>
> > failed start of vmfence on srv1: OCF_TIMEOUT | rc=198<br>
> > /var/log/pacemaker/pacemaker.log:Jun 17 08:34:15 srv1<br>
> > pacemaker-schedulerd[26725] (unpack_rsc_op_failure) warning:<br>
> > Processing<br>
> > failed start of vmfence on srv1: OCF_TIMEOUT | rc=198<br>
> > /var/log/pacemaker/pacemaker.log:Jun 17 08:34:15 srv1<br>
> > pacemaker-schedulerd[26725] (unpack_rsc_op_failure) warning:<br>
> > Processing<br>
> > failed start of vmfence on srv1: OCF_TIMEOUT | rc=198<br>
> > /var/log/pacemaker/pacemaker.log:Jun 17 08:34:15 srv1<br>
> > pacemaker-schedulerd[26725] (unpack_rsc_op_failure) warning:<br>
> > Processing<br>
> > failed start of vmfence on srv1: OCF_TIMEOUT | rc=198<br>
> > /var/log/pacemaker/pacemaker.log:Jun 17 08:34:15 srv1<br>
> > pacemaker-schedulerd[26725] (check_migration_threshold) <br>
> > warning:<br>
> > Forcing vmfence away from srv1 after 1000000 failures (max=5)<br>
> > /var/log/pacemaker/pacemaker.log:Jun 17 08:34:15 srv1<br>
> > pacemaker-schedulerd[26725] (unpack_rsc_op_failure) warning:<br>
> > Processing<br>
> > failed start of vmfence on srv1: OCF_TIMEOUT | rc=198<br>
> > /var/log/pacemaker/pacemaker.log:Jun 17 08:34:15 srv1<br>
> > pacemaker-schedulerd[26725] (unpack_rsc_op_failure) warning:<br>
> > Processing<br>
> > failed start of vmfence on srv1: OCF_TIMEOUT | rc=198<br>
> > /var/log/pacemaker/pacemaker.log:Jun 17 08:34:15 srv1<br>
> > pacemaker-schedulerd[26725] (check_migration_threshold) <br>
> > warning:<br>
> > Forcing vmfence away from srv1 after 1000000 failures (max=5)<br>
> > /var/log/pacemaker/pacemaker.log:Jun 17 08:34:36 srv1 pacemaker-<br>
> > controld<br>
> > [26726] (status_from_rc) warning: Action 38<br>
> > (vmfence_start_0) on<br>
> > srv2 failed (target: 0 vs. rc: 198): Error<br>
> > /var/log/pacemaker/pacemaker.log:Jun 17 08:34:36 srv1<br>
> > pacemaker-schedulerd[26725] (unpack_rsc_op_failure) warning:<br>
> > Processing<br>
> > failed start of vmfence on srv2: OCF_TIMEOUT | rc=198<br>
> > /var/log/pacemaker/pacemaker.log:Jun 17 08:34:36 srv1<br>
> > pacemaker-schedulerd[26725] (unpack_rsc_op_failure) warning:<br>
> > Processing<br>
> > failed start of vmfence on srv2: OCF_TIMEOUT | rc=198<br>
> > /var/log/pacemaker/pacemaker.log:Jun 17 08:34:36 srv1<br>
> > pacemaker-schedulerd[26725] (unpack_rsc_op_failure) warning:<br>
> > Processing<br>
> > failed start of vmfence on srv1: OCF_TIMEOUT | rc=198<br>
> > /var/log/pacemaker/pacemaker.log:Jun 17 08:34:36 srv1<br>
> > pacemaker-schedulerd[26725] (check_migration_threshold) <br>
> > warning:<br>
> > Forcing vmfence away from srv1 after 1000000 failures (max=5)<br>
> > /var/log/pacemaker/pacemaker.log:Jun 17 08:34:36 srv1<br>
> > pacemaker-schedulerd[26725] (unpack_rsc_op_failure) warning:<br>
> > Processing<br>
> > failed start of vmfence on srv2: OCF_TIMEOUT | rc=198<br>
> > /var/log/pacemaker/pacemaker.log:Jun 17 08:34:36 srv1<br>
> > pacemaker-schedulerd[26725] (unpack_rsc_op_failure) warning:<br>
> > Processing<br>
> > failed start of vmfence on srv2: OCF_TIMEOUT | rc=198<br>
> > /var/log/pacemaker/pacemaker.log:Jun 17 08:34:36 srv1<br>
> > pacemaker-schedulerd[26725] (unpack_rsc_op_failure) warning:<br>
> > Processing<br>
> > failed start of vmfence on srv1: OCF_TIMEOUT | rc=198<br>
> > /var/log/pacemaker/pacemaker.log:Jun 17 08:34:36 srv1<br>
> > pacemaker-schedulerd[26725] (check_migration_threshold) <br>
> > warning:<br>
> > Forcing vmfence away from srv1 after 1000000 failures (max=5)<br>
> > /var/log/pacemaker/pacemaker.log:Jun 17 08:34:36 srv1<br>
> > pacemaker-schedulerd[26725] (check_migration_threshold) <br>
> > warning:<br>
> > Forcing vmfence away from srv2 after 1000000 failures (max=5)<br>
> <br>
> <br>
> <br>
-- <br>
Ken Gaillot <<a href="mailto:kgaillot@redhat.com" target="_blank">kgaillot@redhat.com</a>><br>
<br>
_______________________________________________<br>
Manage your subscription:<br>
<a href="https://lists.clusterlabs.org/mailman/listinfo/users" rel="noreferrer" target="_blank">https://lists.clusterlabs.org/mailman/listinfo/users</a><br>
<br>
ClusterLabs home: <a href="https://www.clusterlabs.org/" rel="noreferrer" target="_blank">https://www.clusterlabs.org/</a><br>
</blockquote></div></div>