[ClusterLabs] Stonith stops after vSphere restart

Thu Feb 22 06:28:45 EST 2018

The state of a stonith resource should have no impact on the actual
stonith operation. It only reflects whether the monitor was successful
and serves as a warning to the administrator that something may be wrong.
It should clear itself automatically after failure-timeout has expired.
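
As a rough sketch only (not tested on this cluster), failure-timeout is a
resource meta attribute and can be set with pcs. The 120s and 2min values
below are assumptions picked for illustration, and the exact pcs syntax
may differ between pcs versions:

```shell
# Assumed value: let recorded failures of the stonith resource expire after 120s.
pcs resource meta vmware_soap failure-timeout=120s

# Expired failures are only acted on when the cluster re-evaluates its state,
# so a shorter recheck interval (default 15min) may be needed as well.
pcs property set cluster-recheck-interval=2min

# Show the current fail count of the stonith resource.
pcs resource failcount show vmware_soap
```

With something like this in place, the failed start should expire without
a manual cleanup once vCenter is reachable again.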

On Thu, Feb 22, 2018 at 1:58 PM,  <jota at disroot.org> wrote:
>
> Hi,
>
> I have a 2 node pacemaker cluster configured with the fence agent
> vmware_soap.
> Everything works fine until the vCenter is restarted. After that, stonith
> fails and stops.
>
> [root@node1 ~]# pcs status
> Cluster name: psqltest
> Stack: corosync
> Current DC: node2 (version 1.1.16-12.el7_4.7-94ff4df) - partition with
> quorum
> Last updated: Thu Feb 22 11:30:22 2018
> Last change: Mon Feb 19 09:28:37 2018 by root via crm_resource on node1
>
> 2 nodes configured
> 6 resources configured
>
> Online: [ node1 node2 ]
>
> Full list of resources:
>
> Master/Slave Set: ms_drbd_psqltest [drbd_psqltest]
> Masters: [ node1 ]
> Slaves: [ node2 ]
> Resource Group: pgsqltest
> psqltestfs (ocf::heartbeat:Filesystem): Started node1
> psqltest_vip (ocf::heartbeat:IPaddr2): Started node1
> postgresql-94 (ocf::heartbeat:pgsql): Started node1
> vmware_soap (stonith:fence_vmware_soap): Stopped
>
> Failed Actions:
> * vmware_soap_start_0 on node1 'unknown error' (1): call=38, status=Error,
> exitreason='none',
> last-rc-change='Thu Feb 22 10:55:46 2018', queued=0ms, exec=5374ms
> * vmware_soap_start_0 on node2 'unknown error' (1): call=56, status=Error,
> exitreason='none',
> last-rc-change='Thu Feb 22 10:55:39 2018', queued=0ms, exec=5479ms
>
> Daemon Status:
> corosync: active/enabled
> pacemaker: active/enabled
> pcsd: active/enabled
>
>
> [root@node1 ~]# pcs stonith show --full
> Resource: vmware_soap (class=stonith type=fence_vmware_soap)
> Attributes: inet4_only=1 ipaddr=192.168.1.1 ipport=443 login=MYDOMAIN\User
> passwd=mypass pcmk_host_list=node1,node2 power_wait=3 ssl_insecure=1 action=
> pcmk_list_timeout=120s pcmk_monitor_timeout=120s pcmk_status_timeout=120s
> Operations: monitor interval=60s (vmware_soap-monitor-interval-60s)
>
>
> I need to manually run "resource cleanup vmware_soap" to bring it online
> again.
> Is there any way to do this automatically?
> Is it possible to detect that vSphere is online again and re-enable
> stonith?
>
> Thanks.