[ClusterLabs] Fencing errors

Tue May 21 12:19:38 EDT 2019

On Tue, 2019-05-21 at 11:10 +0000, Lopez, Francisco Javier [Global IT]
wrote:
> Hello guys!
> 
> I need your help to understand and debug a problem I'm facing in one
> of my clusters.
> 
> I set up fencing as follows:
> 
> # pcs -f stonith_cfg stonith create fence_ao_pg01 fence_vmware_soap \
>     ipaddr=<IP> ssl_insecure=1 login="<User>" passwd="<Passwd>" \
>     pcmk_reboot_action=reboot pcmk_host_list="ao-pg01-p.axadmin.net" \
>     power_wait=3 op monitor interval=60s
> # pcs -f stonith_cfg stonith create fence_ao_pg02 fence_vmware_soap \
>     ipaddr=<IP> ssl_insecure=1 login="<User>" passwd="<Passwd>" \
>     pcmk_reboot_action=reboot pcmk_host_list="ao-pg02-p.axadmin.net" \
>     power_wait=3 op monitor interval=60s
> 
> # pcs -f stonith_cfg constraint location fence_ao_pg01 avoids ao-pg01-p.axadmin.net=INFINITY
> # pcs -f stonith_cfg constraint location fence_ao_pg02 avoids ao-pg02-p.axadmin.net=INFINITY
> 
> # pcs cluster cib-push stonith_cfg
> 
> pcs status shows everything OK for some time, and then it changes to:
> 
> [root@ao-pg01-p ~]# pcs status --full
> Cluster name: ao_cl_p_01
> Stack: corosync
> Current DC: ao-pg01-p.axadmin.net (1) (version 1.1.19-8.el7_6.4-c3c624ea3d) - partition with quorum
> Last updated: Tue May 21 12:18:46 2019
> Last change: Fri May 17 18:54:32 2019 by hacluster via crmd on ao-pg01-p.axadmin.net
> 
> 2 nodes configured
> 3 resources configured
> 
> Online: [ ao-pg01-p.axadmin.net (1) ao-pg02-p.axadmin.net (2) ]
> 
> Full list of resources:
> 
>  ao-cl-p-01-vip01    (ocf::heartbeat:IPaddr2):    Started ao-pg01-p.axadmin.net
>  fence_ao_pg01    (stonith:fence_vmware_soap):    Stopped
>  fence_ao_pg02    (stonith:fence_vmware_soap):    Stopped
> 
> Node Attributes:
> * Node ao-pg01-p.axadmin.net (1):
> * Node ao-pg02-p.axadmin.net (2):
> 
> Migration Summary:
> * Node ao-pg02-p.axadmin.net (2):
>    fence_ao_pg01: migration-threshold=1000000 fail-count=1000000 last-failure='Sat May 18 00:22:22 2019'
> * Node ao-pg01-p.axadmin.net (1):
>    fence_ao_pg02: migration-threshold=1000000 fail-count=1000000 last-failure='Fri May 17 20:52:53 2019'
> 
> Failed Actions:
> * fence_ao_pg01_start_0 on ao-pg02-p.axadmin.net 'unknown error' (1): call=22, status=Timed Out, exitreason='',
>     last-rc-change='Sat May 18 00:19:49 2019', queued=0ms, exec=20022ms
> * fence_ao_pg02_start_0 on ao-pg01-p.axadmin.net 'unknown error' (1): call=84, status=Timed Out, exitreason='',
>     last-rc-change='Fri May 17 20:52:33 2019', queued=0ms, exec=20032ms
> 
> PCSD Status:
>   ao-pg02-p.axadmin.net: Online
>   ao-pg01-p.axadmin.net: Online
> 
> Daemon Status:
>   corosync: active/disabled
>   pacemaker: active/disabled
>   pcsd: active/enabled
> 
> 
> From the output I see that the start operations 'Timed Out', but I'd
> like to understand whether this is a configuration issue or something
> else I'm not aware of.

When pacemaker starts a fence device, it issues a monitor command to
the fence agent. That command is what's timing out here.

The first thing I'd try is running the monitor command manually using
the parameters in the device configuration. The fence agent likely has
a debug option you could turn on to get more details.
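
As a rough sketch of that manual check (not a verified procedure for
this cluster): the script below assembles a fence_vmware_soap monitor
call from the same values used in the device configuration, adding the
agent's --verbose option for debug output. The FENCE_* variables,
DRY_RUN and monitor_device are names made up for this illustration, and
<IP>, <User> and <Passwd> remain the placeholders from the original
post. By default it only prints the command; set DRY_RUN=0 on a cluster
node to actually run it and see how long the agent takes.

```shell
#!/bin/sh
# Sketch: run the fence agent's monitor action by hand, with verbose output.
# <IP>, <User> and <Passwd> are the placeholders from the original post;
# the FENCE_* and DRY_RUN variable names are hypothetical.
FENCE_IP="${FENCE_IP:-<IP>}"
FENCE_USER="${FENCE_USER:-<User>}"
FENCE_PASS="${FENCE_PASS:-<Passwd>}"

# DRY_RUN=1 (the default here) only prints the command instead of running it.
DRY_RUN="${DRY_RUN:-1}"

monitor_device() {
    # Options mirror ipaddr, login, passwd and ssl_insecure=1 from the
    # stonith resource; --verbose asks the agent for debug output.
    set -- fence_vmware_soap \
        --ip="$FENCE_IP" \
        --username="$FENCE_USER" \
        --password="$FENCE_PASS" \
        --ssl-insecure \
        --action=monitor \
        --verbose

    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
        return 0
    fi

    # Time the call so it can be compared with the timeout seen in the status.
    start=$(date +%s)
    "$@"
    rc=$?
    end=$(date +%s)
    echo "monitor exited with rc=$rc after $((end - start))s"
    return "$rc"
}

monitor_device
```

If the call takes close to or longer than the roughly 20 seconds seen
in exec=20022ms, that would be consistent with the start timing out,
and the verbose output should show where the time goes. Because the
fail-count is already at 1000000, the device will likely stay stopped
until the failure is cleared (for example with
"pcs stonith cleanup fence_ao_pg01") once the underlying problem is
fixed.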

> 
> I'm attaching the part of the log that shows the problem from 17 May.
> 
> Regards,
> Francisco Javier Lopez
-- 
Ken Gaillot <kgaillot at redhat.com>