[ClusterLabs] <EXT>Re: Fencing errors
Lopez, Francisco Javier [Global IT]
franciscojavier.lopez at solera.com
Thu May 23 14:29:30 EDT 2019
Hello again Ken et al.
I have learned a lot while investigating this issue, but I feel I need a bit more help from you guys.
It's clear the monitoring process is reporting a timeout. Although I've increased this timeout to 30s using pcmk_monitor_timeout,
and the process has not failed during the last 2 hours, I'd like to understand in more detail how this process works. If I'm
getting a timeout after 20 secs, it looks to me like something else could be happening in my systems.
I tried enabling debug again and, as before, the 'debug' option creates the file but does not write anything to it unless I also enable 'verbose'.
Funny thing, because when I enable it I hit a bug and the fencing device does not start:
https://bugzilla.redhat.com/show_bug.cgi?id=1549366
I enabled debug at the corosync layer and got some more information, which helped me understand this issue better, but it is still
not enough to narrow down where the issue comes from.
That said, I'd like to know if there is a way to review in more detail what the monitoring process is doing (ping, status, etc.)
and whether all those seconds are spent on the same action.
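One way to check how long the monitor action takes outside of Pacemaker could be sketched like this. It assumes GNU coreutils `timeout` and a `date` that supports `+%s`; `timed_run` is a hypothetical helper name, and a 20-second limit would simply mirror the exec times of about 20000ms in the failed actions quoted further down.

```shell
# timed_run LIMIT_SECONDS COMMAND [ARGS...]
# Runs COMMAND under coreutils `timeout`, then prints the elapsed wall-clock
# seconds and the exit status (124 means the limit was reached).
timed_run() {
    limit=$1
    shift
    start=$(date +%s)
    rc=0
    timeout "$limit" "$@" || rc=$?
    end=$(date +%s)
    echo "elapsed=$((end - start))s rc=$rc"
    return "$rc"
}
```

Wrapping the fence agent's monitor call in it a few times, with the same parameters as the device configuration, would show whether the run time is always close to the limit or only occasionally.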
Any idea will be more than welcome.
As always, appreciate your help.
Regards
Javier
Francisco Javier Lopez
IT System Engineer | Global IT
O: +34 619 728 249 | M: +34 619 728 249
franciscojavier.lopez at solera.com | Solera.com
Audatex Datos, S.A. | Avda. de Bruselas, 36, Salida 16, A‑1 (Diversia), Alcobendas, Madrid, 28108, Spain
On 5/21/2019 6:19 PM, Ken Gaillot wrote:
On Tue, 2019-05-21 at 11:10 +0000, Lopez, Francisco Javier [Global IT] wrote:
Hello guys!
I need your help to understand and debug what I'm facing in one
of my clusters.
I set up fencing as follows:
# pcs -f stonith_cfg stonith create fence_ao_pg01 fence_vmware_soap \
    ipaddr=<IP> ssl_insecure=1 login="<User>" passwd="<Passwd>" \
    pcmk_reboot_action=reboot pcmk_host_list="ao-pg01-p.axadmin.net" \
    power_wait=3 op monitor interval=60s
# pcs -f stonith_cfg stonith create fence_ao_pg02 fence_vmware_soap \
    ipaddr=<IP> ssl_insecure=1 login="<User>" passwd="<Passwd>" \
    pcmk_reboot_action=reboot pcmk_host_list="ao-pg02-p.axadmin.net" \
    power_wait=3 op monitor interval=60s
# pcs -f stonith_cfg constraint location fence_ao_pg01 avoids ao-pg01-p.axadmin.net=INFINITY
# pcs -f stonith_cfg constraint location fence_ao_pg02 avoids ao-pg02-p.axadmin.net=INFINITY
# pcs cluster cib-push stonith_cfg
pcs status shows everything OK for some time and then it changes to:
[root@ao-pg01-p ~]# pcs status --full
Cluster name: ao_cl_p_01
Stack: corosync
Current DC: ao-pg01-p.axadmin.net (1) (version 1.1.19-8.el7_6.4-c3c624ea3d) - partition with quorum
Last updated: Tue May 21 12:18:46 2019
Last change: Fri May 17 18:54:32 2019 by hacluster via crmd on ao-pg01-p.axadmin.net
2 nodes configured
3 resources configured
Online: [ ao-pg01-p.axadmin.net (1) ao-pg02-p.axadmin.net (2) ]
Full list of resources:
ao-cl-p-01-vip01 (ocf::heartbeat:IPaddr2): Started ao-pg01-p.axadmin.net
fence_ao_pg01 (stonith:fence_vmware_soap): Stopped
fence_ao_pg02 (stonith:fence_vmware_soap): Stopped
Node Attributes:
* Node ao-pg01-p.axadmin.net (1):
* Node ao-pg02-p.axadmin.net (2):
Migration Summary:
* Node ao-pg02-p.axadmin.net (2):
fence_ao_pg01: migration-threshold=1000000 fail-count=1000000 last-failure='Sat May 18 00:22:22 2019'
* Node ao-pg01-p.axadmin.net (1):
fence_ao_pg02: migration-threshold=1000000 fail-count=1000000 last-failure='Fri May 17 20:52:53 2019'
Failed Actions:
* fence_ao_pg01_start_0 on ao-pg02-p.axadmin.net 'unknown error' (1): call=22, status=Timed Out, exitreason='',
    last-rc-change='Sat May 18 00:19:49 2019', queued=0ms, exec=20022ms
* fence_ao_pg02_start_0 on ao-pg01-p.axadmin.net 'unknown error' (1): call=84, status=Timed Out, exitreason='',
    last-rc-change='Fri May 17 20:52:33 2019', queued=0ms, exec=20032ms
PCSD Status:
ao-pg02-p.axadmin.net: Online
ao-pg01-p.axadmin.net: Online
Daemon Status:
corosync: active/disabled
pacemaker: active/disabled
pcsd: active/enabled
From the output I see a 'Timed Out' status, but I'd like to
understand whether this is a configuration issue
or something else I'm not aware of.
When pacemaker starts a fence device, it issues a monitor command to
the fence agent. That command is what's timing out here.
The first thing I'd try is running the monitor command manually using
the parameters in the device configuration. The fence agent likely has
a debug option you could turn on to get more details.
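A minimal sketch of that manual check follows. It assumes the agent accepts its options as key=value lines on standard input, the way fence agents are normally fed by the cluster, and it reuses the parameter names from the device configuration above. The `manual_monitor` helper and the `FENCE_AGENT` override are hypothetical names used for illustration only; `<IP>`, `<User>` and `<Passwd>` remain placeholders.

```shell
# Agent to run; override only for dry runs (FENCE_AGENT=cat just prints the options).
FENCE_AGENT=${FENCE_AGENT:-fence_vmware_soap}

# manual_monitor ADDRESS LOGIN PASSWORD [EXTRA key=value ...]
# Sends the device parameters to the agent on stdin, one key=value per line,
# and requests the monitor action. Extra options are appended unchanged.
manual_monitor() {
    addr=$1
    login=$2
    passwd=$3
    shift 3
    printf '%s\n' \
        "ipaddr=$addr" \
        "login=$login" \
        "passwd=$passwd" \
        "ssl_insecure=1" \
        "action=monitor" \
        "$@" | "$FENCE_AGENT"
}
```

For example, `manual_monitor "<IP>" "<User>" "<Passwd>" verbose=1` would add the agent's verbose option to the run, and the exit status of the call shows whether the monitor action succeeded.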
I'm attaching part of the log that shows the problem from 17 May.
Regards
"This e-mail and any attached files are confidential and intended for
the named addressee(s) only. If you have received this message in
error, please notify the sender and delete the email immediately. The
company does not conclude contracts by email and all negotiations are
subject to written contract."