[ClusterLabs] <EXT>Re: Fencing errors

Lopez, Francisco Javier [Global IT] franciscojavier.lopez at solera.com
Fri May 24 03:57:27 EDT 2019


Hello guys.

Please disregard this issue. I set up a process that queries the status every 10 secs and found that
the call takes around 25 secs when it fails. In case this helps anyone else, this is what I ran in a loop:

# time fence_vmware_soap --ip xxxx --username "xxxxx" -p "xxxxx" --ssl --ssl-insecure --action status --plug ao-pg02-p.axadmin.net,ao-pg01-p.axadmin.net
Status: ON

real    0m21.999s  <<<---
user    0m15.190s
sys      0m0.294s

A normal execution takes around 14 secs, which is why it does not fail.
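
The polling loop described above might be sketched as follows. This is a minimal illustration, not the exact script used in the thread: the function name time_check and its output format are made up for this example, and the 20-second limit is an assumption taken from the failed actions quoted further down, which timed out at roughly 20 seconds (exec=20022ms).

```shell
# Sketch: time one run of a command and flag it when it reaches a limit.
# time_check and its output format are illustrative, not from the original post.
time_check() {
    limit="$1"; shift                 # limit in whole seconds
    start=$(date +%s)
    "$@" >/dev/null 2>&1              # run the command, discard its output
    rc=$?
    elapsed=$(( $(date +%s) - start ))
    if [ "$elapsed" -ge "$limit" ]; then verdict="SLOW"; else verdict="ok"; fi
    echo "$(date '+%F %T') rc=$rc elapsed=${elapsed}s limit=${limit}s $verdict"
}

# Example loop (commented out; fill in your own connection details):
# while true; do
#     time_check 20 fence_vmware_soap --ip xxxx --username "xxxxx" -p "xxxxx" \
#         --ssl --ssl-insecure --action status \
#         --plug ao-pg02-p.axadmin.net,ao-pg01-p.axadmin.net
#     sleep 10
# done
```

Each iteration prints one timestamped line, so slow runs can be matched against the cluster log afterwards.
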
Since I raised pcmk_monitor_timeout to 30, the process has been running as expected.
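
For reference, a change like this could be applied with pcs stonith update. The sketch below assumes the two device names from the configuration quoted later in the thread (fence_ao_pg01 and fence_ao_pg02). The helper name set_monitor_timeout is hypothetical, and the value 30s is an assumed spelling of the "30" mentioned above.

```shell
# Sketch: set pcmk_monitor_timeout on one or more fence devices.
# set_monitor_timeout is a hypothetical helper, not part of pcs.
set_monitor_timeout() {
    timeout_value="$1"; shift
    for dev in "$@"; do
        # "pcs stonith update <id> <option>=<value>" changes an existing device
        pcs stonith update "$dev" pcmk_monitor_timeout="$timeout_value" || return 1
    done
}

# Example (commented out; run as root on a cluster node):
# set_monitor_timeout 30s fence_ao_pg01 fence_ao_pg02
```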

Now it is up to me to find out what causes that difference on the VMware side.

Thx.
Javier

Francisco Javier Lopez
IT System Engineer | Global IT
O: +34 619 728 249 | M: +34 619 728 249
franciscojavier.lopez at solera.com | Solera.com <https://www.solera.com/>
Audatex Datos, S.A. | Avda. de Bruselas, 36, Salida 16, A-1 (Diversia), Alcobendas, Madrid, 28108, Spain

On 5/23/2019 8:29 PM, Lopez, Francisco Javier [Global IT] wrote:
Hello again Ken et al.

I learned many things while investigating this issue, but I feel I need a bit more help from you guys.

It's clear the monitoring process is reporting a timeout. Although I've increased this timeout to 30s using pcmk_monitor_timeout,
and the process has not failed during the last 2 hours, I'd like to understand in more detail how this process works. If I'm
getting a timeout after 20 secs, it looks to me like something else could be happening in my systems.

I tried enabling debug again and, as before, the 'debug' option creates the file but writes nothing to it unless I also enable 'verbose'.
The odd thing is that when I enable 'verbose', I hit a bug and the fencing does not start:

https://bugzilla.redhat.com/show_bug.cgi?id=1549366

I enabled debug at the corosync layer and got some more information that helped me understand this issue better, but still not enough
to narrow down where the issue comes from.

That said, I'd like to know if there is a way to review in more detail what the monitoring process is doing (ping, status, etc.)
and whether all those secs are spent on the same action.

Any idea will be more than welcome.

As always, appreciate your help.

Regards
Javier




On 5/21/2019 6:19 PM, Ken Gaillot wrote:

On Tue, 2019-05-21 at 11:10 +0000, Lopez, Francisco Javier [Global IT]
wrote:


Hello guys !

I need your help to understand and debug what I'm facing in one
of my clusters.

I set up fencing as follows:

# pcs -f stonith_cfg stonith create fence_ao_pg01 fence_vmware_soap \
    ipaddr=<IP> ssl_insecure=1 login="<User>" passwd="<Passwd>" \
    pcmk_reboot_action=reboot pcmk_host_list="ao-pg01-p.axadmin.net" \
    power_wait=3 op monitor interval=60s
# pcs -f stonith_cfg stonith create fence_ao_pg02 fence_vmware_soap \
    ipaddr=<IP> ssl_insecure=1 login="<User>" passwd="<Passwd>" \
    pcmk_reboot_action=reboot pcmk_host_list="ao-pg02-p.axadmin.net" \
    power_wait=3 op monitor interval=60s

# pcs -f stonith_cfg constraint location fence_ao_pg01 avoids ao-pg01-p.axadmin.net=INFINITY
# pcs -f stonith_cfg constraint location fence_ao_pg02 avoids ao-pg02-p.axadmin.net=INFINITY

# pcs cluster cib-push stonith_cfg

pcs status shows everything OK for some time, and then it changes to:

[root@ao-pg01-p ~]# pcs status --full
Cluster name: ao_cl_p_01
Stack: corosync
Current DC: ao-pg01-p.axadmin.net (1) (version 1.1.19-8.el7_6.4-c3c624ea3d) - partition with quorum
Last updated: Tue May 21 12:18:46 2019
Last change: Fri May 17 18:54:32 2019 by hacluster via crmd on ao-pg01-p.axadmin.net

2 nodes configured
3 resources configured

Online: [ ao-pg01-p.axadmin.net (1) ao-pg02-p.axadmin.net (2) ]

Full list of resources:

 ao-cl-p-01-vip01    (ocf::heartbeat:IPaddr2):    Started ao-pg01-p.axadmin.net
 fence_ao_pg01    (stonith:fence_vmware_soap):    Stopped
 fence_ao_pg02    (stonith:fence_vmware_soap):    Stopped

Node Attributes:
* Node ao-pg01-p.axadmin.net (1):
* Node ao-pg02-p.axadmin.net (2):

Migration Summary:
* Node ao-pg02-p.axadmin.net (2):
   fence_ao_pg01: migration-threshold=1000000 fail-count=1000000 last-failure='Sat May 18 00:22:22 2019'
* Node ao-pg01-p.axadmin.net (1):
   fence_ao_pg02: migration-threshold=1000000 fail-count=1000000 last-failure='Fri May 17 20:52:53 2019'

Failed Actions:
* fence_ao_pg01_start_0 on ao-pg02-p.axadmin.net 'unknown error' (1): call=22, status=Timed Out, exitreason='',
    last-rc-change='Sat May 18 00:19:49 2019', queued=0ms, exec=20022ms
* fence_ao_pg02_start_0 on ao-pg01-p.axadmin.net 'unknown error' (1): call=84, status=Timed Out, exitreason='',
    last-rc-change='Fri May 17 20:52:33 2019', queued=0ms, exec=20032ms

PCSD Status:
  ao-pg02-p.axadmin.net: Online
  ao-pg01-p.axadmin.net: Online

Daemon Status:
  corosync: active/disabled
  pacemaker: active/disabled
  pcsd: active/enabled


From the output I see there seems to be a 'Timed Out', but I'd like to
understand if this is a configuration issue
or something else I'm not aware of.


When pacemaker starts a fence device, it issues a monitor command to
the fence agent. That command is what's timing out here.

The first thing I'd try is running the monitor command manually using
the parameters in the device configuration. The fence agent likely has
a debug option you could turn on to get more details.
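
Running the agent by hand, as suggested above, might look like the following sketch. It assumes the agent accepts the long options used elsewhere in this thread plus --verbose. The wrapper name run_agent_action and the log path are hypothetical, and the connection values are placeholders to be replaced with the ones from the device configuration.

```shell
# Sketch: run one fence agent action by hand, keep its verbose output,
# and report the exit code. run_agent_action is a hypothetical helper.
run_agent_action() {
    action="$1"; logfile="$2"; shift 2
    # Append the action and --verbose to the agent command line given by the caller
    "$@" --action "$action" --verbose >"$logfile" 2>&1
    rc=$?
    echo "action=$action rc=$rc log=$logfile"
    return "$rc"
}

# Example (commented out; substitute the values from your device configuration):
# run_agent_action monitor /tmp/fence_monitor.log \
#     fence_vmware_soap --ip <IP> --username "<User>" --password "<Passwd>" \
#     --ssl --ssl-insecure
```

A non-zero rc, or a run that takes longer than the configured timeout, points at the agent or the device rather than at pacemaker itself.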



I'm attaching part of the log that shows the problem from 17 May.

Regards
_______________________________________________
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


