<div dir="ltr">Hi Ondrej,<div>Finally found a lead on this. We ran tcpdump on my machine to understand the IPMI traffic. Attaching the capture for your reference.</div><div>fd00:1061:37:9021:: is my floating IP and fd00:1061:37:9002:: is my ILO IP.</div><div>When the resource moves, we send a "Neighbor Advertisement" for fd00:1061:37:9021:: (which is now on the new machine) so that peers can update their neighbor tables and start communicating with the new MAC address.</div><div>It looks like the ILO is not updating its neighbor table, as it is still sending responses to the old MAC address.</div><div>After some time, a "Neighbor Solicitation" happens and the ILO updates its neighbor table. The ILO then becomes reachable and starts responding to the new MAC address.</div><div><br></div><div>My ILO firmware is 2.60. We will try to reproduce the issue again after upgrading the firmware.</div><div><br></div><div>To verify this theory, I flushed the local neighbor table after the resource movement, which triggered the "Neighbor Solicitation" early, and the delay in getting the ILO response was no longer seen.</div><div>This fixed the issue.</div><div><br></div><div>We are now more interested in understanding why the ILO could not update its neighbor table on receiving the "Neighbor Advertisement". FYI, the Override flag in the "Neighbor Advertisement" is already set.</div><div><br></div><div>Thanks,</div><div>Rohit</div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Thu, Apr 4, 2019 at 8:37 AM Ondrej <<a href="mailto:ondrej-clusterlabs@famera.cz">ondrej-clusterlabs@famera.cz</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">On 4/3/19 6:10 PM, Rohit Saini wrote:<br>

> Hi Ondrej,<br>

> Please find my reply below:<br>

> <br>

> 1.<br>

> *Stonith configuration:*<br>

> [root@orana ~]# pcs config<br>

>   Resource: fence-uc-orana (class=stonith type=fence_ilo4)<br>

>    Attributes: delay=0 ipaddr=fd00:1061:37:9002:: lanplus=1 login=xyz <br>

> passwd=xyz pcmk_host_list=orana pcmk_reboot_action=off<br>

>    Meta Attrs: failure-timeout=3s<br>

>    Operations: monitor interval=5s on-fail=ignore <br>

> (fence-uc-orana-monitor-interval-5s)<br>

>                start interval=0s on-fail=restart <br>

> (fence-uc-orana-start-interval-0s)<br>

>   Resource: fence-uc-tigana (class=stonith type=fence_ilo4)<br>

>    Attributes: delay=10 ipaddr=fd00:1061:37:9001:: lanplus=1 login=xyz <br>

> passwd=xyz pcmk_host_list=tigana pcmk_reboot_action=off<br>

>    Meta Attrs: failure-timeout=3s<br>

>    Operations: monitor interval=5s on-fail=ignore <br>

> (fence-uc-tigana-monitor-interval-5s)<br>

>                start interval=0s on-fail=restart <br>

> (fence-uc-tigana-start-interval-0s)<br>

> <br>

> Fencing Levels:<br>

> <br>

> Location Constraints:<br>

> Ordering Constraints:<br>

>    start fence-uc-orana then promote unicloud-master (kind:Mandatory)<br>

>    start fence-uc-tigana then promote unicloud-master (kind:Mandatory)<br>

> Colocation Constraints:<br>

>    fence-uc-orana with unicloud-master (score:INFINITY) <br>

> (rsc-role:Started) (with-rsc-role:Master)<br>

>    fence-uc-tigana with unicloud-master (score:INFINITY) <br>

> (rsc-role:Started) (with-rsc-role:Master)<br>

> <br>

> <br>

> 2. This is seen randomly. Since I am using colocation, stonith resources <br>

> are stopped and started on new master. That time, starting of stonith is <br>

> taking variable amount of time.<br>

> No other IPv6 issues are seen in the cluster nodes.<br>

> <br>

> 3. fence_agent version<br>

> <br>

> [root@orana ~]#  rpm -qa|grep  fence-agents-ipmilan<br>

> fence-agents-ipmilan-4.0.11-66.el7.x86_64<br>

> <br>

> <br>

> *NOTE:*<br>

> Both IPv4 and IPv6 are configured on my ILO, with "iLO Client <br>

> Applications use IPv6 first" turned on.<br>

> Attaching corosync logs also.<br>

> <br>

> Thanks, increasing timeout to 60 worked. But thats not what exactly I am <br>

> looking for. I need to know exact reason behind delay of starting these <br>

> IPv6 stonith resources.<br>

> <br>

> Regards,<br>

> Rohit<br>

<br>

Hi Rohit,<br>

<br>

Thank you for the response.<br>

<br>

From the configuration it is clear that IP addresses are used directly, <br>

so a DNS resolution issue can be ruled out. There are no messages from <br>

fence_ilo4 that would indicate why it timed out, so we cannot <br>

tell yet what caused the issue. I see that you have most probably enabled <br>

PCMK_debug=stonith-ng (or PCMK_debug=yes).<br>

<br>

It is nice that increasing the timeout worked, but as said in the previous <br>

email, it may just mask the real reason why the monitor/start operation <br>

takes longer.<br>

<br>

 > Both IPv4 and IPv6 are configured on my ILO, with "iLO Client<br>

 > Applications use IPv6 first" turned on.<br>

This seems to me to be more related to SNMP communication, which we don't <br>

use with fence_ilo4 as far as I know. We use ipmitool on port 623/udp.<br>

<a href="https://support.hpe.com/hpsc/doc/public/display?docId=emr_na-a00026111en_us&docLocale=en_US#N104B2" rel="noreferrer" target="_blank">https://support.hpe.com/hpsc/doc/public/display?docId=emr_na-a00026111en_us&docLocale=en_US#N104B2</a><br>

<br>

 > 2. This is seen randomly. Since I am using colocation, stonith resources<br>

 > are stopped and started on new master. That time, starting of stonith is<br>

 > taking variable amount of time.<br>

This is a good observation, which leads me to ask whether the iLO has <br>

any kind of session limit set for the user that is used here. If there <br>

is a session limit, it may be worth increasing it and testing whether <br>

the same delay can be observed. One situation where this can happen is <br>

when one node is communicating with the iLO and, during that time, the <br>

other node needs to communicate as well while the limit is 1 <br>

connection. The relocation of a resource from one node to another might <br>

fit this, but this is just speculation, and the fastest way to prove or <br>

reject it would be to increase the limit, if there is one, and test it.<br>

<br>

# What more can be done to figure out what is causing the delay?<br>

<br>

1. fence_ilo4 can be configured with the attribute 'verbose=1' to print <br>

additional information when it is run. The data looks similar to the <br>

output below and seems to include timestamps, which is great as we <br>

should be able to see which command was run when. I don't have a test <br>

machine on which to run fence_ilo4, so the example below just shows how <br>

it looks when it times out while connecting.<br>

<br>

Apr 03 12:34:11 [4025] fastvm-centos-7-6-31 stonith-ng: notice:<br>

stonith_action_async_done: Child process 4252 performing action<br>

'monitor' timed out with signal 15<br>

Apr 03 12:34:11 [4025] fastvm-centos-7-6-31 stonith-ng: warning:<br>

log_action: fence_ilo4[4252] stderr: [ 2019-04-03 12:33:51,193 INFO:<br>

Executing: /usr/bin/ipmitool -I lanplus -H fe80::f6bd:8a67:7eb5:214f -p<br>

623 -U xyz -P [set] -L ADMINISTRATOR chassis power status ]<br>

Apr 03 12:34:11 [4025] fastvm-centos-7-6-31 stonith-ng: warning:<br>

log_action: fence_ilo4[4252] stderr: [ ]<br>

<br>

# pcs stonith update fence-uc-orana verbose=1<br>

<br>

Note: The above shows that some private data is included in the logs, so <br>

if you have something interesting to share, make sure <br>

to strip out the sensitive data first.<br>
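
<br>

To cross-check the timestamps from the verbose output, the same ipmitool call that appears in the log above could also be timed by hand from each node. This is only a rough sketch, assuming a shell where 'date +%s' is available (for example GNU coreutils); elapsed_seconds is a hypothetical helper name, not part of fence-agents, and the address and credentials in the commented usage line are the placeholders from this thread.<br>

```shell
#!/bin/sh
# elapsed_seconds (hypothetical helper): run a command, discard its output,
# and print how many whole seconds it took together with its exit status.
elapsed_seconds() {
    start=$(date +%s)
    "$@" >/dev/null 2>&1
    rc=$?
    end=$(date +%s)
    echo "$((end - start))s rc=$rc"
}

# Usage on a cluster node (placeholder address and credentials from this thread):
#   elapsed_seconds ipmitool -I lanplus -H fd00:1061:37:9002:: -p 623 \
#       -U xyz -P xyz -L ADMINISTRATOR chassis power status
```

Running it a few times right after the stonith resource has moved, and again later, might show whether the delay is tied to the relocation.<br>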

<br>

2. The version of fence-agents-ipmilan is not the latest compared <br>

to the one on my CentOS 7.6 system <br>

(fence-agents-ipmilan-4.2.1-11.el7_6.7.x86_64), so you may consider <br>

upgrading the package, if that is possible, to see whether the latest <br>

version provided in your distribution helps in any way.<br>

<br>

3. You may check if there is any update for the iLO devices and see if <br>

the updated version exhibits the same behavior with timeouts. From the logs <br>

I cannot tell what version or device fence_ilo4 is communicating with.<br>

<br>

4. If there is a reliable way of triggering the <br>

situation where the timeout with the default 20s is observed, you can set up a <br>

network packet capture with tcpdump to see what kind of communication is <br>

happening during that time. This can help to establish whether there <br>

is any response from the iLO device while we wait (a missing response would <br>

indicate the iLO or the network to be the issue), or whether the data arrives fast and <br>

fence_ilo4 doesn't do anything with it.<br>

- The first case would point more to a network or iLO communication issue.<br>

- The second case would more likely be an issue with fence_ilo4 or <br>

ipmitool, which is used for communication.<br>
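
<br>

The capture described in point 4 could be set up roughly as follows. This is only a sketch: the interface name eth0 and the output path are assumptions, and capture_cmd is a hypothetical helper that just prints the tcpdump command so it can be reviewed before being run as root. The filter keeps IPMI traffic (623/udp, as mentioned above) plus ICMPv6, which also carries IPv6 neighbor discovery.<br>

```shell
#!/bin/sh
# Assumed values; adjust to your environment.
IFACE="${IFACE:-eth0}"             # assumption: interface facing the iLO network
OUT="${OUT:-/tmp/ilo-ipmi.pcap}"   # assumption: where to write the capture

# capture_cmd (hypothetical helper): print the tcpdump command line.
# -n disables name resolution, -i selects the interface, -w writes raw packets.
capture_cmd() {
    printf "tcpdump -n -i %s -w %s 'udp port 623 or icmp6'\n" "$IFACE" "$OUT"
}

capture_cmd
```

Running the printed command on the node that starts the stonith resource, and then reproducing the relocation, should show whether the iLO answers late or not at all.<br>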

<br>

NOTE: If you happen to have a subscription for your systems, <br>

you can also try reaching out to technical support to look deeper into the collected <br>

data. That way you can save the time spent figuring out how to strip the private <br>

parts from the data before sharing it here.<br>

<br>

========================================================================<br>

<br>

--<br>

Ondrej<br>

</blockquote></div>