<div dir="ltr">Hi Ondrej,<div>Yes, you are right. This issue was specific to floating IPs, not local IPs.</div><div><br></div><div>After becoming master, I was sending a "Neighbor Advertisement" message for my floating IPs. This was a raw message that I created myself, so I was the one setting the flags in it.</div><div>Please find attached "image1", which shows the format of the NA message. </div><div>Attached "image2" is a message capture; as you can see, both the "Override" and "Solicited" flags are set. As part of the solution, only "Override" is now set.</div><div><br></div><div>I hope this answers your questions. Please let me know if you have any further queries.</div><div><br></div><div>Thanks,</div><div>Rohit</div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Mon, Apr 8, 2019 at 6:13 PM Ondrej <<a href="mailto:ondrej-clusterlabs@famera.cz">ondrej-clusterlabs@famera.cz</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">On 4/5/19 8:18 PM, Rohit Saini wrote:<br>

> *Further update on this:*<br>

> This issue is resolved now. ILO was discarding "Neighbor Advertisement" <br>

> (NA) as Solicited flag was set in NA message. Hence it was not updating <br>

> its local neighbor table.<br>

> As per RFC, Solicited flag should be set only in NA message when it is a <br>

> response to Neighbor Solicitation.<br>

> After disabling the Solicited flag in NA message, ILO started updating <br>

> the local neighbor cache.<br>

<br>

Hi Rohit,<br>

<br>

It is great that you get consistent behaviour after the change. As I have <br>

not worked with IPv6 for quite some time, I wonder how you disabled <br>

the 'Solicited' flag. Was this done on the OS (cluster node) or on the <br>

iLO? My guess is the OS, but I have no idea how that can be accomplished.<br>

Can you share which setting you have changed to accomplish this? :)<br>

<br>

One additional note: you are using a "floating IP" that relocates to the <br>

other machine, while the cluster configuration does not seem to contain <br>

any IPaddr2 resource representing this address. I would guess that a <br>

cluster without the floating address would not have this issue, as it <br>

would use the addresses assigned to the nodes, so the mapping between IP <br>

address and MAC address would not change even when the fence_ilo4 <br>

resources move between nodes. If you intend to use the floating address <br>

in this cluster, I would suggest checking that there is also no issue <br>

when the floating address is not used or is disabled, to see how <br>

fence_ilo4 communicates. I think there might be a way in the routing <br>

tables to set which IPv6 address should communicate with the iLO IPv6 <br>

address, so you get consistent behaviour instead of using the floating IP <br>

address.<br>

<br>

Anyway, I'm glad that the mystery is resolved.<br>

<br>

--<br>

Ondrej<br>

<br>

> <br>

> On Fri, Apr 5, 2019 at 2:23 PM Rohit Saini <br>

> <<a href="mailto:rohitsaini111.forum@gmail.com" target="_blank">rohitsaini111.forum@gmail.com</a> <mailto:<a href="mailto:rohitsaini111.forum@gmail.com" target="_blank">rohitsaini111.forum@gmail.com</a>>> <br>

> wrote:<br>

> <br>

>     Hi Ondrej,<br>

>     Finally found some lead on this.. We started tcpdump on my machine<br>

>     to understand the IPMI traffic. Attaching the capture for your<br>

>     reference.<br>

>     fd00:1061:37:9021:: is my floating IP and fd00:1061:37:9002:: is my<br>

>     ILO IP.<br>

>     When resource movement happens, we are initiating the "Neighbor<br>

>     Advertisement" for fd00:1061:37:9021:: (which is on new machine now)<br>

>     so that peers can update their neighbor table and start<br>

>     communicating with the new MAC address.<br>

>     It looks like the ILO is not updating its neighbor table, as it is still<br>

>     responding to the older MAC.<br>

>     After some time, "Neighbor Solicitation" happens and ILO updates the<br>

>     neighbor table. Now this ILO becomes reachable and starts responding<br>

>     towards new MAC address.<br>

> <br>

>     My ILO firmware is 2.60. We will retest the issue after upgrading<br>

>     the firmware.<br>

> <br>

>     To verify this theory, after resource movement, I flushed the local<br>

>     neighbor table due to which "Neighbor Solicitation" was initiated<br>

>     early and this delay in getting ILO response was not seen.<br>

>     This fixed the issue.<br>

> <br>

>     We are now more interested in understanding why ILO could not update<br>

>     its neighbor table on receiving "Neighbor Advertisement". FYI,<br>

>     Override flag in "Neighbor Advertisement" is already set.<br>

> <br>

>     Thanks,<br>

>     Rohit<br>

> <br>

>     On Thu, Apr 4, 2019 at 8:37 AM Ondrej <<a href="mailto:ondrej-clusterlabs@famera.cz" target="_blank">ondrej-clusterlabs@famera.cz</a><br>

>     <mailto:<a href="mailto:ondrej-clusterlabs@famera.cz" target="_blank">ondrej-clusterlabs@famera.cz</a>>> wrote:<br>

> <br>

>         On 4/3/19 6:10 PM, Rohit Saini wrote:<br>

>          > Hi Ondrej,<br>

>          > Please find my reply below:<br>

>          ><br>

>          > 1.<br>

>          > *Stonith configuration:*<br>

>          > [root@orana ~]# pcs config<br>

>          >   Resource: fence-uc-orana (class=stonith type=fence_ilo4)<br>

>          >    Attributes: delay=0 ipaddr=fd00:1061:37:9002:: lanplus=1<br>

>         login=xyz<br>

>          > passwd=xyz pcmk_host_list=orana pcmk_reboot_action=off<br>

>          >    Meta Attrs: failure-timeout=3s<br>

>          >    Operations: monitor interval=5s on-fail=ignore<br>

>          > (fence-uc-orana-monitor-interval-5s)<br>

>          >                start interval=0s on-fail=restart<br>

>          > (fence-uc-orana-start-interval-0s)<br>

>          >   Resource: fence-uc-tigana (class=stonith type=fence_ilo4)<br>

>          >    Attributes: delay=10 ipaddr=fd00:1061:37:9001:: lanplus=1<br>

>         login=xyz<br>

>          > passwd=xyz pcmk_host_list=tigana pcmk_reboot_action=off<br>

>          >    Meta Attrs: failure-timeout=3s<br>

>          >    Operations: monitor interval=5s on-fail=ignore<br>

>          > (fence-uc-tigana-monitor-interval-5s)<br>

>          >                start interval=0s on-fail=restart<br>

>          > (fence-uc-tigana-start-interval-0s)<br>

>          ><br>

>          > Fencing Levels:<br>

>          ><br>

>          > Location Constraints:<br>

>          > Ordering Constraints:<br>

>          >    start fence-uc-orana then promote unicloud-master<br>

>         (kind:Mandatory)<br>

>          >    start fence-uc-tigana then promote unicloud-master<br>

>         (kind:Mandatory)<br>

>          > Colocation Constraints:<br>

>          >    fence-uc-orana with unicloud-master (score:INFINITY)<br>

>          > (rsc-role:Started) (with-rsc-role:Master)<br>

>          >    fence-uc-tigana with unicloud-master (score:INFINITY)<br>

>          > (rsc-role:Started) (with-rsc-role:Master)<br>

>          ><br>

>          ><br>

>          > 2. This is seen randomly. Since I am using colocation,<br>

>         stonith resources<br>

>          > are stopped and started on new master. That time, starting of<br>

>         stonith is<br>

>          > taking variable amount of time.<br>

>          > No other IPv6 issues are seen in the cluster nodes.<br>

>          ><br>

>          > 3. fence_agent version<br>

>          ><br>

>          > [root@orana ~]#  rpm -qa|grep  fence-agents-ipmilan<br>

>          > fence-agents-ipmilan-4.0.11-66.el7.x86_64<br>

>          ><br>

>          ><br>

>          > *NOTE:*<br>

>          > Both IPv4 and IPv6 are configured on my ILO, with "iLO Client<br>

>          > Applications use IPv6 first" turned on.<br>

>          > Attaching corosync logs also.<br>

>          ><br>

>          > Thanks, increasing the timeout to 60 worked. But that's not what<br>

>         exactly I am<br>

>          > looking for. I need to know exact reason behind delay of<br>

>         starting these<br>

>          > IPv6 stonith resources.<br>

>          ><br>

>          > Regards,<br>

>          > Rohit<br>

> <br>

>         Hi Rohit,<br>

> <br>

>         Thank you for response.<br>

> <br>

>           From configuration it is clear that we are using directly IP<br>

>         addresses<br>

>         so the DNS resolution issue can be ruled out. There are no<br>

>         messages from<br>

>         fence_ilo4 that would indicate reason why it timed out. So we<br>

>         cannot<br>

>         tell yet what caused the issue. I see that you have enabled<br>

>         PCMK_debug=stonith-ng most probably (or PCMK_debug=yes).<br>

> <br>

>         It is nice that increasing the timeout worked, but as said in<br>

>         previous<br>

>         email it may just mask the real reason why it takes longer to do<br>

>         monitor/start operation.<br>

> <br>

>           > Both IPv4 and IPv6 are configured on my ILO, with "iLO Client<br>

>           > Applications use IPv6 first" turned on.<br>

>         This seems to me to be more related to SNMP communication which<br>

>         we don't<br>

>         use with fence_ilo4 as far as I know. We use the ipmitool on<br>

>         port 623/udp.<br>

>         <a href="https://support.hpe.com/hpsc/doc/public/display?docId=emr_na-a00026111en_us&docLocale=en_US#N104B2" rel="noreferrer" target="_blank">https://support.hpe.com/hpsc/doc/public/display?docId=emr_na-a00026111en_us&docLocale=en_US#N104B2</a><br>

> <br>

>           > 2. This is seen randomly. Since I am using colocation,<br>

>         stonith resources<br>

>           > are stopped and started on new master. That time, starting<br>

>         of stonith is<br>

>           > taking variable amount of time.<br>

>         This is a good observation. Which leads me to question if the<br>

>         iLO has<br>

>         set any kind of session limits for the user that is used here.<br>

>         If there<br>

>         is any session limit it may be worth trying to increase it and<br>

>         test if<br>

>         the same delay can be observed. One situation when this can<br>

>         happen is<br>

>         that when one node communicates with iLO and during that time the<br>

>         communication from other node needs to happen while the limit is 1<br>

>         connection. The relocation of a resource from one node to another<br>

>         might<br>

>         fit this, but this is just speculation and fastest way to<br>

>         prove/reject<br>

>         it would be to increase limit, if there is one, and test it.<br>

> <br>

>         # What more can be done to figure out on what is causing delay?<br>

> <br>

>         1. The fence_ilo4 can be configured with attribute 'verbose=1'<br>

>         to print<br>

>         additional information when it is run. This output looks similar<br>

>         to the lines<br>

>         below and seems to provide timestamps, which is great as we<br>

>         should be able to see when what command was run. I don't have a<br>

>         testing<br>

>         machine on which to run fence_ilo4 so the below example just<br>

>         shows how<br>

>         it looks when it fails on timeout connecting.<br>

> <br>

>         Apr 03 12:34:11 [4025] fastvm-centos-7-6-31 stonith-ng: notice:<br>

>         stonith_action_async_done: Child process 4252 performing action<br>

>         'monitor' timed out with signal 15<br>

>         Apr 03 12:34:11 [4025] fastvm-centos-7-6-31 stonith-ng: warning:<br>

>         log_action: fence_ilo4[4252] stderr: [ 2019-04-03 12:33:51,193 INFO:<br>

>         Executing: /usr/bin/ipmitool -I lanplus -H<br>

>         fe80::f6bd:8a67:7eb5:214f -p<br>

>         623 -U xyz -P [set] -L ADMINISTRATOR chassis power status ]<br>

>         Apr 03 12:34:11 [4025] fastvm-centos-7-6-31 stonith-ng: warning:<br>

>         log_action: fence_ilo4[4252] stderr: [ ]<br>

> <br>

>         # pcs stonith update fence-uc-orana verbose=1<br>

> <br>

>         Note: The above shows that some private data is included in the<br>

>         logs, so<br>

>         if you have something interesting to share,<br>

>         make sure<br>

>         to strip out the sensitive data.<br>

> <br>

>         2. The version of fence-agents-ipmilan is not the latest when<br>

>         comparing<br>

>         that to my CentOS 7.6 system<br>

>         (fence-agents-ipmilan-4.2.1-11.el7_6.7.x86_64) so you may<br>

>         consider to<br>

>         try upgrading the package and see if the latest provided in your<br>

>         distribution helps by any way if that is possible.<br>

> <br>

>         3. You may check if there is any update for the iLO devices and<br>

>         see if<br>

>         the updated version exhibits the same behavior with timeouts.<br>

>          From logs<br>

>         I cannot tell what version or device the fence_ilo4 is<br>

>         communicating with.<br>

> <br>

>         4. If there is a more reliable way of triggering the<br>

>         situation when the timeout with the default 20s is observed, you can<br>

>         setup<br>

>         network packet capture with tcpdump to see what kind of<br>

>         communication is<br>

>         happening during that time. This can help to establish the idea<br>

>         if there<br>

>         is any response from the iLO device while we wait which would<br>

>         indicate<br>

>         the iLO or network to be the issue, or if the data arrives fast and the<br>

>         fence_ilo4 doesn't do anything.<br>

>         - In first case that would point more to network or iLO<br>

>         communication issue<br>

>         - In second case that would be more likely issue with fence_ilo4 or<br>

>         ipmitool that is used for communication<br>

> <br>

>         NOTE: In case that you happen to have a subscription for your<br>

>         systems<br>

>         you can try also reaching technical support to look deeper on<br>

>         collected<br>

>         data. That way you can save time figuring out how to strip the<br>

>         private<br>

>         parts from data before sharing them here.<br>

> <br>

>         ========================================================================<br>

> <br>

>         --<br>

>         Ondrej<br>

> <br>

<br>

</blockquote></div>
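<div><br></div>
<div>For reference, the flag change described at the top of this message can be sketched roughly as follows. This is only an illustration under assumptions, not the code actually used on the cluster nodes: the function name <code>build_unsolicited_na</code> and the MAC address in the usage note are hypothetical. The layout follows the Neighbor Advertisement format in RFC 4861 (ICMPv6 type 136, a 32-bit flags field with Router, Solicited and Override in the top three bits, the 16-byte target address, then a Target Link-Layer Address option). The checksum field is left at zero here on the assumption that the kernel fills it in when the message is sent through a raw ICMPv6 socket.</div>
<pre>
```python
import socket
import struct

# Constants from RFC 4861 (Neighbor Advertisement message format).
ICMPV6_NEIGHBOR_ADVERT = 136     # ICMPv6 type
NA_FLAG_ROUTER = 0x80000000      # R bit
NA_FLAG_SOLICITED = 0x40000000   # S bit: only for replies to a Neighbor Solicitation
NA_FLAG_OVERRIDE = 0x20000000    # O bit: receiver should overwrite its cached MAC
OPT_TARGET_LLADDR = 2            # Target Link-Layer Address option type


def build_unsolicited_na(target_ip, mac):
    """Build the ICMPv6 payload of an unsolicited Neighbor Advertisement.

    Hypothetical helper for illustration. Only the Override flag is set;
    the Solicited flag stays clear because this message is not a reply
    to a Neighbor Solicitation.
    """
    flags = NA_FLAG_OVERRIDE
    target = socket.inet_pton(socket.AF_INET6, target_ip)
    mac_bytes = bytes.fromhex(mac.replace(":", ""))
    if len(mac_bytes) != 6:
        raise ValueError("expected a 6-byte MAC address")
    # Type, code, checksum (zero placeholder), flags plus reserved bits.
    header = struct.pack("!BBHI", ICMPV6_NEIGHBOR_ADVERT, 0, 0, flags)
    # Option length is counted in units of 8 bytes: 2 header bytes + 6 MAC bytes.
    option = struct.pack("!BB", OPT_TARGET_LLADDR, 1) + mac_bytes
    return header + target + option
```
</pre>
<div>With the floating address from this thread and a made-up MAC, <code>build_unsolicited_na("fd00:1061:37:9021::", "52:54:00:12:34:56")</code> returns a 32-byte payload. Actually sending it would additionally require a raw ICMPv6 socket, root privileges and a hop limit of 255, which this sketch leaves out.</div>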