<div dir="ltr">Hi Ondrej,<div>Yes, you are right. This issue was specific to floating IPs, not with local IPs.</div><div><br></div><div>Post becoming master, I was sending "Neighbor Advertisement" message for my floating IPs. This was a raw message which was created by me, so I was the one who was setting flags in it.</div><div>Please find attached "image1" which is the message format of NA message. </div><div>Attached "image2" which a message capture, as you can see "Override" and "Solicited" flag both are set. As part of solution, now only "Override" is set.</div><div><br></div><div>Hope I answer your questions. Please let me know any queries.</div><div><br></div><div>Thanks,</div><div>Rohit</div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Mon, Apr 8, 2019 at 6:13 PM Ondrej <<a href="mailto:ondrej-clusterlabs@famera.cz">ondrej-clusterlabs@famera.cz</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">On 4/5/19 8:18 PM, Rohit Saini wrote:<br>
> *Further update on this:*<br>
> This issue is resolved now. ILO was discarding the "Neighbor Advertisement" <br>
> (NA) because the Solicited flag was set in the NA message. Hence it was not updating <br>
> its local neighbor table.<br>
> As per the RFC, the Solicited flag should be set in an NA message only when it is a <br>
> response to a Neighbor Solicitation.<br>
> After disabling the Solicited flag in the NA message, ILO started updating <br>
> the local neighbor cache.<br>
<br>
Hi Rohit,<br>
<br>
Sounds great that after the change you get consistent behaviour. As I have <br>
not worked with IPv6 for quite some time, I wonder how you disabled <br>
the 'Solicited' flag. Was this done on the OS (cluster node) or on the <br>
iLO? My guess is the OS, but I have no idea how that can be accomplished.<br>
Can you share which setting you changed to accomplish this? :)<br>
<br>
One additional note: the observation here is that you are using the <br>
"floating IP" that relocates to the other machine, while the cluster <br>
configuration does not seem to contain any IPaddr2 resource that would <br>
represent this address. I would guess that a cluster without the <br>
floating address would not have this issue, as it would use the addresses <br>
assigned to the nodes and therefore the mapping between IP address and <br>
MAC address would not change even when the fence_ilo4 resources are <br>
moving between nodes. If the intention is to use the floating address <br>
in this cluster, I would suggest checking whether the issue is also gone <br>
when "not using the floating address" or when it is disabled, to see how <br>
fence_ilo4 communicates. I think that there might be a way in the routing <br>
tables to set which IPv6 address should communicate with the iLO IPv6 <br>
address, so you get consistent behaviour instead of using the floating IP <br>
address.<br>
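<br>
Just to illustrate what I mean (an untested sketch: the interface name <br>
and the node address are placeholders, the /64 prefix around the iLO <br>
address is an assumption, and I have not checked that your kernel <br>
honours the 'src' hint on IPv6 routes):<br>
<br>
# ip -6 route replace fd00:1061:37:9002::/64 dev eth0 src NODE_STATIC_IPV6<br>
<br>
The 'src' hint tells the kernel which of the node's own addresses to <br>
prefer as the source for traffic to that prefix, so the iLO would keep <br>
seeing the same address/MAC pair even while the floating IP moves <br>
between nodes.<br>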
<br>
Anyway, I'm glad that the mystery is resolved.<br>
<br>
--<br>
Ondrej<br>
<br>
> <br>
> On Fri, Apr 5, 2019 at 2:23 PM Rohit Saini <br>
> <<a href="mailto:rohitsaini111.forum@gmail.com" target="_blank">rohitsaini111.forum@gmail.com</a> <mailto:<a href="mailto:rohitsaini111.forum@gmail.com" target="_blank">rohitsaini111.forum@gmail.com</a>>> <br>
> wrote:<br>
> <br>
> Hi Ondrej,<br>
> Finally found some lead on this.. We started tcpdump on my machine<br>
> to understand the IPMI traffic. Attaching the capture for your<br>
> reference.<br>
> fd00:1061:37:9021:: is my floating IP and fd00:1061:37:9002:: is my<br>
> ILO IP.<br>
> When resource movement happens, we are initiating the "Neighbor<br>
> Advertisement" for fd00:1061:37:9021:: (which is on new machine now)<br>
> so that peers can update their neighbor table and starts<br>
> communication with new MAC address.<br>
> Looks like ILO is not updating its neighbor table, as it is still<br>
> responding to the older MAC.<br>
> After sometime, "Neighbor Solicitation" happens and ILO updates the<br>
> neighbor table. Now this ILO becomes reachable and starts responding<br>
> towards new MAC address.<br>
> <br>
> My ILO firmware is 2.60. We will try to reproduce the issue after upgrading<br>
> my firmware.<br>
> <br>
> To verify this theory, after resource movement, I flushed the local<br>
> neighbor table due to which "Neighbor Solicitation" was initiated<br>
> early and this delay in getting ILO response was not seen.<br>
> This fixed the issue.<br>
> <br>
> We are now more interested in understanding why ILO could not update<br>
> its neighbor table on receiving "Neighbor Advertisement". FYI,<br>
> Override flag in "Neighbor Advertisement" is already set.<br>
> <br>
> Thanks,<br>
> Rohit<br>
> <br>
> On Thu, Apr 4, 2019 at 8:37 AM Ondrej <<a href="mailto:ondrej-clusterlabs@famera.cz" target="_blank">ondrej-clusterlabs@famera.cz</a><br>
> <mailto:<a href="mailto:ondrej-clusterlabs@famera.cz" target="_blank">ondrej-clusterlabs@famera.cz</a>>> wrote:<br>
> <br>
> On 4/3/19 6:10 PM, Rohit Saini wrote:<br>
> > Hi Ondrej,<br>
> > Please find my reply below:<br>
> ><br>
> > 1.<br>
> > *Stonith configuration:*<br>
> > [root@orana ~]# pcs config<br>
> > Resource: fence-uc-orana (class=stonith type=fence_ilo4)<br>
> > Attributes: delay=0 ipaddr=fd00:1061:37:9002:: lanplus=1<br>
> login=xyz<br>
> > passwd=xyz pcmk_host_list=orana pcmk_reboot_action=off<br>
> > Meta Attrs: failure-timeout=3s<br>
> > Operations: monitor interval=5s on-fail=ignore<br>
> > (fence-uc-orana-monitor-interval-5s)<br>
> > start interval=0s on-fail=restart<br>
> > (fence-uc-orana-start-interval-0s)<br>
> > Resource: fence-uc-tigana (class=stonith type=fence_ilo4)<br>
> > Attributes: delay=10 ipaddr=fd00:1061:37:9001:: lanplus=1<br>
> login=xyz<br>
> > passwd=xyz pcmk_host_list=tigana pcmk_reboot_action=off<br>
> > Meta Attrs: failure-timeout=3s<br>
> > Operations: monitor interval=5s on-fail=ignore<br>
> > (fence-uc-tigana-monitor-interval-5s)<br>
> > start interval=0s on-fail=restart<br>
> > (fence-uc-tigana-start-interval-0s)<br>
> ><br>
> > Fencing Levels:<br>
> ><br>
> > Location Constraints:<br>
> > Ordering Constraints:<br>
> > start fence-uc-orana then promote unicloud-master<br>
> (kind:Mandatory)<br>
> > start fence-uc-tigana then promote unicloud-master<br>
> (kind:Mandatory)<br>
> > Colocation Constraints:<br>
> > fence-uc-orana with unicloud-master (score:INFINITY)<br>
> > (rsc-role:Started) (with-rsc-role:Master)<br>
> > fence-uc-tigana with unicloud-master (score:INFINITY)<br>
> > (rsc-role:Started) (with-rsc-role:Master)<br>
> ><br>
> ><br>
> > 2. This is seen randomly. Since I am using colocation,<br>
> stonith resources<br>
> > are stopped and started on new master. That time, starting of<br>
> stonith is<br>
> > taking variable amount of time.<br>
> > No other IPv6 issues are seen in the cluster nodes.<br>
> ><br>
> > 3. fence_agent version<br>
> ><br>
> > [root@orana ~]# rpm -qa|grep fence-agents-ipmilan<br>
> > fence-agents-ipmilan-4.0.11-66.el7.x86_64<br>
> ><br>
> ><br>
> > *NOTE:*<br>
> > Both IPv4 and IPv6 are configured on my ILO, with "iLO Client<br>
> > Applications use IPv6 first" turned on.<br>
> > Attaching corosync logs also.<br>
> ><br>
> > Thanks, increasing timeout to 60 worked. But thats not what<br>
> exactly I am<br>
> > looking for. I need to know exact reason behind delay of<br>
> starting these<br>
> > IPv6 stonith resources.<br>
> ><br>
> > Regards,<br>
> > Rohit<br>
> <br>
> Hi Rohit,<br>
> <br>
> Thank you for response.<br>
> <br>
> From the configuration it is clear that we are using IP<br>
> addresses directly,<br>
> so the DNS resolution issue can be ruled out. There are no<br>
> messages from<br>
> fence_ilo4 that would indicate reason why it timed out. So we<br>
> cannot<br>
> tell yet what caused the issue. I see that you have enabled<br>
> PCMK_debug=stonith-ng most probably (or PCMK_debug=yes),<br>
> <br>
> It is nice that increasing the timeout worked, but as said in<br>
> previous<br>
> email it may just mask the real reason why it takes longer to do<br>
> monitor/start operation.<br>
> <br>
> > Both IPv4 and IPv6 are configured on my ILO, with "iLO Client<br>
> > Applications use IPv6 first" turned on.<br>
> This seems to me to be more related to SNMP communication which<br>
> we don't<br>
> use with fence_ilo4 as far as I know. We use the ipmitool on<br>
> port 623/udp.<br>
> <a href="https://support.hpe.com/hpsc/doc/public/display?docId=emr_na-a00026111en_us&docLocale=en_US#N104B2" rel="noreferrer" target="_blank">https://support.hpe.com/hpsc/doc/public/display?docId=emr_na-a00026111en_us&docLocale=en_US#N104B2</a><br>
> <br>
> > 2. This is seen randomly. Since I am using colocation,<br>
> stonith resources<br>
> > are stopped and started on new master. That time, starting<br>
> of stonith is<br>
> > taking variable amount of time.<br>
> This is a good observation. Which leads me to question if the<br>
> iLO has<br>
> set any kind of session limits for the user that is used here.<br>
> If there<br>
> is any session limit it may be worth trying to increase it and<br>
> test if<br>
> the same delay can be observed. One situation when this can<br>
> happen is<br>
> that when one node communicates with iLO and during that time the<br>
> communication from other node needs to happen while the limit is 1<br>
> connection. The relocation of the resource from one node to another<br>
> might<br>
> fit this, but this is just speculation and fastest way to<br>
> prove/reject<br>
> it would be to increase limit, if there is one, and test it.<br>
> <br>
> # What more can be done to figure out on what is causing delay?<br>
> <br>
> 1. The fence_ilo4 can be configured with attribute 'verbose=1'<br>
> to print<br>
> additional information when it is run. This data looks similar<br>
> to the example<br>
> below and seems to provide timestamps, which is great as we<br>
> should be able to see when each command was run. I don't have a<br>
> testing<br>
> machine on which to run fence_ilo4 so the below example just<br>
> shows how<br>
> it looks when it fails on timeout connecting.<br>
> <br>
> Apr 03 12:34:11 [4025] fastvm-centos-7-6-31 stonith-ng: notice:<br>
> stonith_action_async_done: Child process 4252 performing action<br>
> 'monitor' timed out with signal 15<br>
> Apr 03 12:34:11 [4025] fastvm-centos-7-6-31 stonith-ng: warning:<br>
> log_action: fence_ilo4[4252] stderr: [ 2019-04-03 12:33:51,193 INFO:<br>
> Executing: /usr/bin/ipmitool -I lanplus -H<br>
> fe80::f6bd:8a67:7eb5:214f -p<br>
> 623 -U xyz -P [set] -L ADMINISTRATOR chassis power status ]<br>
> Apr 03 12:34:11 [4025] fastvm-centos-7-6-31 stonith-ng: warning:<br>
> log_action: fence_ilo4[4252] stderr: [ ]<br>
> <br>
> # pcs stonith update fence-uc-orana verbose=1<br>
> <br>
> Note: That above shows that some private data are included in<br>
> logs, so<br>
> in case that you have there something interesting for sharing<br>
> make sure<br>
> to strip out the sensitive data.<br>
> <br>
> 2. The version of fence-agents-ipmilan is not the latest when<br>
> comparing<br>
> that to my CentOS 7.6 system<br>
> (fence-agents-ipmilan-4.2.1-11.el7_6.7.x86_64) so you may<br>
> consider to<br>
> try upgrading the package and see if the latest provided in your<br>
> distribution helps by any way if that is possible.<br>
> <br>
> 3. You may check if there is any update for the iLO devices and<br>
> see if<br>
> the updated version exhibits the same behavior with timeouts.<br>
> From logs<br>
> I cannot tell what version or device the fence_ilo4 is<br>
> communicating with.<br>
> <br>
> 4. If there is a more reliable way of triggering the<br>
> situation when the timeout with default 20s is observed you can<br>
> setup<br>
> network packet capture with tcpdump to see what kind of<br>
> communication is<br>
> happening during that time. This can help to establish the idea<br>
> if there<br>
> is any response from the iLO device while we wait which would<br>
> indicate<br>
> the iLO or network to be issue or if the data arrives fast and the<br>
> fence_ilo4 doesn't do anything.<br>
> - In first case that would point more to network or iLO<br>
> communication issue<br>
> - In second case that would be more likely issue with fence_ilo4 or<br>
> ipmitool that is used for communication<br>
> <br>
> NOTE: In case that you happen to have a subscription for your<br>
> systems<br>
> you can try also reaching technical support to look deeper on<br>
> collected<br>
> data. That way you can save time figuring out how to strip the<br>
> private<br>
> parts from data before sharing them here.<br>
> <br>
> ========================================================================<br>
> <br>
> --<br>
> Ondrej<br>
> <br>
<br>
</blockquote></div>
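<div dir="ltr"><div><br></div><div>PS: On re-checking the flags in a capture: since we already use tcpdump, the capture can simply be read back with -vv, which decodes the NA flags (printed like "Flags [solicited, override]"). The command below is only a rough sketch from memory; the file name is an example, and the ip6[40] byte offset assumes there are no IPv6 extension headers before the ICMPv6 header.</div><div><br></div><div># tcpdump -r na-capture.pcap -vv 'icmp6 and ip6[40] == 136'</div><div><br></div><div>After the fix, only "override" should show up for the gratuitous NAs sent on failover.</div></div>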