[ClusterLabs Developers] [ClusterLabs] Issue in fence_ilo4 with IPv6 ILO IPs

Rohit Saini rohitsaini111.forum at gmail.com
Tue Apr 9 01:20:18 EDT 2019


Hi Ondrej,
Yes, you are right. This issue was specific to the floating IPs, not to the
local IPs.

After becoming master, I was sending a "Neighbor Advertisement" (NA) message
for my floating IPs. This was a raw message crafted by me, so I was the one
setting the flags in it.
Please find attached "image1", which shows the format of the NA message.
Attached "image2" is a packet capture; as you can see, both the "Override" and
"Solicited" flags were set. As part of the solution, now only "Override" is
set.
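
For illustration, an unsolicited NA with only the Override flag set can be
built along these lines (a minimal sketch using scapy, not my actual
implementation; the MAC addresses and the interface name are placeholders):

    from scapy.all import Ether, IPv6, ICMPv6ND_NA, ICMPv6NDOptDstLLAddr, sendp

    # Unsolicited Neighbor Advertisement for the floating IP:
    # R=0 (not a router), S=0 (not a reply to a solicitation), O=1 (override cache)
    na = (
        Ether(src="00:11:22:33:44:55", dst="33:33:00:00:00:01")   # all-nodes multicast MAC
        / IPv6(src="fd00:1061:37:9021::", dst="ff02::1")          # all-nodes multicast
        / ICMPv6ND_NA(tgt="fd00:1061:37:9021::", R=0, S=0, O=1)
        / ICMPv6NDOptDstLLAddr(lladdr="00:11:22:33:44:55")        # MAC that now owns the floating IP
    )
    sendp(na, iface="eth0")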

Hope this answers your questions. Please let me know if you have any further
queries.

Thanks,
Rohit

On Mon, Apr 8, 2019 at 6:13 PM Ondrej <ondrej-clusterlabs at famera.cz> wrote:

> On 4/5/19 8:18 PM, Rohit Saini wrote:
> > *Further update on this:*
> > This issue is resolved now. The iLO was discarding the "Neighbor Advertisement"
> > (NA) because the Solicited flag was set in the NA message; hence it was not
> > updating its local neighbor table.
> > As per the RFC (RFC 4861, Neighbor Discovery for IPv6), the Solicited flag
> > should be set in an NA message only when it is a response to a Neighbor
> > Solicitation.
> > After disabling the Solicited flag in the NA message, the iLO started updating
> > its local neighbor cache.
>
> Hi Rohit,
>
> Sounds great that after the change you get consistent behaviour. As I have
> not worked with IPv6 for quite some time, I wonder how you disabled the
> 'Solicited' flag. Was this done on the OS (cluster node) or on the iLO? My
> guess is the OS, but I have no idea how that can be accomplished.
> Can you share which setting you changed to accomplish this? :)
>
> One additional note: the observation here is that you are using the
> "floating IP" that relocates to the other machine, while the cluster
> configuration does not seem to contain any IPaddr2 resource that would
> represent this address. I would guess that a cluster without the
> floating address would not have this issue, as it would use the addresses
> assigned to the nodes, and therefore the mapping between IP address and
> MAC address would not change even when the fence_ilo4 resources move
> between nodes. If the intention is to use the floating address in this
> cluster, I would suggest checking whether the issue also goes away when
> the floating address is not used or is disabled, to see how fence_ilo4
> communicates. I think there might be a way in the routing tables to set
> which IPv6 address should communicate with the iLO IPv6 address, so you
> get consistent behaviour instead of using the floating IP address.
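>
> For illustration, a quick way to check which local IPv6 address the kernel
> will pick as the source towards the iLO is a small Python sketch like the
> one below (no packet is actually sent by a UDP connect(); the port number
> is arbitrary):
>
>     import socket
>
>     # Ask the kernel which source address it would use towards the iLO.
>     s = socket.socket(socket.AF_INET6, socket.SOCK_DGRAM)
>     s.connect(("fd00:1061:37:9002::", 623))
>     print(s.getsockname()[0])   # ideally a static node address, not the floating IP
>     s.close()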
>
> Anyway, I'm glad that the mystery is resolved.
>
> --
> Ondrej
>
> >
> > On Fri, Apr 5, 2019 at 2:23 PM Rohit Saini
> > <rohitsaini111.forum at gmail.com> wrote:
> >
> >     Hi Ondrej,
> >     Finally found some lead on this. We started tcpdump on my machine
> >     to understand the IPMI traffic. Attaching the capture for your
> >     reference.
> >     fd00:1061:37:9021:: is my floating IP and fd00:1061:37:9002:: is my
> >     iLO IP.
> >     When resource movement happens, we initiate a "Neighbor
> >     Advertisement" for fd00:1061:37:9021:: (which is now on the new
> >     machine) so that peers can update their neighbor tables and start
> >     communicating with the new MAC address.
> >     It looks like the iLO is not updating its neighbor table, as it is
> >     still responding to the older MAC.
> >     After some time, a "Neighbor Solicitation" happens and the iLO
> >     updates its neighbor table. The iLO then becomes reachable and starts
> >     responding to the new MAC address.
> >
> >     My iLO firmware is 2.60. We will try to reproduce the issue again
> >     after upgrading the firmware.
> >
> >     To verify this theory, after resource movement I flushed the local
> >     neighbor table, due to which a "Neighbor Solicitation" was initiated
> >     early and the delay in getting the iLO response was not seen.
> >     This fixed the issue.
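> >
> >     For completeness, that flush step can also be scripted; a minimal
> >     sketch (just an illustration, wrapping the iproute2 'ip' utility and
> >     assuming root privileges):
> >
> >         import subprocess
> >
> >         # Flush the local neighbour entry for the iLO address so that the
> >         # next IPMI packet triggers a fresh Neighbor Solicitation.
> >         subprocess.run(
> >             ["ip", "-6", "neigh", "flush", "to", "fd00:1061:37:9002::"],
> >             check=True,
> >         )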
> >
> >     We are now more interested in understanding why the iLO could not
> >     update its neighbor table on receiving the "Neighbor Advertisement".
> >     FYI, the Override flag in the "Neighbor Advertisement" is already set.
> >
> >     Thanks,
> >     Rohit
> >
> >     On Thu, Apr 4, 2019 at 8:37 AM Ondrej <ondrej-clusterlabs at famera.cz>
> >     wrote:
> >
> >         On 4/3/19 6:10 PM, Rohit Saini wrote:
> >          > Hi Ondrej,
> >          > Please find my reply below:
> >          >
> >          > 1.
> >          > *Stonith configuration:*
> >          > [root at orana ~]# pcs config
> >          >   Resource: fence-uc-orana (class=stonith type=fence_ilo4)
> >          >    Attributes: delay=0 ipaddr=fd00:1061:37:9002:: lanplus=1 login=xyz passwd=xyz pcmk_host_list=orana pcmk_reboot_action=off
> >          >    Meta Attrs: failure-timeout=3s
> >          >    Operations: monitor interval=5s on-fail=ignore (fence-uc-orana-monitor-interval-5s)
> >          >                start interval=0s on-fail=restart (fence-uc-orana-start-interval-0s)
> >          >   Resource: fence-uc-tigana (class=stonith type=fence_ilo4)
> >          >    Attributes: delay=10 ipaddr=fd00:1061:37:9001:: lanplus=1 login=xyz passwd=xyz pcmk_host_list=tigana pcmk_reboot_action=off
> >          >    Meta Attrs: failure-timeout=3s
> >          >    Operations: monitor interval=5s on-fail=ignore (fence-uc-tigana-monitor-interval-5s)
> >          >                start interval=0s on-fail=restart (fence-uc-tigana-start-interval-0s)
> >          >
> >          > Fencing Levels:
> >          >
> >          > Location Constraints:
> >          > Ordering Constraints:
> >          >    start fence-uc-orana then promote unicloud-master (kind:Mandatory)
> >          >    start fence-uc-tigana then promote unicloud-master (kind:Mandatory)
> >          > Colocation Constraints:
> >          >    fence-uc-orana with unicloud-master (score:INFINITY) (rsc-role:Started) (with-rsc-role:Master)
> >          >    fence-uc-tigana with unicloud-master (score:INFINITY) (rsc-role:Started) (with-rsc-role:Master)
> >          >
> >          >
> >          > 2. This is seen randomly. Since I am using colocation, stonith
> >          > resources are stopped and started on the new master. At that time,
> >          > starting the stonith resources takes a variable amount of time.
> >          > No other IPv6 issues are seen on the cluster nodes.
> >          >
> >          > 3. fence_agent version
> >          >
> >          > [root at orana ~]#  rpm -qa|grep  fence-agents-ipmilan
> >          > fence-agents-ipmilan-4.0.11-66.el7.x86_64
> >          >
> >          >
> >          > *NOTE:*
> >          > Both IPv4 and IPv6 are configured on my ILO, with "iLO Client
> >          > Applications use IPv6 first" turned on.
> >          > Attaching corosync logs also.
> >          >
> >          > Thanks, increasing the timeout to 60 worked. But that's not exactly
> >          > what I am looking for. I need to know the exact reason behind the
> >          > delay in starting these IPv6 stonith resources.
> >          >
> >          > Regards,
> >          > Rohit
> >
> >         Hi Rohit,
> >
> >         Thank you for response.
> >
> >         From the configuration it is clear that we are using IP addresses
> >         directly, so a DNS resolution issue can be ruled out. There are no
> >         messages from fence_ilo4 that would indicate a reason why it timed
> >         out, so we cannot tell yet what caused the issue. I see that you
> >         have most probably enabled PCMK_debug=stonith-ng (or PCMK_debug=yes).
> >
> >         It is nice that the increased timeout worked, but as said in the
> >         previous email it may just mask the real reason why it takes longer
> >         to do the monitor/start operation.
> >
> >           > Both IPv4 and IPv6 are configured on my ILO, with "iLO Client
> >           > Applications use IPv6 first" turned on.
> >         This seems to me to be more related to SNMP communication, which
> >         we don't use with fence_ilo4 as far as I know. We use ipmitool on
> >         port 623/udp.
> >
> >         https://support.hpe.com/hpsc/doc/public/display?docId=emr_na-a00026111en_us&docLocale=en_US#N104B2
> >
> >           > 2. This is seen randomly. Since I am using colocation, stonith
> >           > resources are stopped and started on the new master. At that time,
> >           > starting the stonith resources takes a variable amount of time.
> >         This is a good observation, which leads me to question whether the
> >         iLO has any kind of session limit set for the user that is used here.
> >         If there is any session limit, it may be worth trying to increase it
> >         and test whether the same delay can be observed. One situation where
> >         this can happen is when one node communicates with the iLO and, during
> >         that time, communication from the other node needs to happen while the
> >         limit is 1 connection. The relocation of the resource from one node to
> >         another might fit this, but this is just speculation, and the fastest
> >         way to prove/reject it would be to increase the limit, if there is one,
> >         and test it.
> >
> >         # What more can be done to figure out what is causing the delay?
> >
> >         1. fence_ilo4 can be configured with the attribute 'verbose=1' to
> >         print additional information when it is run. The output looks similar
> >         to the example below and seems to provide timestamps, which is great,
> >         as we should be able to see when which command was run. I don't have
> >         a testing machine on which to run fence_ilo4, so the example below
> >         just shows how it looks when it fails with a connection timeout.
> >
> >         Apr 03 12:34:11 [4025] fastvm-centos-7-6-31 stonith-ng: notice: stonith_action_async_done: Child process 4252 performing action 'monitor' timed out with signal 15
> >         Apr 03 12:34:11 [4025] fastvm-centos-7-6-31 stonith-ng: warning: log_action: fence_ilo4[4252] stderr: [ 2019-04-03 12:33:51,193 INFO: Executing: /usr/bin/ipmitool -I lanplus -H fe80::f6bd:8a67:7eb5:214f -p 623 -U xyz -P [set] -L ADMINISTRATOR chassis power status ]
> >         Apr 03 12:34:11 [4025] fastvm-centos-7-6-31 stonith-ng: warning: log_action: fence_ilo4[4252] stderr: [ ]
> >
> >         # pcs stonith update fence-uc-orana verbose=1
> >
> >         Note: The above shows that some private data are included in the
> >         logs, so in case you have something interesting to share, make sure
> >         to strip out the sensitive data.
> >
> >         2. The version of fence-agents-ipmilan is not the latest when
> >         compared to my CentOS 7.6 system
> >         (fence-agents-ipmilan-4.2.1-11.el7_6.7.x86_64), so you may consider
> >         upgrading the package and seeing whether the latest version provided
> >         in your distribution helps in any way, if that is possible.
> >
> >         3. You may check if there is any update for the iLO devices and
> >         see if the updated version exhibits the same behavior with timeouts.
> >         From the logs I cannot tell which version or device fence_ilo4 is
> >         communicating with.
> >
> >         4. If there is a more reliable way of triggering the situation in
> >         which the timeout with the default 20s is observed, you can set up a
> >         network packet capture with tcpdump to see what kind of communication
> >         is happening during that time. This can help establish whether there
> >         is any response from the iLO device while we wait, which would point
> >         to the iLO or the network being the issue, or whether the data arrives
> >         fast and fence_ilo4 doesn't do anything with it.
> >         - In the first case, that would point more to a network or iLO
> >         communication issue
> >         - In the second case, that would more likely be an issue with
> >         fence_ilo4 or ipmitool, which is used for the communication
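> >
> >         (Just as an illustration, a similar capture could also be scripted
> >         in Python with scapy; the interface name is a placeholder:)
> >
> >             from scapy.all import sniff, wrpcap
> >
> >             # Capture IPMI (623/udp) and ICMPv6 neighbour discovery traffic
> >             # around a resource move, then inspect the timing offline.
> >             pkts = sniff(iface="eth0", filter="udp port 623 or icmp6",
> >                          timeout=120)
> >             wrpcap("/tmp/ipmi-capture.pcap", pkts)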
> >
> >         NOTE: In case you happen to have a subscription for your systems,
> >         you can also try reaching technical support to look deeper into the
> >         collected data. That way you can save the time of figuring out how
> >         to strip the private parts from the data before sharing them here.
> >
> >
>  ========================================================================
> >
> >         --
> >         Ondrej
> >
>
>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: image1.PNG
Type: image/png
Size: 13634 bytes
Desc: not available
URL: <https://lists.clusterlabs.org/pipermail/developers/attachments/20190409/538c1313/attachment-0002.png>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: image2.PNG
Type: image/png
Size: 42934 bytes
Desc: not available
URL: <https://lists.clusterlabs.org/pipermail/developers/attachments/20190409/538c1313/attachment-0003.png>

