[ClusterLabs] Issue in fence_ilo4 with IPv6 ILO IPs
Rohit Saini
rohitsaini111.forum at gmail.com
Tue Apr 9 01:20:18 EDT 2019
Hi Ondrej,
Yes, you are right. This issue was specific to the floating IPs, not the local
IPs.
After becoming master, I was sending a "Neighbor Advertisement" (NA) message for
my floating IPs. This was a raw message that I constructed myself, so I was the
one setting the flags in it.
Please find attached "image1", which shows the format of the NA message.
Attached "image2" is a packet capture; as you can see there, both the "Override"
and "Solicited" flags are set. As part of the solution, only "Override" is now
set.
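
For illustration, an unsolicited NA with only the Override flag set can be
generated along these lines with scapy. This is only a rough sketch; the
interface name and MAC address below are placeholders, not our actual values.

    # Rough sketch: send an unsolicited Neighbor Advertisement for the floating
    # IPv6 address with only the Override flag set (S=0, O=1).
    from scapy.all import Ether, IPv6, ICMPv6ND_NA, ICMPv6NDOptDstLLAddr, sendp

    IFACE    = "eth0"                    # placeholder interface
    FLOAT_IP = "fd00:1061:37:9021::"     # floating address that just moved here
    NEW_MAC  = "00:11:22:33:44:55"       # placeholder MAC of the new owner

    pkt = (Ether(src=NEW_MAC, dst="33:33:00:00:00:01") /   # all-nodes multicast MAC
           IPv6(src=FLOAT_IP, dst="ff02::1") /             # all-nodes multicast address
           ICMPv6ND_NA(tgt=FLOAT_IP, R=0, S=0, O=1) /      # Solicited clear, Override set
           ICMPv6NDOptDstLLAddr(lladdr=NEW_MAC))           # target link-layer address option
    sendp(pkt, iface=IFACE, verbose=False)
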
I hope this answers your questions. Please let me know if you have any further queries.
Thanks,
Rohit
On Mon, Apr 8, 2019 at 6:13 PM Ondrej <ondrej-clusterlabs at famera.cz> wrote:
> On 4/5/19 8:18 PM, Rohit Saini wrote:
> > *Further update on this:*
> > This issue is resolved now. The iLO was discarding the "Neighbor
> > Advertisement" (NA) because the Solicited flag was set in the NA message,
> > and hence it was not updating its local neighbor table.
> > As per the RFC, the Solicited flag should be set in an NA message only when
> > it is a response to a Neighbor Solicitation.
> > After clearing the Solicited flag in the NA message, the iLO started
> > updating its local neighbor cache.
>
> Hi Rohit,
>
> Sounds great that after the change you get consistent behaviour. As I have
> not worked with IPv6 for quite some time, I wonder how you disabled the
> 'Solicited' flag. Was this done on the OS (cluster node) or on the iLO? My
> guess is the OS, but I have no idea how that can be accomplished. Can you
> share which setting you changed to accomplish this? :)
>
> One additional note: the observation here is that you are using the
> "floating IP" that relocates to the other machine, while the cluster
> configuration does not seem to contain any IPaddr2 resource representing
> this address. I would guess that a cluster without the floating address
> would not have this issue, as it would use the addresses assigned to the
> nodes, and therefore the mapping between IP address and MAC address would
> not change even when the fence_ilo4 resources move between nodes. If the
> intention is to use the floating address in this cluster, I would suggest
> checking whether the issue also disappears when the floating address is
> not used or is disabled, to see how fence_ilo4 communicates. I think there
> might be a way in the routing tables to set which IPv6 address should
> communicate with the iLO IPv6 address, so that you get consistent
> behaviour instead of relying on the floating IP address; a rough sketch of
> that idea follows below.
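>
> Just as an untested sketch of that idea (the prefix, interface, and node
> address below are placeholders, not taken from your setup), pinning the
> preferred source address for traffic towards the iLO could look roughly
> like this:
>
>     # Untested sketch: replace the route to the iLO prefix with an explicit
>     # preferred source address, so traffic to the iLO does not originate
>     # from the floating IP. All values are placeholders.
>     import subprocess
>
>     ILO_PREFIX = "fd00:1061:37:9002::/64"   # placeholder prefix of the iLO network
>     NODE_ADDR  = "fd00:1061:37:9002::10"    # placeholder fixed address of this node
>     IFACE      = "eth0"                     # placeholder interface
>
>     subprocess.run(["ip", "-6", "route", "replace", ILO_PREFIX,
>                     "dev", IFACE, "src", NODE_ADDR], check=True)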
>
> Anyway, I'm glad the mystery is resolved.
>
> --
> Ondrej
>
> >
> > On Fri, Apr 5, 2019 at 2:23 PM Rohit Saini
> > <rohitsaini111.forum at gmail.com> wrote:
> >
> > Hi Ondrej,
> > Finally found some lead on this. We started tcpdump on my machine to
> > understand the IPMI traffic. Attaching the capture for your reference.
> > fd00:1061:37:9021:: is my floating IP and fd00:1061:37:9002:: is my
> > ILO IP.
> > When resource movement happens, we initiate the "Neighbor Advertisement"
> > for fd00:1061:37:9021:: (which is on the new machine now) so that peers
> > can update their neighbor tables and start communicating with the new MAC
> > address.
> > It looks like the iLO is not updating its neighbor table, as it is still
> > responding to the older MAC.
> > After some time, a "Neighbor Solicitation" happens and the iLO updates its
> > neighbor table. Only then does the iLO become reachable and start
> > responding towards the new MAC address.
> >
> > My iLO firmware is 2.60. We will retry the scenario after upgrading the
> > firmware.
> >
> > To verify this theory, after resource movement I flushed the local
> > neighbor table, due to which the "Neighbor Solicitation" was initiated
> > early and the delay in getting the iLO response was not seen.
> > This fixed the issue.
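> >
> > Roughly, the verification looked like this (a sketch only; the credentials
> > are placeholders, and the ipmitool call is the same kind of power-status
> > query fence_ilo4 runs, as seen in the verbose log excerpt further down):
> >
> >     # Sketch of the verification step: flush the local IPv6 neighbor cache
> >     # so a fresh "Neighbor Solicitation" goes out immediately, then repeat
> >     # the power-status query and check that the iLO answers without delay.
> >     import subprocess
> >
> >     ILO_ADDR = "fd00:1061:37:9002::"   # iLO address from the stonith config
> >     USER, PASSWD = "xyz", "xyz"        # placeholders
> >
> >     # ignore the exit status in case the cache was already empty
> >     subprocess.run(["ip", "-6", "neigh", "flush", "all"], check=False)
> >     subprocess.run(["/usr/bin/ipmitool", "-I", "lanplus", "-H", ILO_ADDR,
> >                     "-p", "623", "-U", USER, "-P", PASSWD,
> >                     "-L", "ADMINISTRATOR", "chassis", "power", "status"],
> >                    check=True)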
> >
> > We are now more interested in understanding why the iLO could not update
> > its neighbor table on receiving the "Neighbor Advertisement". FYI, the
> > Override flag in the "Neighbor Advertisement" is already set.
> >
> > Thanks,
> > Rohit
> >
> > On Thu, Apr 4, 2019 at 8:37 AM Ondrej <ondrej-clusterlabs at famera.cz> wrote:
> >
> > On 4/3/19 6:10 PM, Rohit Saini wrote:
> > > Hi Ondrej,
> > > Please find my reply below:
> > >
> > > 1.
> > > *Stonith configuration:*
> > > [root at orana ~]# pcs config
> > > Resource: fence-uc-orana (class=stonith type=fence_ilo4)
> > >   Attributes: delay=0 ipaddr=fd00:1061:37:9002:: lanplus=1 login=xyz passwd=xyz pcmk_host_list=orana pcmk_reboot_action=off
> > >   Meta Attrs: failure-timeout=3s
> > >   Operations: monitor interval=5s on-fail=ignore (fence-uc-orana-monitor-interval-5s)
> > >               start interval=0s on-fail=restart (fence-uc-orana-start-interval-0s)
> > > Resource: fence-uc-tigana (class=stonith type=fence_ilo4)
> > >   Attributes: delay=10 ipaddr=fd00:1061:37:9001:: lanplus=1 login=xyz passwd=xyz pcmk_host_list=tigana pcmk_reboot_action=off
> > >   Meta Attrs: failure-timeout=3s
> > >   Operations: monitor interval=5s on-fail=ignore (fence-uc-tigana-monitor-interval-5s)
> > >               start interval=0s on-fail=restart (fence-uc-tigana-start-interval-0s)
> > >
> > > Fencing Levels:
> > >
> > > Location Constraints:
> > > Ordering Constraints:
> > >   start fence-uc-orana then promote unicloud-master (kind:Mandatory)
> > >   start fence-uc-tigana then promote unicloud-master (kind:Mandatory)
> > > Colocation Constraints:
> > >   fence-uc-orana with unicloud-master (score:INFINITY) (rsc-role:Started) (with-rsc-role:Master)
> > >   fence-uc-tigana with unicloud-master (score:INFINITY) (rsc-role:Started) (with-rsc-role:Master)
> > >
> > >
> > > 2. This is seen randomly. Since I am using colocation, stonith
> > > resources are stopped and started on the new master. At that time,
> > > starting the stonith resources takes a variable amount of time.
> > > No other IPv6 issues are seen on the cluster nodes.
> > >
> > > 3. fence_agent version
> > >
> > > [root at orana ~]# rpm -qa|grep fence-agents-ipmilan
> > > fence-agents-ipmilan-4.0.11-66.el7.x86_64
> > >
> > >
> > > *NOTE:*
> > > Both IPv4 and IPv6 are configured on my ILO, with "iLO Client
> > > Applications use IPv6 first" turned on.
> > > Attaching corosync logs also.
> > >
> > > Thanks, increasing the timeout to 60 worked. But that is not exactly
> > > what I am looking for. I need to know the exact reason behind the delay
> > > in starting these IPv6 stonith resources.
> > >
> > > Regards,
> > > Rohit
> >
> > Hi Rohit,
> >
> > Thank you for response.
> >
> > From the configuration it is clear that we are using IP addresses
> > directly, so the DNS resolution issue can be ruled out. There are no
> > messages from fence_ilo4 that would indicate the reason why it timed out,
> > so we cannot tell yet what caused the issue. I see that you have most
> > probably enabled PCMK_debug=stonith-ng (or PCMK_debug=yes).
> >
> > It is nice that increasing the timeout worked, but as said in the
> > previous email it may just mask the real reason why the monitor/start
> > operation takes longer.
> >
> > > Both IPv4 and IPv6 are configured on my ILO, with "iLO Client
> > > Applications use IPv6 first" turned on.
> > This seems to me to be more related to SNMP communication, which we
> > don't use with fence_ilo4 as far as I know. We use ipmitool on port
> > 623/udp.
> > https://support.hpe.com/hpsc/doc/public/display?docId=emr_na-a00026111en_us&docLocale=en_US#N104B2
> >
> > > 2. This is seen randomly. Since I am using colocation, stonith
> > > resources are stopped and started on the new master. At that time,
> > > starting the stonith resources takes a variable amount of time.
> > This is a good observation, which leads me to question whether the iLO
> > has any kind of session limit set for the user that is used here. If
> > there is a session limit, it may be worth increasing it and testing
> > whether the same delay can still be observed. One situation where this
> > can happen is when one node communicates with the iLO and, during that
> > time, communication from the other node needs to happen while the limit
> > is 1 connection. The relocation of the resource from one node to another
> > might fit this, but this is just speculation, and the fastest way to
> > prove or reject it would be to increase the limit, if there is one, and
> > test it.
> >
> > # What more can be done to figure out what is causing the delay?
> >
> > 1. fence_ilo4 can be configured with the attribute 'verbose=1' to print
> > additional information when it is run. The output looks similar to the
> > example below and seems to provide timestamps, which is great, as we
> > should be able to see when each command was run. I don't have a testing
> > machine on which to run fence_ilo4, so the example below just shows how
> > it looks when it fails with a connection timeout.
> >
> > Apr 03 12:34:11 [4025] fastvm-centos-7-6-31 stonith-ng: notice: stonith_action_async_done: Child process 4252 performing action 'monitor' timed out with signal 15
> > Apr 03 12:34:11 [4025] fastvm-centos-7-6-31 stonith-ng: warning: log_action: fence_ilo4[4252] stderr: [ 2019-04-03 12:33:51,193 INFO: Executing: /usr/bin/ipmitool -I lanplus -H fe80::f6bd:8a67:7eb5:214f -p 623 -U xyz -P [set] -L ADMINISTRATOR chassis power status ]
> > Apr 03 12:34:11 [4025] fastvm-centos-7-6-31 stonith-ng: warning: log_action: fence_ilo4[4252] stderr: [ ]
> >
> > # pcs stonith update fence-uc-orana verbose=1
> >
> > Note: the above shows that some private data end up in the logs, so if
> > you have something interesting to share, make sure to strip out the
> > sensitive data first.
> >
> > 2. The version of fence-agents-ipmilan is not the latest when compared
> > to my CentOS 7.6 system (fence-agents-ipmilan-4.2.1-11.el7_6.7.x86_64),
> > so you may consider upgrading the package and seeing whether the latest
> > version provided in your distribution helps in any way, if that is
> > possible.
> >
> > 3. You may check whether there is any update for the iLO devices and
> > see if the updated version exhibits the same timeout behavior. From the
> > logs I cannot tell what version or device fence_ilo4 is communicating
> > with.
> >
> > 4. If there is a more reliable way to trigger the situation where the
> > timeout with the default 20s is observed, you can set up a network packet
> > capture with tcpdump to see what kind of communication is happening
> > during that time (see the sketch after this list). This can help
> > establish whether there is any response from the iLO device while we
> > wait, or whether the data arrives quickly and fence_ilo4 does not do
> > anything with it.
> > - In the first case, that would point more to a network or iLO
> > communication issue.
> > - In the second case, that would more likely be an issue with fence_ilo4
> > or the ipmitool that is used for communication.
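> >
> > The capture itself can be done with plain tcpdump, or, just as a rough
> > sketch of the same idea (the interface name is a placeholder), with a
> > small script like this:
> >
> >     # Rough sketch: capture IPMI (623/udp) and ICMPv6 (including Neighbor
> >     # Discovery) traffic during the monitor/start operation, then save it
> >     # for later inspection, e.g. in Wireshark.
> >     from scapy.all import sniff, wrpcap
> >
> >     pkts = sniff(iface="eth0",                    # placeholder interface
> >                  filter="udp port 623 or icmp6",  # IPMI and ICMPv6/ND traffic
> >                  timeout=60)                      # capture window in seconds
> >     wrpcap("ilo-fence-debug.pcap", pkts)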
> >
> > NOTE: In case you happen to have a subscription for your systems, you
> > can also try reaching technical support to look deeper into the collected
> > data. That way you can save the time of figuring out how to strip the
> > private parts from the data before sharing them here.
> >
> >
> ========================================================================
> >
> > --
> > Ondrej
> >
>
>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: image1.PNG
Type: image/png
Size: 13634 bytes
Desc: not available
URL: <https://lists.clusterlabs.org/pipermail/users/attachments/20190409/538c1313/attachment-0002.png>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: image2.PNG
Type: image/png
Size: 42934 bytes
Desc: not available
URL: <https://lists.clusterlabs.org/pipermail/users/attachments/20190409/538c1313/attachment-0003.png>