[ClusterLabs Developers] [ClusterLabs] Issue in fence_ilo4 with IPv6 ILO IPs

Ondrej ondrej-clusterlabs at famera.cz
Tue Apr 2 01:51:56 UTC 2019


On 3/31/19 5:40 AM, Rohit Saini wrote:
> Looking for some help on this.
> 
> Thanks,
> Rohit

Hi Rohit,

As a good start to figuring out what is happening here, can you please 
provide more detailed information, such as:

1. What is the configuration of the stonith device when using IPv4 and 
when using IPv6? ('pcs stonith show --full' - you can obfuscate the 
username and password in that output; the main point is whether you are 
using a hostname or an IPv4/IPv6 address here.)

2. What does 'sometimes' mean when you say it happens with IPv6? Is there 
any pattern to when it happens (for example every night around 3/4 am, 
when there is more traffic on the network, when you test XXX service, 
etc.), or does it look to be happening randomly? Are there any other IPv6 
issues on the system, unrelated to the cluster, at the time the timeout 
is observed?

3. Are there any messages from fence_ilo4 in the logs 
(/var/log/pacemaker.log, /var/log/cluster/corosync.log, 
/var/log/messages, ...) around the time the timeout is reported 
that would suggest what could be happening?

4. Which version of fence_ilo4 are you using?
# rpm -qa | grep fence-agents-ipmilan

===
To give some answers to your questions with the information provided so far:
 > 1. Why is it happening only for IPv6 ILO devices? Is this some known
 > issue?
Based on the data provided it is not clear where the issue is. It could 
be DNS resolution, it could be a network issue, ...
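To isolate those two possibilities you can test name resolution and IPv6 reachability outside the cluster. This is only a sketch; the hostname, address, and credentials below are placeholders for your environment, and 'ipmitool' over IPv6 requires a reasonably recent version:

```shell
# If the stonith device is configured by hostname, time the IPv6 lookup
time getent ahostsv6 ilo-hostname.example.com

# Check basic IPv6 reachability of the iLO
ping6 -c 3 <ilo-ipv6-address>

# IPMI over LAN uses UDP port 623; time a simple query to the iLO
time ipmitool -I lanplus -H <ilo-ipv6-address> -U USER -P PASS chassis power status
```

If the lookup or the IPMI query itself takes many seconds, the delay is outside pacemaker and the fence agent.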

 > 2. Can we increase the timeout period "exec=20006ms" to something else.
Yes, you can do that, and it may hide/"resolve" the issue if fence_ilo4 
can finish monitoring within the newly set timeout. You can give it a try 
and increase this to 40 seconds to see if that yields better results in 
your environment. While the default 20 seconds should be enough for the 
majority of environments, there might be something in your case that 
demands more time. Note that this approach might just effectively hide 
the underlying issue.
To increase the timeout you should increase it for both the 'start' and 
'monitor' operations, for example like this:

# pcs stonith update fence-uc-orana op start timeout=40s op monitor 
timeout=40s
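You can also time a monitor action directly, outside pacemaker, to see how long the agent actually needs. Again a sketch with placeholder address and credentials, using the standard fence agent options:

```shell
# Run one monitor action by hand and measure how long it takes
time fence_ilo4 --ip=<ilo-ipv6-address> --username=USER --password=PASS --action=monitor
```

If this regularly takes close to or over 20 seconds against the IPv6 address but not the IPv4 one, that would point at the network/IPMI layer rather than the cluster configuration.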

--
Ondrej

> 
> On Thu, Mar 28, 2019 at 11:24 AM Rohit Saini 
> <rohitsaini111.forum at gmail.com <mailto:rohitsaini111.forum at gmail.com>> 
> wrote:
> 
>     Hi All,
>     I am trying fence_ilo4 with same ILO device having IPv4 and IPv6
>     address. I see some discrepancy in both the behaviours:
> 
>     *1. When ILO has IPv4 address*
>     This is working fine and stonith resources are started immediately.
> 
>     *2. When ILO has IPv6 address*
>     Starting of stonith resources is taking more than 20 seconds sometime.
> 
>     *[root at tigana ~]# pcs status*
>     Cluster name: ucc
>     Stack: corosync
>     Current DC: tigana (version 1.1.16-12.el7-94ff4df) - partition with
>     quorum
>     Last updated: Wed Mar 27 00:01:37 2019
>     Last change: Wed Mar 27 00:01:19 2019 by root via cibadmin on orana
> 
>     2 nodes configured
>     4 resources configured
> 
>     Online: [ orana tigana ]
> 
>     Full list of resources:
> 
>       Master/Slave Set: unicloud-master [unicloud]
>           Masters: [ orana ]
>           Slaves: [ tigana ]
>       fence-uc-orana (stonith:fence_ilo4):   FAILED orana
>       fence-uc-tigana        (stonith:fence_ilo4):   Started orana
> 
>     Failed Actions:
>     * fence-uc-orana_start_0 on orana 'unknown error' (1): call=32,
>     status=Timed Out, exitreason='none',
>          last-rc-change='Wed Mar 27 00:01:17 2019', queued=0ms,
>     exec=20006ms *<<<<<<<*
> 
> 
>     *Queries:*
>     1. Why is it happening only for IPv6 ILO devices? Is this some known
>     issue?
>     2. Can we increase the timeout period "exec=20006ms" to something else.
> 
> 
>     Thanks,
>     Rohit



