[ClusterLabs] corosync taking almost 30 secs to detect node failure in case of kernel panic

Wed Jan 10 16:43:26 UTC 2018

On Wed, 2018-01-10 at 12:43 +0530, ashutosh tiwari wrote:
> Hi,
> 
> We have a two-node cluster running in active/standby mode with
> IPMI fencing configured.

Be aware that using on-board IPMI as the only fencing method is
problematic -- if the host loses power, the IPMI will not respond, and
the cluster will be unable to recover.

> When the active node has a kernel panic, the standby node takes
> around 30 secs to detect the failure, which delays the standby node
> taking over the active role.
> 
> We have the totem token timeout set to 10000 msecs.
> Please let us know if any other configuration settings control
> membership detection.

The logs should show what's taking up the time. Corosync should
recognize the node is lost around the token timeout, then pacemaker has
to contact the IPMI and wait for a successful response before
recovering. It could be that the IPMI takes that long to respond, or
there may be something else causing issues.
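
As a rough illustration of where the time can go, here is a minimal
sketch of the time budget. It rests on assumptions not stated in this
thread: that consensus is left at its documented default of 1.2 * token
(see corosync.conf(5)), that the new membership forms roughly
token + consensus after the failure, and a made-up IPMI response time.
The function names are hypothetical; check the real numbers against
your own logs.

```python
# Rough failover time budget; all values are in milliseconds.
TOKEN_MS = 10000  # totem token timeout reported in this thread


def default_consensus_ms(token_ms):
    # Assumption: corosync.conf(5) documents the default consensus
    # timeout as 1.2 * token when it is not set explicitly.
    return token_ms * 12 // 10


def failover_budget_ms(token_ms, fence_ms, consensus_ms=None):
    """Estimate the time from node failure until recovery can start."""
    if consensus_ms is None:
        consensus_ms = default_consensus_ms(token_ms)
    # Assumption: membership changes after token + consensus, and
    # recovery begins only once fencing has reported success.
    return token_ms + consensus_ms + fence_ms


# fence_ms=8000 is a hypothetical IPMI response time, not a measurement.
membership_ms = TOKEN_MS + default_consensus_ms(TOKEN_MS)
budget_ms = failover_budget_ms(TOKEN_MS, fence_ms=8000)
print("membership change after ~%d ms" % membership_ms)
print("recovery can start after ~%d ms" % budget_ms)
```

Under those assumptions, a 10000 msec token alone accounts for about
22 secs before the membership changes, so comparing the log timestamps
against this estimate should show whether the remainder is spent in
fencing or somewhere else.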

> 
> Software versions:
> 
> centos 6.7
> corosync-1.4.7-5.el6.x86_64
> pacemaker-1.1.14-8.el6.x86_64
> 
> Thanks and Regards,
> Ashutosh Tiwari
-- 
Ken Gaillot <kgaillot at redhat.com>