[ClusterLabs] VIP monitoring failing with Timed Out error

Pritam Kharat pritam.kharat at oneconvergence.com
Thu Oct 29 05:02:31 EDT 2015


Hi Dejan

It gives the following output. I then tried *crm resource restart sc_vip* as
well, but no trace was found. Is there anything more I need to do?

root@sc-node-1:/var/lib/heartbeat# crm resource trace sc_vip stop
INFO: restart sc_vip to get the trace
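
A minimal sketch of how to look for trace files after the restart. It assumes
they are written under /var/lib/heartbeat/trace_ra, which is the usual default
for the resource-agents package but is not confirmed anywhere in this thread.
The function name list_traces is made up for illustration.

```shell
#!/bin/sh
# List resource agent trace files, sorted by path.
# ASSUMPTION: traces live under /var/lib/heartbeat/trace_ra; pass another
# directory as the first argument if your installation differs.
list_traces() (
    dir=${1:-/var/lib/heartbeat/trace_ra}
    if [ ! -d "$dir" ]; then
        echo "no trace directory at $dir" >&2
        exit 1
    fi
    find "$dir" -type f | sort
)

list_traces || true
```

If nothing is listed, the stop operation may simply not have run since tracing
was enabled.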

On Thu, Oct 29, 2015 at 2:10 PM, Dejan Muhamedagic <dejanmm at fastmail.fm>
wrote:

> Hi,
>
> On Thu, Oct 29, 2015 at 10:40:18AM +0530, Pritam Kharat wrote:
> > Thank you very much for the reply, Ken. I will try your suggested steps.
>
> If you cannot figure out from the logs why the stop operation
> times out, you can also try to trace the resource agent:
>
> # crm resource help trace
> # crm resource trace vip stop
>
> Then take a look at the trace or post it somewhere.
>
> Thanks,
>
> Dejan
>
> >
> > On Wed, Oct 28, 2015 at 11:23 PM, Ken Gaillot <kgaillot at redhat.com>
> > wrote:
> >
> > > On 10/28/2015 03:51 AM, Pritam Kharat wrote:
> > > > Hi All,
> > > >
> > > > I am facing an issue in my two-node HA cluster. When I stop pacemaker on
> > > > the ACTIVE node, it takes a long time to stop, and during that time
> > > > migration of the VIP and the other resources to the STANDBY node fails.
> > > > (I have seen the same issue when the ACTIVE node is rebooted.)
> > >
> > > I assume STANDBY in this case is just a description of the node's
> > > purpose, and does not mean that you placed the node in pacemaker's
> > > standby mode. If the node really is in standby mode, it can't run any
> > > resources.
> > >
> > > > Last change: Wed Oct 28 02:52:57 2015 via cibadmin on node-1
> > > > Stack: corosync
> > > > Current DC: node-1 (1) - partition with quorum
> > > > Version: 1.1.10-42f2063
> > > > 2 Nodes configured
> > > > 2 Resources configured
> > > >
> > > >
> > > > Online: [ node-1 node-2 ]
> > > >
> > > > Full list of resources:
> > > >
> > > >  resource (upstart:resource): Stopped
> > > >  vip (ocf::heartbeat:IPaddr2): Started node-2 (unmanaged) FAILED
> > > >
> > > > Migration summary:
> > > > * Node node-1:
> > > > * Node node-2:
> > > >
> > > > Failed actions:
> > > >     vip_stop_0 (node=node-2, call=-1, rc=1, status=Timed Out,
> > > >     last-rc-change=Wed Oct 28 03:05:24 2015, queued=0ms, exec=0ms): unknown error
> > > >
> > > > The VIP monitor is failing here with a Timed Out error. What is the
> > > > usual reason for a timeout? I have set default-action-timeout=180secs,
> > > > which should be enough for monitoring.
> > >
> > > 180s should be far more than enough, so something must be going wrong.
> > > Notice that it is the stop operation on the active node that is failing.
> > > Normally in such a case, pacemaker would fence that node to be sure that
> > > it is safe to bring it up elsewhere, but you have disabled stonith.
> > >
> > > Fencing is important in failure recovery such as this, so it would be a
> > > good idea to try to get it implemented.
> > >
> > > > I have added an order constraint: the other resources start only after
> > > > the vip has started.
> > > > Any clue how to solve this problem? Most of the time this VIP monitoring
> > > > is failing with a Timed Out error.
> > >
> > > The "stop" in "vip_stop_0" means that the stop operation is what failed.
> > > Have you seen timeouts on any other operations?
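
To make the naming in "vip_stop_0" concrete: Pacemaker operation keys have the
form <resource>_<operation>_<interval in ms>, with interval 0 for a
non-recurring action. Below is a small sketch that splits such a key. The
helper name split_op_key is made up for illustration, and it does not handle
operations whose own names contain an underscore (migrate_to, migrate_from).

```shell
#!/bin/sh
# Split an operation key of the form <resource>_<operation>_<interval>.
# Parsing starts from the right because resource IDs such as sc_vip
# contain underscores themselves.
split_op_key() (
    key=$1
    interval=${key##*_}   # text after the last underscore
    rest=${key%_*}        # everything before the last underscore
    op=${rest##*_}        # operation name
    rsc=${rest%_*}        # resource ID
    printf 'resource=%s operation=%s interval=%s\n' "$rsc" "$op" "$interval"
)

split_op_key vip_stop_0
split_op_key sc_vip_monitor_10000
```

So a failed monitor would show up under a key ending in _monitor_<interval>,
not as vip_stop_0.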
> > >
> > > Look through the logs around the time of the failure, and try to see if
> > > there are any indications as to why the stop failed.
> > >
> > > If you can set aside some time for testing or have a test cluster that
> > > exhibits the same issue, you can try unmanaging the resource in
> > > pacemaker, then:
> > >
> > > 1. Try adding/removing the IP via normal system commands, and make sure
> > > that works.
> > >
> > > 2. Try running the resource agent manually (with any verbose option) to
> > > start/stop/monitor the IP to see if you can reproduce the problem and
> > > get more messages.
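
The two manual checks above can be sketched as a dry-run script. The address
192.0.2.10/24, the interface eth0 and the agent path are assumptions for
illustration, not values from this cluster, and the helper names run and
manual_vip_checks are made up. By default the script only prints the commands;
set DRY_RUN=0 as root on a test node to execute them.

```shell
#!/bin/sh
# HYPOTHETICAL values: replace with the real VIP, prefix length and interface.
VIP=192.0.2.10
PREFIX=24
NIC=eth0
# Usual location of the agent; it may differ between distributions.
AGENT=/usr/lib/ocf/resource.d/heartbeat/IPaddr2

# Print each command; execute it only when DRY_RUN=0.
run() {
    echo "+ $*"
    if [ "${DRY_RUN:-1}" = "0" ]; then "$@"; fi
}

manual_vip_checks() {
    # Take the resource out of pacemaker's control first.
    run crm resource unmanage vip
    # 1. Add and remove the address with plain system commands.
    run ip addr add "$VIP/$PREFIX" dev "$NIC"
    run ip addr del "$VIP/$PREFIX" dev "$NIC"
    # 2. Run the agent by hand; sh -x prints every command it executes.
    for action in start monitor stop; do
        run env OCF_ROOT=/usr/lib/ocf \
            OCF_RESKEY_ip="$VIP" \
            OCF_RESKEY_cidr_netmask="$PREFIX" \
            OCF_RESKEY_nic="$NIC" \
            sh -x "$AGENT" "$action"
    done
    # Hand the resource back to pacemaker.
    run crm resource manage vip
}

manual_vip_checks
```

Timing the stop action by hand (for example with time) should show whether the
agent itself is slow or whether the delay comes from somewhere else.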
> > >
> > > _______________________________________________
> > > Users mailing list: Users at clusterlabs.org
> > > http://clusterlabs.org/mailman/listinfo/users
> > >
> > > Project Home: http://www.clusterlabs.org
> > > > Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> > > Bugs: http://bugs.clusterlabs.org
> > >
> >
> >
> >
> > --
> > Thanks and Regards,
> > Pritam Kharat.
>



-- 
Thanks and Regards,
Pritam Kharat.