[ClusterLabs] VIP monitoring failing with Timed Out error
kgaillot at redhat.com
Wed Oct 28 13:53:19 EDT 2015
On 10/28/2015 03:51 AM, Pritam Kharat wrote:
> Hi All,
> I am facing one issue in my two node HA. When I stop pacemaker on ACTIVE
> node, it takes more time to stop and by this time VIP migration with other
> resources migration fails to STANDBY node. (I have seen same issue in
> ACTIVE node reboot case also)
I assume STANDBY in this case is just a description of the node's
purpose, and does not mean that you placed the node in pacemaker's
standby mode. If the node really is in standby mode, it can't run any
resources.
> Last change: Wed Oct 28 02:52:57 2015 via cibadmin on node-1
> Stack: corosync
> Current DC: node-1 (1) - partition with quorum
> Version: 1.1.10-42f2063
> 2 Nodes configured
> 2 Resources configured
> Online: [ node-1 node-2 ]
> Full list of resources:
> resource (upstart:resource): Stopped
> vip (ocf::heartbeat:IPaddr2): Started node-2 (unmanaged) FAILED
> Migration summary:
> * Node node-1:
> * Node node-2:
> Failed actions:
> vip_stop_0 (node=node-2, call=-1, rc=1, status=Timed Out,
> last-rc-change=Wed Oct 28 03:05:24 2015, queued=0ms, exec=0ms): unknown error
> VIP monitoring is failing over here with a Timed Out error. What is the
> general reason for a timeout? I have set default-action-timeout=180s, which
> should be enough for monitoring.
180s should be far more than enough, so something must be going wrong.
Notice that it is the stop operation on the active node that is failing.
Normally in such a case, pacemaker would fence that node to be sure that
it is safe to bring the resource up elsewhere, but you have disabled
stonith. Fencing is important in failure recovery such as this, so it
would be a good idea to get it implemented.
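As a sketch of what that could look like with the pcs shell (the fence
agent, node names, IPMI addresses, and credentials below are all
placeholders -- parameter names also vary between fence agent versions,
so check your agent's man page):

```shell
# Hypothetical example: configure an IPMI fence device per node.
# ipaddr/login/passwd are placeholders for your real BMC details.
pcs stonith create fence-node-1 fence_ipmilan \
    pcmk_host_list=node-1 ipaddr=10.0.0.1 login=admin passwd=secret
pcs stonith create fence-node-2 fence_ipmilan \
    pcmk_host_list=node-2 ipaddr=10.0.0.2 login=admin passwd=secret

# Re-enable fencing cluster-wide once the devices test OK
pcs property set stonith-enabled=true
```

With fencing in place, a node that fails to stop the VIP gets fenced,
and the standby node can safely take the address over.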
> I have added an order constraint -> only when vip is started, start the
> other resources.
> Any clue to solve this problem ? Most of the time this VIP monitoring is
> failing with Timed Out error.
The "stop" in "vip_stop_0" means that the stop operation is what failed.
Have you seen timeouts on any other operations?
Look through the logs around the time of the failure, and try to see if
there are any indications as to why the stop failed.
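For example (log locations vary by distribution and corosync/pacemaker
configuration; /var/log/corosync.log is just a common default, so adjust
the path to wherever your cluster logs):

```shell
# Narrow down to the window around the failed stop
# (03:05:24 in the status output above)
grep -i 'vip_stop' /var/log/corosync.log

# The lrmd/crmd messages around that time often show why the agent
# timed out or what it was doing when it was killed
grep -E 'lrmd|crmd|IPaddr2' /var/log/corosync.log
```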
If you can set aside some time for testing, or have a test cluster that
exhibits the same issue, you can try unmanaging the resource in
pacemaker and then:
1. Try adding/removing the IP via normal system commands, and make sure
that works as expected.
2. Try running the resource agent manually (with any verbose option) to
start/stop/monitor the IP to see if you can reproduce the problem and
get more messages.
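The two steps above might look something like this (the address,
netmask, and interface are placeholders; substitute your VIP's actual
parameters from the cluster configuration):

```shell
# Take the resource out of pacemaker's control first
crm_resource --resource vip --meta --set-parameter is-managed \
    --parameter-value false

# Step 1: manage the address with plain system commands
ip addr add 192.0.2.10/24 dev eth0   # placeholder IP/interface
ip addr del 192.0.2.10/24 dev eth0

# Step 2: drive the resource agent by hand with the same parameters
# pacemaker passes, using bash -x for verbose tracing
export OCF_ROOT=/usr/lib/ocf
export OCF_RESKEY_ip=192.0.2.10
export OCF_RESKEY_cidr_netmask=24
export OCF_RESKEY_nic=eth0
bash -x /usr/lib/ocf/resource.d/heartbeat/IPaddr2 start
bash -x /usr/lib/ocf/resource.d/heartbeat/IPaddr2 monitor
bash -x /usr/lib/ocf/resource.d/heartbeat/IPaddr2 stop
```

If the manual stop hangs or errors, the trace output should point at the
command inside the agent that is misbehaving.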