[ClusterLabs] VIP monitoring failing with Timed Out error

Pritam Kharat pritam.kharat at oneconvergence.com
Thu Oct 29 05:57:54 EDT 2015


Hi Ken,

When I ran ocf-tester to test the IPaddr2 agent,

ocf-tester -n sc_vip -o ip=192.168.20.188 -o cidr_netmask=24 -o nic=eth0 /usr/lib/ocf/resource.d/heartbeat/IPaddr2

I got this error in test_command monitor: ERROR: Setup problem: couldn't find
command: ip. I verified that the ip command is present on the node, but I still
get this error. What might be the reason for it? Is this okay?

Running: export export  OCF_RESOURCE_INSTANCE=sc_vip
OCF_RESKEY_ip='192.168.20.188' OCF_RESKEY_cidr_netmask='24'
OCF_RESKEY_nic='eth0'; bash /usr/lib/ocf/resource.d/heartbeat/IPaddr2
monitor 2>&1 > /dev/null

command_output: + : /usr/lib/ocf/lib/heartbeat + .
/usr/lib/ocf/lib/heartbeat/ocf-shellfuncs ++ unset LC_ALL ++ export LC_ALL
++ unset LANGUAGE ++ export LANGUAGE +++ basename
/usr/lib/ocf/resource.d/heartbeat/IPaddr2 ++ __SCRIPT_NAME=IPaddr2 ++ '['
-z /usr/lib/ocf ']' ++ '[' /usr/lib/ocf/lib/heartbeat =
/usr/lib/ocf/resource.d/heartbeat ']' ++ : /usr/lib/ocf/lib/heartbeat ++ .
/usr/lib/ocf/lib/heartbeat/ocf-binaries +++
PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/sbin:/bin:/usr/sbin:/usr/bin
+++
PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/sbin:/bin:/usr/sbin:/usr/bin:/usr/ucb
+++ export PATH +++ : mawk +++ : /bin/grep -E +++ : +++ : mail +++ :
/bin/ping +++ : /bin/bash +++ : /usr/bin/test +++ : /usr/bin/test +++ :
basename +++ : blockdev +++ : cat +++ : fsck +++ : fuser +++ : getent +++ :
grep +++ : ifconfig +++ : iptables +++ : ip +++ : mdadm +++ : modprobe +++
: mount +++ : msgfmt +++ : netstat +++ : perl +++ : python +++ : raidstart
+++ : raidstop +++ : route +++ : umount +++ : reboot +++ : poweroff +++ :
wget +++ : whoami +++ : strings +++ : scp +++ : ssh +++ : swig +++ : gzip
+++ : tar +++ : md5 +++ : drbdadm +++ : drbdsetup ++ .
/usr/lib/ocf/lib/heartbeat/ocf-returncodes +++ OCF_SUCCESS=0 +++
OCF_ERR_GENERIC=1 +++ OCF_ERR_ARGS=2 +++ OCF_ERR_UNIMPLEMENTED=3 +++
OCF_ERR_PERM=4 +++ OCF_ERR_INSTALLED=5 +++ OCF_ERR_CONFIGURED=6 +++
OCF_NOT_RUNNING=7 +++ OCF_RUNNING_MASTER=8 +++ OCF_FAILED_MASTER=9 ++ .
/usr/lib/ocf/lib/heartbeat/ocf-directories +++ prefix=/usr +++
exec_prefix=/usr +++ : /etc/init.d +++ : /etc/ha.d +++ : /etc/ha.d/rc.d +++
: /etc/ha.d/conf +++ : /etc/ha.d/ha.cf +++ : /var/lib/heartbeat +++ :
/var/run/resource-agents +++ : /var/run/heartbeat/rsctmp +++ :
/var/lib/heartbeat/fifo +++ : /usr/lib/heartbeat +++ : /usr/sbin +++ :
%Y/%m/%d_%T +++ : /dev/null +++ : /etc/ha.d/resource.d +++ :
/usr/share/doc/heartbeat +++ : IPaddr2 +++ : /var/run/ +++ :
/var/lock/subsys/ ++ . /usr/lib/ocf/lib/heartbeat/ocf-rarun ++ : 0 ++
__ocf_set_defaults monitor ++ __OCF_ACTION=monitor ++ unset LANG ++
LC_ALL=C ++ export LC_ALL ++ '[' -z '' ']' ++ : 0 ++ '[' '!' -d
/usr/lib/ocf ']' ++ '[' -z '' ']' ++ : IPaddr2 ++ '[' -z '' ']' ++ : We are
being invoked as an init script. ++ : Fill in some things with reasonable
values. ++ : sc_vip ++ return 0 + OCF_RESKEY_lvs_support_default=false +
OCF_RESKEY_clusterip_hash_default=sourceip-sourceport +
OCF_RESKEY_unique_clone_address_default=false +
OCF_RESKEY_arp_interval_default=200 + OCF_RESKEY_arp_count_default=5 +
OCF_RESKEY_arp_bg_default=true + OCF_RESKEY_arp_mac_default=ffffffffffff +
: false + : sourceip-sourceport + : false + : 200 + : 5 + : true + :
ffffffffffff + SENDARP=/usr/lib/heartbeat/send_arp +
FINDIF=/usr/lib/heartbeat/findif + VLDIR=/var/run/resource-agents +
SENDARPPIDDIR=/var/run/resource-agents +
CIP_lockfile=/var/run/resource-agents/IPaddr2-CIP-192.168.20.188 +
ocf_is_true false + case "$1" in + false + case $__OCF_ACTION in +
ip_validate + check_binary ip + have_binary ip + '[' 1 = 1 ']' + false +
'[' 7 = 7 ']' + ocf_log err 'Setup problem: couldn'\''t find command: ip' +
'[' 2 -lt 2 ']' + __OCF_PRIO=err + shift + __OCF_MSG='Setup problem:
couldn'\''t find command: ip' + case "${__OCF_PRIO}" in + __OCF_PRIO=ERROR
+ '[' ERROR = DEBUG ']' + ha_log 'ERROR: Setup problem: couldn'\''t find
command: ip' + local loglevel + '[' none = '' ']' + tty + '[' x = x0 -a x =
xdebug ']' + '[' '' ']' + echo 'ERROR: Setup problem: couldn'\''t find
command: ip' ERROR: Setup problem: couldn't find command: ip + return 0 +
exit 5
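
For comparison, the same monitor action can be run by hand outside ocf-tester
(a rough sketch based on the "Running:" line above; the OCF_* values are just
the ones from my configuration):

export OCF_ROOT=/usr/lib/ocf                # assumed, matching the trace above
export OCF_RESOURCE_INSTANCE=sc_vip
export OCF_RESKEY_ip=192.168.20.188 OCF_RESKEY_cidr_netmask=24 OCF_RESKEY_nic=eth0
bash -x /usr/lib/ocf/resource.d/heartbeat/IPaddr2 monitor
echo "rc=$?"                                # 0 = running, 7 = not running, 5 = install problem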



On Thu, Oct 29, 2015 at 2:32 PM, Pritam Kharat <
pritam.kharat at oneconvergence.com> wrote:

> Hi Dejan
>
> It is giving the following info. Then I tried crm resource restart sc_vip
> too, but no trace was found. Is there anything more I need to do apart from this?
>
> root at sc-node-1:/var/lib/heartbeat# crm resource trace sc_vip stop
> INFO: restart sc_vip to get the trace
>
> On Thu, Oct 29, 2015 at 2:10 PM, Dejan Muhamedagic <dejanmm at fastmail.fm>
> wrote:
>
>> Hi,
>>
>> On Thu, Oct 29, 2015 at 10:40:18AM +0530, Pritam Kharat wrote:
>> > Thank you very much, Ken, for the reply. I will try your suggested steps.
>>
>> If you cannot figure out from the logs why the stop operation
>> times out, you can also try to trace the resource agent:
>>
>> # crm resource help trace
>> # crm resource trace vip stop
>>
>> Then take a look at the trace or post it somewhere.
>>
>> Thanks,
>>
>> Dejan
>>
>> >
>> > On Wed, Oct 28, 2015 at 11:23 PM, Ken Gaillot <kgaillot at redhat.com>
>> > wrote:
>> >
>> > > On 10/28/2015 03:51 AM, Pritam Kharat wrote:
>> > > > Hi All,
>> > > >
>> > > > I am facing one issue in my two-node HA setup. When I stop pacemaker on
>> > > > the ACTIVE node, it takes a long time to stop, and by that time the
>> > > > migration of the VIP and the other resources to the STANDBY node fails.
>> > > > (I have seen the same issue when the ACTIVE node is rebooted.)
>> > >
>> > > I assume STANDBY in this case is just a description of the node's
>> > > purpose, and does not mean that you placed the node in pacemaker's
>> > > standby mode. If the node really is in standby mode, it can't run any
>> > > resources.
>> > >
>> > > > Last change: Wed Oct 28 02:52:57 2015 via cibadmin on node-1
>> > > > Stack: corosync
>> > > > Current DC: node-1 (1) - partition with quorum
>> > > > Version: 1.1.10-42f2063
>> > > > 2 Nodes configured
>> > > > 2 Resources configured
>> > > >
>> > > >
>> > > > Online: [ node-1 node-2 ]
>> > > >
>> > > > Full list of resources:
>> > > >
>> > > >  resource (upstart:resource): Stopped
>> > > >  vip (ocf::heartbeat:IPaddr2): Started node-2 (unmanaged) FAILED
>> > > >
>> > > > Migration summary:
>> > > > * Node node-1:
>> > > > * Node node-2:
>> > > >
>> > > > Failed actions:
>> > > >     vip_stop_0 (node=node-2, call=-1, rc=1, status=Timed Out,
>> > > > last-rc-change=Wed Oct 28 03:05:24 2015, queued=0ms, exec=0ms): unknown error
>> > > >
>> > > > The VIP monitor is failing here with a Timed Out error. What is the
>> > > > general reason for a timeout? I have kept default-action-timeout=180secs,
>> > > > which should be enough for monitoring.
>> > >
>> > > 180s should be far more than enough, so something must be going wrong.
>> > > Notice that it is the stop operation on the active node that is failing.
>> > > Normally in such a case, pacemaker would fence that node to be sure that
>> > > it is safe to bring it up elsewhere, but you have disabled stonith.
>> > >
>> > > Fencing is important in failure recovery such as this, so it would be a
>> > > good idea to try to get it implemented.
>> > >
>> > > > I have added an order constraint -> the other resources start only after
>> > > > vip has started.
>> > > > Any clue to solve this problem? Most of the time this VIP monitoring is
>> > > > failing with a Timed Out error.
>> > >
>> > > The "stop" in "vip_stop_0" means that the stop operation is what
>> failed.
>> > > Have you seen timeouts on any other operations?
>> > >
>> > > Look through the logs around the time of the failure, and try to see
>> if
>> > > there are any indications as to why the stop failed.
>> > >
>> > > If you can set aside some time for testing or have a test cluster that
>> > > exhibits the same issue, you can try unmanaging the resource in
>> > > pacemaker, then:
>> > >
>> > > 1. Try adding/removing the IP via normal system commands, and make sure
>> > > that works.
>> > >
>> > > 2. Try running the resource agent manually (with any verbose option) to
>> > > start/stop/monitor the IP to see if you can reproduce the problem and
>> > > get more messages.
>> > >
>> >
>> >
>> >
>> > --
>> > Thanks and Regards,
>> > Pritam Kharat.
>>
>>
>
>
>
> --
> Thanks and Regards,
> Pritam Kharat.
>



-- 
Thanks and Regards,
Pritam Kharat.

