[Pacemaker] Why monitor fails in my RA

Fri Apr 27 10:18:38 EDT 2012

On 04/27/12 03:43, Andrew Beekhof wrote:

[cut]
>> My monitor function (simplified, with overhead removed and some
>> comments added) is:
>> redis_monitor() {
>>         # The score is 10 for master and 5 for slave
>>         CURSCORE=`$CRM_MASTER -G -q`
>>         logger "redis_monitor: score $CURSCORE"
>>         local state
>>         redis_state
>>
>>         # RET holds the current local redis state
>>         state=$(echo "${RET}" | cut -d':' -f2 | tr -d '\r')
>>
>>         if [ "${state}" = "master" ];then
>>                 $CRM_MASTER -v $CRM_MASTER_SCORE # score is 10
>>                 exit $OCF_RUNNING_MASTER
>>         fi
>>
>>         if [ "${state}" = "slave" ];then
>>                 $CRM_MASTER -v $CRM_SLAVE_SCORE # score is 5
>>                 exit $OCF_SUCCESS
>>         fi
>>
>>         # neither slave nor master, so the resource has failed
>>         $CRM_MASTER -l reboot -D
>>         # quoted so that an empty CURSCORE does not break the test
>>         if [ "$CURSCORE" = "$CRM_MASTER_SCORE" ];then
>>                 exit $OCF_FAILED_MASTER
>>         fi
>>
>>         exit $OCF_NOT_RUNNING
>> }
>
> Are you sure it's NOT_RUNNING?
> Could it also be running but generally failed?
The redis_state function sets the RET variable to master for a master,
to slave for a slave, and leaves it empty when redis is not running or
is hanging. If redis is neither master nor slave, it should be restarted.
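
To make the parsing concrete, here is a minimal sketch. It assumes RET
holds a line of the form "role:master" ending in a carriage return, as
in the output of "redis-cli info". The helper names parse_redis_role and
describe_role are hypothetical and exist only for this illustration;
they are not part of the RA.

```shell
#!/bin/sh
# Hypothetical helper: extract the role from a "role:<value>" line.
# Assumes the input looks like the "role:" line of `redis-cli info`,
# which ends in CR; an empty input yields an empty role.
parse_redis_role() {
    printf '%s' "$1" | cut -d':' -f2 | tr -d '\r\n'
}

# Map the parsed role to the conclusion described above.
describe_role() {
    case "$(parse_redis_role "$1")" in
        master) echo "master" ;;
        slave)  echo "slave" ;;
        *)      echo "restart" ;;  # empty or unknown: not running or hanging
    esac
}

describe_role "$(printf 'role:master\r')"
describe_role "$(printf 'role:slave\r')"
describe_role ""
```

Anything that is not exactly "master" or "slave" ends up in the restart
branch, which matches the intent described above.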

>> From my logs I know that the monitor function returned
>> OCF_FAILED_MASTER when the master was down, and then this error occurred:
>> redis-server:0_monitor_5000 (node=s1, call=16, rc=9, status=complete):
>> master (failed)
>>
>> After that, the failed master is not monitored on that node until I run
>> a cleanup:
>> #crm resource cleanup redis-server:0
>>
>>
>> My questions:
>> 1) What am I doing wrong? How can I fix this?
>> I've tried on-fail="restart" but this did not help.
>
> You'd need to supply more information (in the form of a hb_report tarball).
> An upgrade might not hurt either.
No new version for Debian in squeeze-backports :(

Report attached. I've changed the monitor a little and it now checks the
state using OCF_RESKEY_CRM_meta_role, but I still have the same problem.
The test scenario: the master runs on s1, and after a while I kill redis
on s1. S2 becomes master, and after almost 2 minutes I do the same on
s2 and kill the redis process. Redis on S1 becomes master. (All of that
is in the report.)

After killing redis on S2 this error occurs:
redis-server:0_monitor_5000 (node=s1, call=16, rc=9, status=complete): 
master (failed)
Now, if S2 becomes master, redis on that node is never monitored again
(if S2 is a slave for redis). It's very strange that this error never
happens when I kill redis for the first time on S1.

redis_monitor() {

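         local CURSTATE
         local state
         # One can use the (undocumented?) variable, which is set to:
         #OCF_RESKEY_CRM_meta_role=Slave
         #OCF_RESKEY_CRM_meta_role=Master
         # Quote the tr sets so the shell cannot glob-expand them.
         CURSTATE=$(echo "${OCF_RESKEY_CRM_meta_role}" | tr '[:upper:]' '[:lower:]')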

         logger "redis_monitor: current state: $CURSTATE"

         # check redis state
         redis_state
         state=$(echo "${RET}" | cut -d':' -f2 | tr -d '\r')
         logger  "redis_monitor: redis state $state"

         # CRM says redis is master:
         if [ "${CURSTATE}" = "master" ];then
                 if [ "${state}" = "master" ];then
                         logger "redis_monitor 1 $OCF_RUNNING_MASTER"
                         $CRM_MASTER -v $CRM_MASTER_SCORE
                         exit $OCF_RUNNING_MASTER
                 else
                         logger "redis_monitor: CRM says master but redis says other thing"
                         $CRM_MASTER -D
                         exit $OCF_FAILED_MASTER
                 fi
         fi

         # CRM says redis is slave:
         if [ "${CURSTATE}" = "slave" ];then
                 if [ "${state}" = "slave" ];then
                         logger "redis_monitor 2 $OCF_SUCCESS"
                         # TODO: in the future add extra tests, e.g. write/read a key, check that replication works, etc.
                         $CRM_MASTER -v $CRM_SLAVE_SCORE
                         exit $OCF_SUCCESS
                 else
                         logger "redis_monitor: CRM says slave but redis says other thing"
                         $CRM_MASTER -D
                         exit $OCF_NOT_RUNNING
                 fi

         fi

         # State not defined (not in master-slave state)
         if [ "${CURSTATE}" = "" ];then
                 if [ "${state}" = "" ];then
                         logger "redis_monitor pre-end $OCF_NOT_RUNNING"
                         $CRM_MASTER -D
                         exit $OCF_NOT_RUNNING
                 else
                         logger "redis_monitor pre-end $OCF_SUCCESS"
                         $CRM_MASTER -v $CRM_SLAVE_SCORE
                         exit  $OCF_SUCCESS
                 fi
         fi

         # It's impossible to get here but safe to keep it
         $CRM_MASTER -D
         logger "redis_monitor end $OCF_NOT_RUNNING"
         exit $OCF_NOT_RUNNING
}
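
The decision logic of the monitor above can be summarised as a small
table. The sketch below is only an illustration: monitor_rc is a
hypothetical helper that prints the code instead of exiting, and the
numeric values are the standard OCF return codes, which a real RA would
take from ocf-shellfuncs rather than define itself.

```shell
#!/bin/sh
# Standard OCF return codes (normally provided by ocf-shellfuncs).
OCF_SUCCESS=0
OCF_NOT_RUNNING=7
OCF_RUNNING_MASTER=8
OCF_FAILED_MASTER=9

# Hypothetical helper: given the role the CRM expects ($1) and the state
# redis reports ($2), print the code the monitor above would exit with.
monitor_rc() {
    crm_role=$1
    redis_state=$2
    case "$crm_role" in
        master)
            if [ "$redis_state" = "master" ]; then
                echo "$OCF_RUNNING_MASTER"
            else
                echo "$OCF_FAILED_MASTER"
            fi ;;
        slave)
            if [ "$redis_state" = "slave" ]; then
                echo "$OCF_SUCCESS"
            else
                echo "$OCF_NOT_RUNNING"
            fi ;;
        "")
            if [ -z "$redis_state" ]; then
                echo "$OCF_NOT_RUNNING"
            else
                echo "$OCF_SUCCESS"
            fi ;;
        *)
            echo "$OCF_NOT_RUNNING" ;;
    esac
}

monitor_rc master master
monitor_rc master ""
monitor_rc slave ""
```

The second call is the case from the error above: the CRM expects a
master, redis reports nothing, and the result is rc=9, "master (failed)".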

>
>>
>> 2) Using an older version of redis (2.3): if the master fails, redis hangs
>> for some time (21-24 seconds). Even if I set a higher timeout on the monitor
>> operations, it still times out after 20 seconds. Why?
>
> How did you set the timeout higher?
>
By setting:
default-action-timeout="60s"

I think that monitor timeout should be sufficient, but the operation was
stopped after 20 seconds. The log shows this:
Apr 26 16:25:37 SREVERXXX lrmd: [18777]: debug: on_msg_perform_op: add 
an operation operation monitor[3] on ocf::redis::redis-serv:0 for client 
18780, its parameters: vservers=[redis-2,redis-1] 
CRM_meta_master_max=[1] CRM_meta_timeout=[20000] CRM_meta_clone_max=[2] 
CRM_meta_master_node_max=[1] crm_feature_set=[3.0.1] 
CRM_meta_globally_unique=[false] masterip=[X.X.X.X] CRM_meta_clone=[0] 
CRM_meta_clone_node_max=[1] CRM_meta_notify=[false]  to the operation list.
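
The line above shows CRM_meta_timeout=[20000], so the operation was
scheduled with 20 seconds, not 60. A minimal sketch for pulling that
value out of an lrmd debug line follows. The helper name
effective_timeout_ms is hypothetical, and it assumes the message is on a
single line in the real log file (it is wrapped here only by the mail
client).

```shell
#!/bin/sh
# Hypothetical helper: read lrmd debug lines on stdin and print the
# CRM_meta_timeout value (in milliseconds) of each line that has one.
effective_timeout_ms() {
    sed -n 's/.*CRM_meta_timeout=\[\([0-9][0-9]*\)\].*/\1/p'
}

# Fragment of the log line quoted above.
line='CRM_meta_master_max=[1] CRM_meta_timeout=[20000] CRM_meta_clone_max=[2]'
echo "$line" | effective_timeout_ms
```

For the fragment above this prints 20000, which is the timeout that was
actually handed to the monitor operation.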

Another question: what value should the demote function return if the
node (master) is down? I return OCF_NOT_RUNNING and get this failure:
Failed actions:
     redis-server:0_demote_0 (node=s1, call=61, rc=7, status=complete): 
not running

--
Greg
-------------- next part --------------
A non-text attachment was scrubbed...
Name: report2.tar.bz2
Type: application/x-bzip2
Size: 65281 bytes
Desc: not available
URL: <https://lists.clusterlabs.org/pipermail/pacemaker/attachments/20120427/64ee1c91/attachment-0003.bz2>