[Pacemaker] Resources not failing over, ERROR: RecurringOp: Invalid recurring action ... wth name: 'start'

Wed Jul 2 00:07:32 EDT 2014

Hi,

I'm puppetizing resource deployment for pacemaker and corosync, and as part
of it, am creating a resource on one of three nodes of a cluster. The
problem is that I'm seeing RecurringOp errors after resource creation,
which are probably preventing the resource from failing over. The resource
creation itself seems to go through fine, but these RecurringOp errors
always show up afterwards (I'm pasting the output of two different commands
below):

***************************

vagrant@precise64b:/vagrant/puppet-environments/modules/f5_lbaas/tests$ sudo crm status
============
Last updated: Wed Jul  2 03:52:30 2014
Last change: Wed Jul  2 03:38:20 2014 via cibadmin on precise64b
Stack: cman
Current DC: precise64b - partition with quorum
Version: 1.1.6-9971ebba4494012a93c03b40a2c58ec0eb60f50c
3 Nodes configured, unknown expected votes
3 Resources configured.
============

Online: [ precise64b precise64c precise64a ]

 f5-lbaas-agent-10.6.143.121_resource (lsb:f5-lbaas-agent-10.6.143.121): Started precise64c
 f5-lbaas-agent-10.6.143.122_resource (lsb:f5-lbaas-agent-10.6.143.122): Started precise64b
 f5-lbaas-agent-10.6.143.123_resource (lsb:f5-lbaas-agent-10.6.143.123): Started precise64b

Failed actions:
    f5-lbaas-agent-10.6.143.120_resource_monitor_0 (node=precise64b, call=2, rc=5, status=complete): not installed
    f5-lbaas-agent-10.6.143.121_resource_monitor_0 (node=precise64b, call=3, rc=5, status=complete): not installed
    f5-lbaas-agent-10.6.143.122_resource_monitor_0 (node=precise64c, call=7, rc=5, status=complete): not installed
    f5-lbaas-agent-10.6.143.123_resource_monitor_0 (node=precise64c, call=8, rc=5, status=complete): not installed
    f5-lbaas-agent-10.6.143.120_resource_monitor_0 (node=precise64a, call=2, rc=5, status=complete): not installed
    f5-lbaas-agent-10.6.143.121_resource_monitor_0 (node=precise64a, call=3, rc=5, status=complete): not installed
    f5-lbaas-agent-10.6.143.122_resource_monitor_0 (node=precise64a, call=4, rc=5, status=complete): not installed
    f5-lbaas-agent-10.6.143.123_resource_monitor_0 (node=precise64a, call=5, rc=5, status=complete): not installed
vagrant@precise64b:/vagrant/puppet-environments/modules/f5_lbaas/tests$

***************************

vagrant@precise64b:/vagrant/puppet-environments/modules/f5_lbaas/tests$ sudo crm_verify -L -V
crm_verify[15183]: 2014/07/02_03:39:13 ERROR: RecurringOp: Invalid recurring action f5-lbaas-agent-10.6.143.121_resource-start-10 wth name: 'start'
crm_verify[15183]: 2014/07/02_03:39:13 ERROR: RecurringOp: Invalid recurring action f5-lbaas-agent-10.6.143.121_resource-stop-10 wth name: 'stop'
crm_verify[15183]: 2014/07/02_03:39:13 ERROR: RecurringOp: Invalid recurring action f5-lbaas-agent-10.6.143.122_resource-start-10 wth name: 'start'
crm_verify[15183]: 2014/07/02_03:39:13 ERROR: RecurringOp: Invalid recurring action f5-lbaas-agent-10.6.143.122_resource-stop-10 wth name: 'stop'
Errors found during check: config not valid
vagrant@precise64b:/vagrant/puppet-environments/modules/f5_lbaas/tests$

***************************

What do these errors signify? I found one email exchange on a pacemaker ML
suggesting that we shouldn't set intervals and timeouts on the start
operation, and likewise on stop, since that would mean that pacemaker would
attempt to restart the resource every x seconds, time out every y seconds,
and repeat that. (Link:
http://lists.linbit.com/pipermail/drbd-user/2011-September/016938.html)

My understanding was that the start interval would apply to restart
attempts once a resource is detected as being down. Nevertheless, I
removed these parameters and created a third resource (the first two were
created with these parameters), and I still see the same monitor-related
errors for the third resource
(f5-lbaas-agent-10.6.143.123_resource_monitor_0) in the sudo crm status
output. However, I don't understand why this resource doesn't show up in
the crm_verify -L -V output.
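
For reference, a minimal sketch of a per-node check related to the "not installed" monitor failures. It assumes that rc=5 is the standard "not installed" return code and that an lsb: resource is backed by an init script of the same name under /etc/init.d; both the helper name has_lsb_script and the default directory are assumptions for illustration, not something confirmed for this cluster.

```shell
#!/bin/sh
# has_lsb_script is a hypothetical helper, not part of pacemaker or crmsh.
has_lsb_script() {
    # $1 = init script name, $2 = directory to look in (defaults to /etc/init.d)
    script_dir="${2:-/etc/init.d}"
    if [ -x "$script_dir/$1" ]; then
        echo "present"
    else
        echo "missing"
    fi
}

# Meant to be run on each node; the script name is taken from the
# crm status output above.
has_lsb_script "f5-lbaas-agent-10.6.143.123"
```

If the script is reported missing on a node, that would be consistent with the probe (monitor_0) failing there, though it does not by itself prove that this is why failover is not happening.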

Here are the two CLIs I use to create the resources:

sudo crm configure primitive $pmk_res_name $pmk_cont_type:$service_name \
    op monitor interval="$mon_interval" timeout="$mon_timeout" \
    op start interval="$start_interval" timeout="$start_timeout" \
    op stop interval="$stop_interval" timeout="$stop_timeout"

sudo crm configure primitive $pmk_res_name $pmk_cont_type:$service_name \
    op monitor interval="$mon_interval" timeout="$mon_timeout"
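
For comparison, a minimal dry-run sketch of a third variant, in which start and stop keep their timeouts but use an interval of 0. It assumes the RecurringOp errors come from the non-zero interval on start and stop (the operation ids in the crm_verify output end in -start-10 and -stop-10). The helper build_primitive_cmd and the timeout values are hypothetical; the function only prints the command and does not touch the cluster.

```shell
#!/bin/sh
# Dry-run sketch: builds the crm command as a string instead of running it.
# build_primitive_cmd is a hypothetical helper, not part of crmsh.
build_primitive_cmd() {
    # $1 resource name, $2 resource class (e.g. lsb), $3 service name,
    # $4 monitor interval, $5 monitor timeout, $6 start timeout, $7 stop timeout
    printf 'crm configure primitive %s %s:%s' "$1" "$2" "$3"
    printf ' op monitor interval="%s" timeout="%s"' "$4" "$5"
    # start and stop are one-shot actions here: interval is 0, only the timeout is set
    printf ' op start interval="0" timeout="%s"' "$6"
    printf ' op stop interval="0" timeout="%s"\n' "$7"
}

# Placeholder values, chosen only for illustration.
build_primitive_cmd "f5-lbaas-agent-10.6.143.121_resource" lsb \
    "f5-lbaas-agent-10.6.143.121" 10s 20s 60s 60s
```

Only the monitor operation carries a non-zero interval in this variant, so it is the only recurring action that gets defined.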

The bottom line is that if I halt the VM running any of these resources,
the resource doesn't fail over to another VM. I'm not sure what the exact
cause is. Any help would be greatly appreciated!

Thanks,

Regards,

Vijay