[Pacemaker] primitive resource start timeout ignored by monitor-operation

Tue Apr 17 07:19:38 EDT 2012

On 04/17/2012 12:41 PM, Rainer Maier wrote:
> hi,
> 
> this is my first post to this list, therefore i ask you to be lenient towards me.
> 
> my problem is that i configured a primitive resource like this:
> 
> 
> primitive p_fuseesb_cellx ocf:thales:fuseesb \
>         params instance="cell1" fuseesb_home="/usr/lib/fuseesb" \
>             javahome="/usr/lib/jdk1.6.0_31" \
>         op monitor interval="60s" timeout="45s" \
>         op start interval="0" timeout="45s" \
>         op stop interval="0" timeout="20s"
> 
> Now when i start the resource from crm, it gets started, and immediately it gets
>  stopped and restarted. this happens in a cycle every 1-2 seconds.
> 
> inside the corosync-log i get the following output:
> 
> Apr 17 10:48:46 c6 lrmd: [28224]: info: operation start[1538] on p_fuseesb_cellx
>  for client 28227: pid 27751 exited with return code 0
> Apr 17 10:48:46 c6 crmd: [28227]: info: process_lrm_event: LRM operation 
>  p_fuseesb_cellx_start_0 (call=1538, rc=0, cib-update=1633, confirmed=true) ok
> Apr 17 10:48:46 c6 crmd: [28227]: info: do_lrm_rsc_op: Performing 
>  key=1:1017:0:084c0a4a-562e-46b2-bd13-df30802c2bd5 
>  op=p_fuseesb_cellx_monitor_60000 )
> Apr 17 10:48:46 c6 lrmd: [28224]: info: rsc:p_fuseesb_cellx monitor[1539] 
>  (pid 27830)
> Apr 17 10:48:46 c6 lrmd: [28224]: info: operation monitor[1539] on 
>  p_fuseesb_cellx for client 28227: pid 27830 exited with return code 7
> Apr 17 10:48:46 c6 crmd: [28227]: info: process_lrm_event: LRM operation 
>  p_fuseesb_cellx_monitor_60000 (call=1539, rc=7, cib-update=1634, 
>  confirmed=false) 
> not running
> Apr 17 10:48:46 c6 attrd: [28225]: info: attrd_ais_dispatch: Update 
>  relayed from c7
> Apr 17 10:48:46 c6 attrd: [28225]: info: attrd_local_callback: Expanded
>  fail-count-p_fuseesb_cellx=value++ to 225
> Apr 17 10:48:46 c6 attrd: [28225]: info: attrd_trigger_update: Sending flush
>  op to all hosts for: fail-count-p_fuseesb_cellx (225)
> Apr 17 10:48:46 c6 attrd: [28225]: info: attrd_perform_update: Sent update
>  2420: fail-count-p_fuseesb_cellx=225
> Apr 17 10:48:46 c6 attrd: [28225]: info: attrd_ais_dispatch: Update relayed
>  from c7
> Apr 17 10:48:46 c6 attrd: [28225]: info: attrd_trigger_update: Sending flush 
>  op to all hosts for: last-failure-p_fuseesb_cellx (1334652551)
> Apr 17 10:48:46 c6 attrd: [28225]: info: attrd_perform_update: Sent update 
>  2422: last-failure-p_fuseesb_cellx=1334652551
> Apr 17 10:48:46 c6 lrmd: [28224]: info: cancel_op: operation monitor[1539] 
>  on p_fuseesb_cellx for client 28227, its parameters: CRM_meta_name=[monitor] 
> crm_feature_set=[3.0.1] fuseesb_home=[/usr/lib/fuseesb] 
>  CRM_meta_timeout=[45000] CRM_meta_interval=[60000] 
>  javahome=[/usr/lib/jdk1.6.0_31] instance=[cell1]  
> cancelled
> Apr 17 10:48:46 c6 crmd: [28227]: info: do_lrm_rsc_op: Performing 
>  key=2:1019:0:084c0a4a-562e-46b2-bd13-df30802c2bd5 op=p_fuseesb_cellx_stop_0 )
> Apr 17 10:48:46 c6 lrmd: [28224]: info: rsc:p_fuseesb_cellx stop[1540] 
>  (pid 27897)
> Apr 17 10:48:46 c6 crmd: [28227]: info: process_lrm_event: LRM operation 
>  p_fuseesb_cellx_monitor_60000 (call=1539, status=1, cib-update=0, 
>  confirmed=true) 
> Cancelled
> Apr 17 10:48:46 c6 lrmd: [28224]: info: RA output: 
>  (p_fuseesb_cellx:stop:stdout) Stop FUSE ESB: fuse-esb
> 
> 
> from what i can see, the monitor-operation is started immediately after the 
> start-operation. as the start-operation is not finished, the monitor detects 
> that it's not running, and therefore the resource gets immediately stopped 
> and restarted - the cycle starts from the beginning.
> 
> what i don't understand is, why does pacemaker ignore the timeouts defined?

You have already identified the problem correctly: your resource agent
returns from start too early, before the service is actually running.
As this is your own RA, it should be quite easy for you to fix that.

The timeouts for start and stop are only the maximum time to wait for a
response from the resource agent. If it returns earlier, fine.
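
As a rough sketch of what such a fix could look like, assuming the RA is
a shell script: the start action launches the service and then polls its
own monitor check, returning only once that check passes. The function
names and the START_CMD/STATUS_CMD hooks below are placeholders for
illustration, not taken from the actual ocf:thales:fuseesb agent.

```shell
#!/bin/sh
# Sketch of a start action that blocks until the monitor check passes.

# OCF return codes used here.
OCF_SUCCESS=0
OCF_ERR_GENERIC=1
OCF_NOT_RUNNING=7

# Hypothetical hooks: replace with the real launch and health-check commands.
: "${START_CMD:=true}"
: "${STATUS_CMD:=true}"

fuseesb_monitor() {
    # Return 0 when the service is up, 7 (not running) otherwise.
    if $STATUS_CMD >/dev/null 2>&1; then
        return $OCF_SUCCESS
    fi
    return $OCF_NOT_RUNNING
}

fuseesb_start() {
    # Already running: nothing to do.
    fuseesb_monitor && return $OCF_SUCCESS

    # Launch the service; report a generic error if the launch itself fails.
    $START_CMD || return $OCF_ERR_GENERIC

    # Poll until the monitor succeeds. There is no timeout of our own here:
    # if this loop runs past the "op start" timeout, the cluster aborts
    # the action and records a start failure.
    while ! fuseesb_monitor; do
        sleep 1
    done
    return $OCF_SUCCESS
}
```

With this structure the start timeout of 45s becomes meaningful: if the
service is not up within that time, the start itself is treated as
failed, instead of the first monitor reporting rc=7.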

There is a workaround for "buggy" scripts: you could add a "start-delay"
to the monitor operation. But it is better to fix your script.
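
Applied to the configuration from the original post, the workaround
might look like the following. The value of 30s is only an assumed
placeholder; it would have to be at least as long as the service really
needs to come up.

```
primitive p_fuseesb_cellx ocf:thales:fuseesb \
        params instance="cell1" fuseesb_home="/usr/lib/fuseesb" \
            javahome="/usr/lib/jdk1.6.0_31" \
        op monitor interval="60s" timeout="45s" start-delay="30s" \
        op start interval="0" timeout="45s" \
        op stop interval="0" timeout="20s"
```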

Regards,
Andreas

-- 
Need help with Pacemaker?
http://www.hastexo.com/now

> 
> regards
> Rainer
> 
> 
