<div dir="ltr">Try: crm ra test nginx lb02 start<br></div><div class="gmail_extra"><br><br><div class="gmail_quote">2013/10/22 Lucas Brown <span dir="ltr">&lt;<a href="mailto:lucas@locatrix.com" target="_blank">lucas@locatrix.com</a>&gt;</span><br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr">Hey guys,<div><br></div><div>I'm encountering a really strange problem testing failover of my ocf:heartbeat:nginx resource in my two-node cluster. I am able to manually migrate the resource between the nodes and that works fine, but I can't get the resource to run on one node after it has failed on the other. The strange part is that this only happens if the failure was on node 1. If I reproduce the failure on node 2, the resource correctly fails over to node 1.</div>
<div><br></div><div>no-quorum-policy is set to ignore, so that doesn't seem to be the issue. Some similar threads mentioned that start-failure-is-fatal=false may help, but it doesn't resolve it either. I have a more advanced configuration that includes a virtual IP and ping clones, and those parts seem to work fine; nginx even fails over correctly when its host goes offline completely. I just can't get the same behaviour when only the resource fails.</div>
<div><br></div><div>My test case:</div><div><br></div><div>>vim /etc/nginx/nginx.conf</div><div>>Insert invalid jargon and save</div><div>>service nginx restart</div><div><br></div><div>Expected outcome: Resource fails over to the other node upon monitor failure, in either direction between my 2 nodes.</div>
<div>Actual: Resource fails over correctly from node 2 -> node 1, but not from node 1 -> node 2.</div><div><br></div><div>This is my test configuration for reproducing the issue (to make sure my other stuff isn't interfering).</div>
<div>-----------------------</div><div><div>node $id="724150464" lb01</div><div>node $id="740927680" lb02</div><div>primitive nginx ocf:heartbeat:nginx \</div><div><span style="white-space:pre-wrap"> </span>params configfile="/etc/nginx/nginx.conf" \</div>
<div><span style="white-space:pre-wrap"> </span>op monitor interval="10s" timeout="30s" depth="0" \</div><div><span style="white-space:pre-wrap"> </span>op monitor interval="15s" timeout="30s" status10url="<a href="http://localhost/nginx_status" target="_blank">http://localhost/nginx_status</a>" depth="10"</div>
<div>property $id="cib-bootstrap-options" \</div><div><span style="white-space:pre-wrap"> </span>dc-version="1.1.10-42f2063" \</div><div><span style="white-space:pre-wrap"> </span>cluster-infrastructure="corosync" \</div>
<div><span style="white-space:pre-wrap"> </span>stonith-enabled="false" \</div><div><span style="white-space:pre-wrap"> </span>no-quorum-policy="ignore" \</div><div><span style="white-space:pre-wrap"> </span>start-failure-is-fatal="false" \</div>
<div><span style="white-space:pre-wrap"> </span>last-lrm-refresh="1382410708"</div><div>rsc_defaults $id="rsc-options" \</div><div><span style="white-space:pre-wrap"> </span>resource-stickiness="100"</div>
</div><div><br></div><div>This is what happens when I perform the test case on node lb02, and it correctly migrates/restarts the resource on lb01.</div><div><div>-----------------------</div><div>Oct 22 11:58:12 [694] lb02 pengine: warning: unpack_rsc_op: <span style="white-space:pre-wrap"> </span>Processing failed op monitor for nginx on lb02: not running (7)</div>
<div>Oct 22 11:58:12 [694] lb02 pengine: info: native_print: <span style="white-space:pre-wrap"> </span>nginx<span style="white-space:pre-wrap"> </span>(ocf::heartbeat:nginx):<span style="white-space:pre-wrap"> </span>Started lb02 FAILED </div>
<div>Oct 22 11:58:12 [694] lb02 pengine: info: RecurringOp: <span style="white-space:pre-wrap"> </span> Start recurring monitor (10s) for nginx on lb02</div><div>Oct 22 11:58:12 [694] lb02 pengine: info: RecurringOp: <span style="white-space:pre-wrap"> </span> Start recurring monitor (15s) for nginx on lb02</div>
<div>Oct 22 11:58:12 [694] lb02 pengine: notice: LogActions: <span style="white-space:pre-wrap"> </span>Recover nginx<span style="white-space:pre-wrap"> </span>(Started lb02)</div><div>Oct 22 11:58:12 [690] lb02 cib: info: cib_process_request: <span style="white-space:pre-wrap"> </span>Completed cib_query operation for section //cib/status//node_state[@id='740927680']//transient_attributes//nvpair[@name='fail-count-nginx']: No such device or address (rc=-6, origin=local/attrd/1038, version=0.252.2)</div>
<div>Oct 22 11:58:12 [692] lb02 lrmd: info: cancel_recurring_action: <span style="white-space:pre-wrap"> </span>Cancelling operation nginx_monitor_15000</div><div>Oct 22 11:58:12 [692] lb02 lrmd: info: cancel_recurring_action: <span style="white-space:pre-wrap"> </span>Cancelling operation nginx_monitor_10000</div>
<div>Oct 22 11:58:12 [692] lb02 lrmd: info: log_execute: <span style="white-space:pre-wrap"> </span>executing - rsc:nginx action:stop call_id:848</div><div>Oct 22 11:58:12 [695] lb02 crmd: info: process_lrm_event: <span style="white-space:pre-wrap"> </span>LRM operation nginx_monitor_15000 (call=839, status=1, cib-update=0, confirmed=true) Cancelled</div>
<div>Oct 22 11:58:12 [695] lb02 crmd: info: process_lrm_event: <span style="white-space:pre-wrap"> </span>LRM operation nginx_monitor_10000 (call=841, status=1, cib-update=0, confirmed=true) Cancelled</div><div>
Oct 22 11:58:12 [690] lb02 cib: info: cib_process_request: <span style="white-space:pre-wrap"> </span>Completed cib_query operation for section //cib/status//node_state[@id='740927680']//transient_attributes//nvpair[@name='last-failure-nginx']: OK (rc=0, origin=local/attrd/1041, version=0.252.3)</div>
<div>nginx[31237]:<span style="white-space:pre-wrap"> </span>2013/10/22_11:58:12 INFO: nginx is not running.</div><div>Oct 22 11:58:12 [692] lb02 lrmd: info: log_finished: <span style="white-space:pre-wrap"> </span>finished - rsc:nginx action:stop call_id:848 pid:31237 exit-code:0 exec-time:155ms queue-time:0ms</div>
<div>Oct 22 11:58:12 [695] lb02 crmd: notice: process_lrm_event: <span style="white-space:pre-wrap"> </span>LRM operation nginx_stop_0 (call=848, rc=0, cib-update=593, confirmed=true) ok</div><div>Oct 22 11:58:12 [694] lb02 pengine: info: unpack_rsc_op: <span style="white-space:pre-wrap"> </span>Operation monitor found resource nginx active on lb01</div>
<div>Oct 22 11:58:12 [694] lb02 pengine: warning: unpack_rsc_op: <span style="white-space:pre-wrap"> </span>Processing failed op monitor for nginx on lb02: not running (7)</div><div>Oct 22 11:58:12 [694] lb02 pengine: info: native_print: <span style="white-space:pre-wrap"> </span>nginx<span style="white-space:pre-wrap"> </span>(ocf::heartbeat:nginx):<span style="white-space:pre-wrap"> </span>Stopped </div>
<div>Oct 22 11:58:12 [694] lb02 pengine: info: get_failcount_full: <span style="white-space:pre-wrap"> </span>nginx has failed 1 times on lb02</div><div>Oct 22 11:58:12 [694] lb02 pengine: info: common_apply_stickiness: <span style="white-space:pre-wrap"> </span>nginx can fail 999999 more times on lb02 before being forced off</div>
<div>Oct 22 11:58:12 [694] lb02 pengine: info: RecurringOp: <span style="white-space:pre-wrap"> </span> Start recurring monitor (10s) for nginx on lb01</div><div>Oct 22 11:58:12 [694] lb02 pengine: info: RecurringOp: <span style="white-space:pre-wrap"> </span> Start recurring monitor (15s) for nginx on lb01</div>
<div>Oct 22 11:58:12 [694] lb02 pengine: notice: LogActions: <span style="white-space:pre-wrap"> </span>Start nginx<span style="white-space:pre-wrap"> </span>(lb01)</div></div><div><br></div><div><br></div>
<div>This is what happens when I try to go from lb01 -> lb02.</div><div>-----------------------<br></div><div><div>Oct 22 12:00:25 [694] lb02 pengine: warning: unpack_rsc_op: <span style="white-space:pre-wrap"> </span>Processing failed op monitor for nginx on lb01: not running (7)</div>
<div>Oct 22 12:00:25 [694] lb02 pengine: info: unpack_rsc_op: <span style="white-space:pre-wrap"> </span>Operation monitor found resource nginx active on lb02</div><div>Oct 22 12:00:25 [694] lb02 pengine: info: native_print: <span style="white-space:pre-wrap"> </span>nginx<span style="white-space:pre-wrap"> </span>(ocf::heartbeat:nginx):<span style="white-space:pre-wrap"> </span>Started lb01 FAILED </div>
<div>Oct 22 12:00:25 [694] lb02 pengine: info: RecurringOp: <span style="white-space:pre-wrap"> </span> Start recurring monitor (10s) for nginx on lb01</div><div>Oct 22 12:00:25 [694] lb02 pengine: info: RecurringOp: <span style="white-space:pre-wrap"> </span> Start recurring monitor (15s) for nginx on lb01</div>
<div>Oct 22 12:00:25 [694] lb02 pengine: notice: LogActions: <span style="white-space:pre-wrap"> </span>Recover nginx<span style="white-space:pre-wrap"> </span>(Started lb01)</div><div>Oct 22 12:00:25 [690] lb02 cib: info: cib_process_request: <span style="white-space:pre-wrap"> </span>Completed cib_query operation for section //cib/status//node_state[@id='740927680']//transient_attributes//nvpair[@name='fail-count-nginx']: No such device or address (rc=-6, origin=local/attrd/1046, version=0.253.12)</div>
<div>Oct 22 12:00:25 [690] lb02 cib: info: cib_process_request: <span style="white-space:pre-wrap"> </span>Completed cib_query operation for section //cib/status//node_state[@id='740927680']//transient_attributes//nvpair[@name='last-failure-nginx']: OK (rc=0, origin=local/attrd/1047, version=0.253.12)</div>
<div>Oct 22 12:00:25 [694] lb02 pengine: warning: unpack_rsc_op: <span style="white-space:pre-wrap"> </span>Processing failed op monitor for nginx on lb01: not running (7)</div><div>Oct 22 12:00:25 [694] lb02 pengine: info: unpack_rsc_op: <span style="white-space:pre-wrap"> </span>Operation monitor found resource nginx active on lb02</div>
<div>Oct 22 12:00:25 [694] lb02 pengine: info: native_print: <span style="white-space:pre-wrap"> </span>nginx<span style="white-space:pre-wrap"> </span>(ocf::heartbeat:nginx):<span style="white-space:pre-wrap"> </span>Stopped </div>
<div>Oct 22 12:00:25 [694] lb02 pengine: info: get_failcount_full: <span style="white-space:pre-wrap"> </span>nginx has failed 1 times on lb01</div><div>Oct 22 12:00:25 [694] lb02 pengine: info: common_apply_stickiness: <span style="white-space:pre-wrap"> </span>nginx can fail 999999 more times on lb01 before being forced off</div>
<div>Oct 22 12:00:25 [694] lb02 pengine: info: RecurringOp: <span style="white-space:pre-wrap"> </span> Start recurring monitor (10s) for nginx on lb01</div><div>Oct 22 12:00:25 [694] lb02 pengine: info: RecurringOp: <span style="white-space:pre-wrap"> </span> Start recurring monitor (15s) for nginx on lb01</div>
<div>Oct 22 12:00:25 [694] lb02 pengine: notice: LogActions: <span style="white-space:pre-wrap"> </span>Start nginx<span style="white-space:pre-wrap"> </span>(lb01)</div><div>Oct 22 12:00:25 [694] lb02 pengine: error: unpack_rsc_op: <span style="white-space:pre-wrap"> </span>Preventing nginx from re-starting anywhere in the cluster : operation start failed 'not configured' (rc=6)</div>
<div>Oct 22 12:00:25 [694] lb02 pengine: warning: unpack_rsc_op: <span style="white-space:pre-wrap"> </span>Processing failed op start for nginx on lb01: not configured (6)</div><div>Oct 22 12:00:25 [694] lb02 pengine: info: unpack_rsc_op: <span style="white-space:pre-wrap"> </span>Operation monitor found resource nginx active on lb02</div>
<div>Oct 22 12:00:25 [694] lb02 pengine: info: native_print: <span style="white-space:pre-wrap"> </span>nginx<span style="white-space:pre-wrap"> </span>(ocf::heartbeat:nginx):<span style="white-space:pre-wrap"> </span>Started lb01 FAILED </div>
<div>Oct 22 12:00:25 [694] lb02 pengine: info: get_failcount_full: <span style="white-space:pre-wrap"> </span>nginx has failed 1 times on lb01</div><div>Oct 22 12:00:25 [694] lb02 pengine: info: common_apply_stickiness: <span style="white-space:pre-wrap"> </span>nginx can fail 999999 more times on lb01 before being forced off</div>
<div>Oct 22 12:00:25 [694] lb02 pengine: info: native_color: <span style="white-space:pre-wrap"> </span>Resource nginx cannot run anywhere</div><div>Oct 22 12:00:25 [694] lb02 pengine: notice: LogActions: <span style="white-space:pre-wrap"> </span>Stop nginx<span style="white-space:pre-wrap"> </span>(lb01)</div>
<div>Oct 22 12:00:26 [690] lb02 cib: info: cib_process_request: <span style="white-space:pre-wrap"> </span>Completed cib_query operation for section //cib/status//node_state[@id='740927680']//transient_attributes//nvpair[@name='fail-count-nginx']: No such device or address (rc=-6, origin=local/attrd/1049, version=0.253.15)</div>
<div>Oct 22 12:00:26 [690] lb02 cib: info: cib_process_request: <span style="white-space:pre-wrap"> </span>Completed cib_query operation for section //cib/status//node_state[@id='740927680']//transient_attributes//nvpair[@name='fail-count-nginx']: No such device or address (rc=-6, origin=local/attrd/1050, version=0.253.15)</div>
<div>Oct 22 12:00:26 [694] lb02 pengine: error: unpack_rsc_op: <span style="white-space:pre-wrap"> </span>Preventing nginx from re-starting anywhere in the cluster : operation start failed 'not configured' (rc=6)</div>
<div>Oct 22 12:00:26 [694] lb02 pengine: warning: unpack_rsc_op: <span style="white-space:pre-wrap"> </span>Processing failed op start for nginx on lb01: not configured (6)</div><div>Oct 22 12:00:26 [694] lb02 pengine: info: unpack_rsc_op: <span style="white-space:pre-wrap"> </span>Operation monitor found resource nginx active on lb02</div>
<div>Oct 22 12:00:26 [694] lb02 pengine: info: native_print: <span style="white-space:pre-wrap"> </span>nginx<span style="white-space:pre-wrap"> </span>(ocf::heartbeat:nginx):<span style="white-space:pre-wrap"> </span>Stopped </div>
<div>Oct 22 12:00:26 [694] lb02 pengine: info: get_failcount_full: <span style="white-space:pre-wrap"> </span>nginx has failed INFINITY times on lb01</div><div>Oct 22 12:00:26 [694] lb02 pengine: warning: common_apply_stickiness: <span style="white-space:pre-wrap"> </span>Forcing nginx away from lb01 after 1000000 failures (max=1000000)</div>
<div>Oct 22 12:00:26 [694] lb02 pengine: info: native_color: <span style="white-space:pre-wrap"> </span>Resource nginx cannot run anywhere</div><div>Oct 22 12:00:26 [694] lb02 pengine: info: LogActions: <span style="white-space:pre-wrap"> </span>Leave nginx<span style="white-space:pre-wrap"> </span>(Stopped)</div>
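<div><br></div><div>A minimal sketch for pulling the failed operations and their return codes out of pengine excerpts like the two above. It assumes log lines in exactly the format shown here; the function name failed_ops and the regular expression are illustrative only and are not part of Pacemaker or crmsh.</div>

```python
import re

# Matches pengine lines such as:
#   "... unpack_rsc_op: Processing failed op start for nginx on lb01: not configured (6)"
# Groups: operation, resource, node, reason text, numeric return code.
FAILED_OP = re.compile(
    r"Processing failed op (\w+) for (\S+) on (\S+): (.+?) \((\d+)\)"
)


def failed_ops(log_text):
    """Return unique (op, resource, node, reason, rc) tuples in order of first appearance."""
    seen = []
    for match in FAILED_OP.finditer(log_text):
        op, rsc, node, reason, rc = match.groups()
        entry = (op, rsc, node, reason, int(rc))
        if entry not in seen:  # the policy engine repeats the same line on every run
            seen.append(entry)
    return seen


# Three lines copied from the lb01 excerpt above (whitespace normalised).
sample = """\
Oct 22 12:00:25 [694] lb02 pengine: warning: unpack_rsc_op: Processing failed op monitor for nginx on lb01: not running (7)
Oct 22 12:00:25 [694] lb02 pengine: error: unpack_rsc_op: Preventing nginx from re-starting anywhere in the cluster : operation start failed 'not configured' (rc=6)
Oct 22 12:00:25 [694] lb02 pengine: warning: unpack_rsc_op: Processing failed op start for nginx on lb01: not configured (6)
"""

for op, rsc, node, reason, rc in failed_ops(sample):
    print(f"{rsc} on {node}: {op} failed, {reason} (rc={rc})")
```

<div>On the lb01 excerpt this lists the monitor failure (rc=7) followed by a start failure with rc=6, which is the entry that the "Preventing nginx from re-starting anywhere in the cluster" error refers to.</div>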
</div><div><br></div><div>I can't for the life of me work out why this is happening. For whatever reason, in the node 1 -> node 2 case it seemingly at random decides that the resource can no longer run anywhere.</div><div><br></div><div>
And yes, I am making sure everything works before I start each test, so it's not a failure to run crm resource cleanup, etc.</div><div><br></div><div>I would really appreciate help on this, as I've been trying to debug it for a few days and have hit a wall.</div>
<div><br></div><div>Thanks,</div><div><br></div><div>Lucas</div></div>
<br>_______________________________________________<br>
Pacemaker mailing list: <a href="mailto:Pacemaker@oss.clusterlabs.org">Pacemaker@oss.clusterlabs.org</a><br>
<a href="http://oss.clusterlabs.org/mailman/listinfo/pacemaker" target="_blank">http://oss.clusterlabs.org/mailman/listinfo/pacemaker</a><br>
<br></blockquote></div><br><br clear="all"><br>-- <br>This is my life, and I live it for as long as God wills
</div>