[Pacemaker] Resource only failsover in one direction

Tue Oct 22 01:27:00 EDT 2013

Try: crm ra test nginx lb02 start


2013/10/22 Lucas Brown <lucas at locatrix.com>

> Hey guys,
>
> I'm encountering a really strange problem testing failover of my
> ocf:heartbeat:nginx resource in my 2-node cluster. I can manually
> migrate the resource between the nodes and that works fine, but I can't get
> the resource to run on one node after the other has encountered a
> failure. The strange part is that this only happens if the failure was on
> node 1. If I reproduce the failure on node 2, the resource correctly
> fails over to node 1.
>
> no-quorum-policy is ignore, so that doesn't seem to be the issue. Some
> similar threads mentioned that start-failure-is-fatal=false may help, but it
> doesn't resolve it either. I have a more advanced configuration that
> includes a virtual IP and ping clones, and those parts seem to work fine;
> nginx even fails over correctly when its host goes offline completely.
> I just can't get the same behaviour when only the resource fails.
>
> My test case:
>
> >vim /etc/nginx/nginx.conf
> >Insert invalid jargon and save
> >service nginx restart
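
For repeatable runs, the three manual steps above can be scripted. This is only a minimal sketch for a POSIX shell: break_config and restore_config are hypothetical helper names (not part of nginx or Pacemaker), and the appended line is arbitrary text that nginx should reject as an unknown directive.

```shell
#!/bin/sh
# Hypothetical helpers for reproducing the test case; these names are
# made up for illustration and are not shipped with nginx or Pacemaker.

# Append an invalid line to a config file, keeping a backup next to it.
break_config() {
    conf="$1"
    cp -p "$conf" "$conf.bak" || return 1
    printf '%s\n' 'this_is_not_a_valid_directive' >> "$conf"
}

# Put the original config back once the test is finished.
restore_config() {
    conf="$1"
    [ -f "$conf.bak" ] || return 1
    mv "$conf.bak" "$conf"
}
```

Usage would be something like "break_config /etc/nginx/nginx.conf && service nginx restart" on the node under test, then "restore_config /etc/nginx/nginx.conf" followed by a crm resource cleanup before the next run.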
>
> Expected outcome: The resource fails over to the other node upon monitor
> failure, in either direction between my 2 nodes.
> Actual: The resource fails over correctly from node 2 -> node 1, but not
> from node 1 -> node 2.
>
> This is my test configuration for reproducing the issue (to make sure my
> other stuff isn't interfering).
> -----------------------
> node $id="724150464" lb01
> node $id="740927680" lb02
> primitive nginx ocf:heartbeat:nginx \
>     params configfile="/etc/nginx/nginx.conf" \
>     op monitor interval="10s" timeout="30s" depth="0" \
>     op monitor interval="15s" timeout="30s" status10url="http://localhost/nginx_status" depth="10"
> property $id="cib-bootstrap-options" \
>     dc-version="1.1.10-42f2063" \
>     cluster-infrastructure="corosync" \
>     stonith-enabled="false" \
>     no-quorum-policy="ignore" \
>     start-failure-is-fatal="false" \
>     last-lrm-refresh="1382410708"
> rsc_defaults $id="rsc-options" \
>     resource-stickiness="100"
>
> This is what happens when I perform the test case on node lb02, and it
> correctly migrates/restarts the resource on lb01.
> -----------------------
> Oct 22 11:58:12 [694] lb02    pengine:  warning: unpack_rsc_op: Processing
> failed op monitor for nginx on lb02: not running (7)
> Oct 22 11:58:12 [694] lb02    pengine:     info: native_print: nginx
> (ocf::heartbeat:nginx): Started lb02 FAILED
> Oct 22 11:58:12 [694] lb02    pengine:     info: RecurringOp:  Start
> recurring monitor (10s) for nginx on lb02
> Oct 22 11:58:12 [694] lb02    pengine:     info: RecurringOp:  Start
> recurring monitor (15s) for nginx on lb02
> Oct 22 11:58:12 [694] lb02    pengine:   notice: LogActions: Recover nginx (Started
> lb02)
> Oct 22 11:58:12 [690] lb02        cib:     info: cib_process_request: Completed
> cib_query operation for section
> //cib/status//node_state[@id='740927680']//transient_attributes//nvpair[@name='fail-count-nginx']:
> No such device or address (rc=-6, origin=local/attrd/1038, version=0.252.2)
> Oct 22 11:58:12 [692] lb02       lrmd:     info: cancel_recurring_action: Cancelling
> operation nginx_monitor_15000
> Oct 22 11:58:12 [692] lb02       lrmd:     info: cancel_recurring_action: Cancelling
> operation nginx_monitor_10000
> Oct 22 11:58:12 [692] lb02       lrmd:     info: log_execute: executing -
> rsc:nginx action:stop call_id:848
> Oct 22 11:58:12 [695] lb02       crmd:     info: process_lrm_event: LRM
> operation nginx_monitor_15000 (call=839, status=1, cib-update=0,
> confirmed=true) Cancelled
> Oct 22 11:58:12 [695] lb02       crmd:     info: process_lrm_event: LRM
> operation nginx_monitor_10000 (call=841, status=1, cib-update=0,
> confirmed=true) Cancelled
> Oct 22 11:58:12 [690] lb02        cib:     info: cib_process_request: Completed
> cib_query operation for section
> //cib/status//node_state[@id='740927680']//transient_attributes//nvpair[@name='last-failure-nginx']:
> OK (rc=0, origin=local/attrd/1041, version=0.252.3)
> nginx[31237]: 2013/10/22_11:58:12 INFO: nginx is not running.
> Oct 22 11:58:12 [692] lb02       lrmd:     info: log_finished: finished -
> rsc:nginx action:stop call_id:848 pid:31237 exit-code:0 exec-time:155ms
> queue-time:0ms
> Oct 22 11:58:12 [695] lb02       crmd:   notice: process_lrm_event: LRM
> operation nginx_stop_0 (call=848, rc=0, cib-update=593, confirmed=true) ok
> Oct 22 11:58:12 [694] lb02    pengine:     info: unpack_rsc_op: Operation
> monitor found resource nginx active on lb01
> Oct 22 11:58:12 [694] lb02    pengine:  warning: unpack_rsc_op: Processing
> failed op monitor for nginx on lb02: not running (7)
> Oct 22 11:58:12 [694] lb02    pengine:     info: native_print: nginx
> (ocf::heartbeat:nginx): Stopped
> Oct 22 11:58:12 [694] lb02    pengine:     info: get_failcount_full: nginx
> has failed 1 times on lb02
> Oct 22 11:58:12 [694] lb02    pengine:     info: common_apply_stickiness: nginx
> can fail 999999 more times on lb02 before being forced off
> Oct 22 11:58:12 [694] lb02    pengine:     info: RecurringOp:  Start
> recurring monitor (10s) for nginx on lb01
> Oct 22 11:58:12 [694] lb02    pengine:     info: RecurringOp:  Start
> recurring monitor (15s) for nginx on lb01
> Oct 22 11:58:12 [694] lb02    pengine:   notice: LogActions: Start   nginx
> (lb01)
>
>
> This is what happens when I try to go from lb01 -> lb02.
> -----------------------
> Oct 22 12:00:25 [694] lb02    pengine:  warning: unpack_rsc_op: Processing
> failed op monitor for nginx on lb01: not running (7)
> Oct 22 12:00:25 [694] lb02    pengine:     info: unpack_rsc_op: Operation
> monitor found resource nginx active on lb02
> Oct 22 12:00:25 [694] lb02    pengine:     info: native_print: nginx
> (ocf::heartbeat:nginx): Started lb01 FAILED
> Oct 22 12:00:25 [694] lb02    pengine:     info: RecurringOp:  Start
> recurring monitor (10s) for nginx on lb01
> Oct 22 12:00:25 [694] lb02    pengine:     info: RecurringOp:  Start
> recurring monitor (15s) for nginx on lb01
> Oct 22 12:00:25 [694] lb02    pengine:   notice: LogActions: Recover nginx (Started
> lb01)
> Oct 22 12:00:25 [690] lb02        cib:     info: cib_process_request: Completed
> cib_query operation for section
> //cib/status//node_state[@id='740927680']//transient_attributes//nvpair[@name='fail-count-nginx']:
> No such device or address (rc=-6, origin=local/attrd/1046, version=0.253.12)
> Oct 22 12:00:25 [690] lb02        cib:     info: cib_process_request: Completed
> cib_query operation for section
> //cib/status//node_state[@id='740927680']//transient_attributes//nvpair[@name='last-failure-nginx']:
> OK (rc=0, origin=local/attrd/1047, version=0.253.12)
> Oct 22 12:00:25 [694] lb02    pengine:  warning: unpack_rsc_op: Processing
> failed op monitor for nginx on lb01: not running (7)
> Oct 22 12:00:25 [694] lb02    pengine:     info: unpack_rsc_op: Operation
> monitor found resource nginx active on lb02
> Oct 22 12:00:25 [694] lb02    pengine:     info: native_print: nginx
> (ocf::heartbeat:nginx): Stopped
> Oct 22 12:00:25 [694] lb02    pengine:     info: get_failcount_full: nginx
> has failed 1 times on lb01
> Oct 22 12:00:25 [694] lb02    pengine:     info: common_apply_stickiness: nginx
> can fail 999999 more times on lb01 before being forced off
> Oct 22 12:00:25 [694] lb02    pengine:     info: RecurringOp:  Start
> recurring monitor (10s) for nginx on lb01
> Oct 22 12:00:25 [694] lb02    pengine:     info: RecurringOp:  Start
> recurring monitor (15s) for nginx on lb01
> Oct 22 12:00:25 [694] lb02    pengine:   notice: LogActions: Start   nginx
> (lb01)
> Oct 22 12:00:25 [694] lb02    pengine:    error: unpack_rsc_op: Preventing
> nginx from re-starting anywhere in the cluster : operation start failed
> 'not configured' (rc=6)
> Oct 22 12:00:25 [694] lb02    pengine:  warning: unpack_rsc_op: Processing
> failed op start for nginx on lb01: not configured (6)
> Oct 22 12:00:25 [694] lb02    pengine:     info: unpack_rsc_op: Operation
> monitor found resource nginx active on lb02
> Oct 22 12:00:25 [694] lb02    pengine:     info: native_print: nginx
> (ocf::heartbeat:nginx): Started lb01 FAILED
> Oct 22 12:00:25 [694] lb02    pengine:     info: get_failcount_full: nginx
> has failed 1 times on lb01
> Oct 22 12:00:25 [694] lb02    pengine:     info: common_apply_stickiness: nginx
> can fail 999999 more times on lb01 before being forced off
> Oct 22 12:00:25 [694] lb02    pengine:     info: native_color: Resource
> nginx cannot run anywhere
> Oct 22 12:00:25 [694] lb02    pengine:   notice: LogActions: Stop    nginx
> (lb01)
> Oct 22 12:00:26 [690] lb02        cib:     info: cib_process_request: Completed
> cib_query operation for section
> //cib/status//node_state[@id='740927680']//transient_attributes//nvpair[@name='fail-count-nginx']:
> No such device or address (rc=-6, origin=local/attrd/1049, version=0.253.15)
> Oct 22 12:00:26 [690] lb02        cib:     info: cib_process_request: Completed
> cib_query operation for section
> //cib/status//node_state[@id='740927680']//transient_attributes//nvpair[@name='fail-count-nginx']:
> No such device or address (rc=-6, origin=local/attrd/1050, version=0.253.15)
> Oct 22 12:00:26 [694] lb02    pengine:    error: unpack_rsc_op: Preventing
> nginx from re-starting anywhere in the cluster : operation start failed
> 'not configured' (rc=6)
> Oct 22 12:00:26 [694] lb02    pengine:  warning: unpack_rsc_op: Processing
> failed op start for nginx on lb01: not configured (6)
> Oct 22 12:00:26 [694] lb02    pengine:     info: unpack_rsc_op: Operation
> monitor found resource nginx active on lb02
> Oct 22 12:00:26 [694] lb02    pengine:     info: native_print: nginx
> (ocf::heartbeat:nginx): Stopped
> Oct 22 12:00:26 [694] lb02    pengine:     info: get_failcount_full: nginx
> has failed INFINITY times on lb01
> Oct 22 12:00:26 [694] lb02    pengine:  warning: common_apply_stickiness: Forcing
> nginx away from lb01 after 1000000 failures (max=1000000)
> Oct 22 12:00:26 [694] lb02    pengine:     info: native_color: Resource
> nginx cannot run anywhere
> Oct 22 12:00:26 [694] lb02    pengine:     info: LogActions: Leave   nginx
> (Stopped)
>
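
A note on the log above: the "not configured (rc=6)" from the failed start on lb01 is the OCF exit code OCF_ERR_CONFIGURED, and the pengine message itself says it prevents nginx from re-starting anywhere in the cluster. Presumably the agent's start on lb01 tripped over the deliberately broken nginx.conf, though that is a guess from the log rather than something confirmed. The sketch below maps the standard OCF exit codes to their names and to the recovery type as I understand it from the Pacemaker documentation; ocf_rc_name and ocf_rc_recovery are hypothetical helper names, not Pacemaker commands.

```shell
#!/bin/sh
# Hypothetical lookup helpers; the names are made up for illustration.

# Map an OCF exit code to its standard name.
ocf_rc_name() {
    case "$1" in
        0) echo OCF_SUCCESS ;;
        1) echo OCF_ERR_GENERIC ;;
        2) echo OCF_ERR_ARGS ;;
        3) echo OCF_ERR_UNIMPLEMENTED ;;
        4) echo OCF_ERR_PERM ;;
        5) echo OCF_ERR_INSTALLED ;;
        6) echo OCF_ERR_CONFIGURED ;;
        7) echo OCF_NOT_RUNNING ;;
        *) echo UNKNOWN ;;
    esac
}

# Recovery type for a failed operation with this code, as I understand
# the Pacemaker documentation (treat this as an assumption, not a spec).
ocf_rc_recovery() {
    case "$1" in
        0) echo none ;;
        1) echo soft ;;        # restart, possibly on the same node
        2|3|4|5) echo hard ;;  # not retried on this node
        6) echo fatal ;;       # not started on any node
        7) echo n/a ;;         # resource is simply reported as stopped
        *) echo unknown ;;
    esac
}
```

For the log above, "ocf_rc_name 6" prints OCF_ERR_CONFIGURED and "ocf_rc_recovery 6" prints fatal, which matches "Resource nginx cannot run anywhere".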
> I can't for the life of me work out why this is happening. For whatever
> reason, in the node 1 -> node 2 case it decides, seemingly at random, that
> the resource can no longer run anywhere.
>
> And yes, I am making sure everything works before I start each test, so
> it's not a failure to run crm resource cleanup etc.
>
> Would really appreciate help on this as I've been trying to debug this for
> a few days and have hit a wall.
>
> Thanks,
>
> Lucas
>


-- 
This is my life, and I will live it for as long as God wills.