[Pacemaker] Resource only fails over in one direction

Lucas Brown lucas at locatrix.com
Mon Oct 21 23:45:21 EDT 2013


Hey guys,

I'm encountering a really strange problem while testing failover of my
ocf:heartbeat:nginx resource in my 2-node cluster. I can manually migrate
the resource between the nodes and that works fine, but when the resource
fails on one node I can't get it to start on the other. The strange part is
that this only happens if the failure was on node 1; if I reproduce the
failure on node 2, the resource correctly fails over to node 1.

no-quorum-policy is set to ignore, so that doesn't seem to be the issue, and
some similar threads suggested start-failure-is-fatal=false might help, but
it doesn't resolve this either. My full configuration also includes a
Virtual IP and ping clones, and those parts work fine; nginx even fails over
correctly when its host goes offline completely. I just can't get the same
behaviour when only the resource fails.

My test case:

> vim /etc/nginx/nginx.conf    (insert invalid jargon and save)
> service nginx restart

Expected outcome: the resource fails over to the other node upon monitor
failure, in either direction between my 2 nodes.
Actual outcome: the resource fails over correctly from node 2 -> node 1, but
not from node 1 -> node 2.
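
(Scripted, one iteration of the test looks roughly like this; the sed line
is just one way of injecting the invalid jargon, and crm_mon's one-shot /
fail-count flags are what I use to see the monitor failure being recorded:)

# on the node currently running nginx: break the config and restart
sed -i '1i this_is_not_valid_nginx_syntax;' /etc/nginx/nginx.conf
service nginx restart
# one-shot cluster status including fail counts
crm_mon -1 -f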

This is my test configuration for reproducing the issue (to make sure my
other stuff isn't interfering).
-----------------------
node $id="724150464" lb01
node $id="740927680" lb02
primitive nginx ocf:heartbeat:nginx \
params configfile="/etc/nginx/nginx.conf" \
op monitor interval="10s" timeout="30s" depth="0" \
op monitor interval="15s" timeout="30s" status10url="
http://localhost/nginx_status" depth="10"
property $id="cib-bootstrap-options" \
dc-version="1.1.10-42f2063" \
cluster-infrastructure="corosync" \
stonith-enabled="false" \
no-quorum-policy="ignore" \
start-failure-is-fatal="false" \
last-lrm-refresh="1382410708"
rsc_defaults $id="rsc-options" \
resource-stickiness="100"
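
(For reference, the CIB can be sanity-checked with the standard tools if
that helps; these are ordinary Pacemaker/crmsh commands, nothing specific to
my setup:)

crm_verify -L -V          # check the live CIB for configuration errors
crm configure show nginx  # show the nginx primitive as the cluster sees it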

This is what happens when I perform the test case on node lb02, and it
correctly migrates/restarts the resource on lb01.
-----------------------
Oct 22 11:58:12 [694] lb02    pengine:  warning: unpack_rsc_op: Processing failed op monitor for nginx on lb02: not running (7)
Oct 22 11:58:12 [694] lb02    pengine:     info: native_print: nginx (ocf::heartbeat:nginx): Started lb02 FAILED
Oct 22 11:58:12 [694] lb02    pengine:     info: RecurringOp:  Start recurring monitor (10s) for nginx on lb02
Oct 22 11:58:12 [694] lb02    pengine:     info: RecurringOp:  Start recurring monitor (15s) for nginx on lb02
Oct 22 11:58:12 [694] lb02    pengine:   notice: LogActions: Recover nginx (Started lb02)
Oct 22 11:58:12 [690] lb02        cib:     info: cib_process_request: Completed cib_query operation for section //cib/status//node_state[@id='740927680']//transient_attributes//nvpair[@name='fail-count-nginx']: No such device or address (rc=-6, origin=local/attrd/1038, version=0.252.2)
Oct 22 11:58:12 [692] lb02       lrmd:     info: cancel_recurring_action: Cancelling operation nginx_monitor_15000
Oct 22 11:58:12 [692] lb02       lrmd:     info: cancel_recurring_action: Cancelling operation nginx_monitor_10000
Oct 22 11:58:12 [692] lb02       lrmd:     info: log_execute: executing - rsc:nginx action:stop call_id:848
Oct 22 11:58:12 [695] lb02       crmd:     info: process_lrm_event: LRM operation nginx_monitor_15000 (call=839, status=1, cib-update=0, confirmed=true) Cancelled
Oct 22 11:58:12 [695] lb02       crmd:     info: process_lrm_event: LRM operation nginx_monitor_10000 (call=841, status=1, cib-update=0, confirmed=true) Cancelled
Oct 22 11:58:12 [690] lb02        cib:     info: cib_process_request: Completed cib_query operation for section //cib/status//node_state[@id='740927680']//transient_attributes//nvpair[@name='last-failure-nginx']: OK (rc=0, origin=local/attrd/1041, version=0.252.3)
nginx[31237]: 2013/10/22_11:58:12 INFO: nginx is not running.
Oct 22 11:58:12 [692] lb02       lrmd:     info: log_finished: finished - rsc:nginx action:stop call_id:848 pid:31237 exit-code:0 exec-time:155ms queue-time:0ms
Oct 22 11:58:12 [695] lb02       crmd:   notice: process_lrm_event: LRM operation nginx_stop_0 (call=848, rc=0, cib-update=593, confirmed=true) ok
Oct 22 11:58:12 [694] lb02    pengine:     info: unpack_rsc_op: Operation monitor found resource nginx active on lb01
Oct 22 11:58:12 [694] lb02    pengine:  warning: unpack_rsc_op: Processing failed op monitor for nginx on lb02: not running (7)
Oct 22 11:58:12 [694] lb02    pengine:     info: native_print: nginx (ocf::heartbeat:nginx): Stopped
Oct 22 11:58:12 [694] lb02    pengine:     info: get_failcount_full: nginx has failed 1 times on lb02
Oct 22 11:58:12 [694] lb02    pengine:     info: common_apply_stickiness: nginx can fail 999999 more times on lb02 before being forced off
Oct 22 11:58:12 [694] lb02    pengine:     info: RecurringOp:  Start recurring monitor (10s) for nginx on lb01
Oct 22 11:58:12 [694] lb02    pengine:     info: RecurringOp:  Start recurring monitor (15s) for nginx on lb01
Oct 22 11:58:12 [694] lb02    pengine:   notice: LogActions: Start   nginx (lb01)


This is what happens when I try to go from lb01 -> lb02.
-----------------------
Oct 22 12:00:25 [694] lb02    pengine:  warning: unpack_rsc_op: Processing failed op monitor for nginx on lb01: not running (7)
Oct 22 12:00:25 [694] lb02    pengine:     info: unpack_rsc_op: Operation monitor found resource nginx active on lb02
Oct 22 12:00:25 [694] lb02    pengine:     info: native_print: nginx (ocf::heartbeat:nginx): Started lb01 FAILED
Oct 22 12:00:25 [694] lb02    pengine:     info: RecurringOp:  Start recurring monitor (10s) for nginx on lb01
Oct 22 12:00:25 [694] lb02    pengine:     info: RecurringOp:  Start recurring monitor (15s) for nginx on lb01
Oct 22 12:00:25 [694] lb02    pengine:   notice: LogActions: Recover nginx (Started lb01)
Oct 22 12:00:25 [690] lb02        cib:     info: cib_process_request: Completed cib_query operation for section //cib/status//node_state[@id='740927680']//transient_attributes//nvpair[@name='fail-count-nginx']: No such device or address (rc=-6, origin=local/attrd/1046, version=0.253.12)
Oct 22 12:00:25 [690] lb02        cib:     info: cib_process_request: Completed cib_query operation for section //cib/status//node_state[@id='740927680']//transient_attributes//nvpair[@name='last-failure-nginx']: OK (rc=0, origin=local/attrd/1047, version=0.253.12)
Oct 22 12:00:25 [694] lb02    pengine:  warning: unpack_rsc_op: Processing failed op monitor for nginx on lb01: not running (7)
Oct 22 12:00:25 [694] lb02    pengine:     info: unpack_rsc_op: Operation monitor found resource nginx active on lb02
Oct 22 12:00:25 [694] lb02    pengine:     info: native_print: nginx (ocf::heartbeat:nginx): Stopped
Oct 22 12:00:25 [694] lb02    pengine:     info: get_failcount_full: nginx has failed 1 times on lb01
Oct 22 12:00:25 [694] lb02    pengine:     info: common_apply_stickiness: nginx can fail 999999 more times on lb01 before being forced off
Oct 22 12:00:25 [694] lb02    pengine:     info: RecurringOp:  Start recurring monitor (10s) for nginx on lb01
Oct 22 12:00:25 [694] lb02    pengine:     info: RecurringOp:  Start recurring monitor (15s) for nginx on lb01
Oct 22 12:00:25 [694] lb02    pengine:   notice: LogActions: Start   nginx (lb01)
Oct 22 12:00:25 [694] lb02    pengine:    error: unpack_rsc_op: Preventing nginx from re-starting anywhere in the cluster : operation start failed 'not configured' (rc=6)
Oct 22 12:00:25 [694] lb02    pengine:  warning: unpack_rsc_op: Processing failed op start for nginx on lb01: not configured (6)
Oct 22 12:00:25 [694] lb02    pengine:     info: unpack_rsc_op: Operation monitor found resource nginx active on lb02
Oct 22 12:00:25 [694] lb02    pengine:     info: native_print: nginx (ocf::heartbeat:nginx): Started lb01 FAILED
Oct 22 12:00:25 [694] lb02    pengine:     info: get_failcount_full: nginx has failed 1 times on lb01
Oct 22 12:00:25 [694] lb02    pengine:     info: common_apply_stickiness: nginx can fail 999999 more times on lb01 before being forced off
Oct 22 12:00:25 [694] lb02    pengine:     info: native_color: Resource nginx cannot run anywhere
Oct 22 12:00:25 [694] lb02    pengine:   notice: LogActions: Stop    nginx (lb01)
Oct 22 12:00:26 [690] lb02        cib:     info: cib_process_request: Completed cib_query operation for section //cib/status//node_state[@id='740927680']//transient_attributes//nvpair[@name='fail-count-nginx']: No such device or address (rc=-6, origin=local/attrd/1049, version=0.253.15)
Oct 22 12:00:26 [690] lb02        cib:     info: cib_process_request: Completed cib_query operation for section //cib/status//node_state[@id='740927680']//transient_attributes//nvpair[@name='fail-count-nginx']: No such device or address (rc=-6, origin=local/attrd/1050, version=0.253.15)
Oct 22 12:00:26 [694] lb02    pengine:    error: unpack_rsc_op: Preventing nginx from re-starting anywhere in the cluster : operation start failed 'not configured' (rc=6)
Oct 22 12:00:26 [694] lb02    pengine:  warning: unpack_rsc_op: Processing failed op start for nginx on lb01: not configured (6)
Oct 22 12:00:26 [694] lb02    pengine:     info: unpack_rsc_op: Operation monitor found resource nginx active on lb02
Oct 22 12:00:26 [694] lb02    pengine:     info: native_print: nginx (ocf::heartbeat:nginx): Stopped
Oct 22 12:00:26 [694] lb02    pengine:     info: get_failcount_full: nginx has failed INFINITY times on lb01
Oct 22 12:00:26 [694] lb02    pengine:  warning: common_apply_stickiness: Forcing nginx away from lb01 after 1000000 failures (max=1000000)
Oct 22 12:00:26 [694] lb02    pengine:     info: native_color: Resource nginx cannot run anywhere
Oct 22 12:00:26 [694] lb02    pengine:     info: LogActions: Leave   nginx (Stopped)

I can't for the life of me work out why this is happening. For whatever
reason, in the node 1 -> node 2 direction it decides that the resource can
no longer run anywhere.

And yes, I am making sure everything is working before I start each test, so
it's not a failure to run crm resource cleanup etc.
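
(Concretely, the reset between runs is something like the following;
nginx.conf.good is just a placeholder name for my known-good copy of the
config:)

# put back a known-good nginx.conf and confirm it parses
cp /etc/nginx/nginx.conf.good /etc/nginx/nginx.conf
nginx -t
# clear nginx's failure history on both nodes (crmsh syntax)
crm resource cleanup nginx
# confirm nginx is running again and the fail counts are back to zero
crm_mon -1 -f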

Would really appreciate help on this as I've been trying to debug this for
a few days and have hit a wall.

Thanks,

Lucas