[Pacemaker] The strange behavior of Master/Slave when it failed to demote

Junko IKEDA tsukishima.ha at gmail.com
Thu Jan 10 10:31:48 UTC 2013


Hi,

I'm running the Stateful RA with Pacemaker 1.0.12, and its behavior after a
failed demote looks wrong.

This is my configuration.
There are no stonith devices, and demote/stop are set to on-fail="block".

# crm configure show
node $id="21c624bd-c426-43dc-9665-bbfb92054bcd" dl380g5c
node $id="3f6ec88d-ee47-4f63-bfeb-652b8dd96027" dl380g5d
primitive dummy ocf:pacemaker:Stateful \
        op start interval="0s" timeout="100s" on-fail="restart" \
        op monitor interval="10s" role="Master" timeout="100s" on-fail="restart" \
        op monitor interval="20s" role="Slave" timeout="100s" on-fail="restart" \
        op promote interval="0s" timeout="100s" on-fail="restart" \
        op demote interval="0s" timeout="100s" on-fail="block" \
        op stop interval="0s" timeout="100s" on-fail="block"
ms stateful dummy
property $id="cib-bootstrap-options" \
        dc-version="1.0.12-066152e" \
        cluster-infrastructure="Heartbeat" \
        no-quorum-policy="ignore" \
        stonith-enabled="false" \
        startup-fencing="false" \
        crmd-transition-delay="2s"
rsc_defaults $id="rsc-options" \
        resource-stickiness="INFINITY" \
        migration-threshold="1"



1) Initial status (dl380g5c=Master/dl380g5d=Slave)
# crm_mon -1 -n

============
Last updated: Thu Jan 10 18:25:17 2013
Stack: Heartbeat
Current DC: dl380g5d (3f6ec88d-ee47-4f63-bfeb-652b8dd96027) - partition with quorum
Version: 1.0.12-066152e
2 Nodes configured, unknown expected votes
1 Resources configured.
============

Node dl380g5c (21c624bd-c426-43dc-9665-bbfb92054bcd): online
        dummy:0 (ocf::pacemaker:Stateful) Master
Node dl380g5d (3f6ec88d-ee47-4f63-bfeb-652b8dd96027): online
        dummy:1 (ocf::pacemaker:Stateful) Started



2) Modify the Stateful RA to reproduce a demote failure ("demote NG"), and
put the Master node into standby mode.

# vim /usr/lib/ocf/resource.d/pacemaker/Stateful
stateful_demote() {
    # Added for this test: fail every demote with a generic error (rc=1)
    return $OCF_ERR_GENERIC

    stateful_check_state
    if [ $? = 0 ]; then
        # CRM Error - Should never happen
        return $OCF_NOT_RUNNING

...
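
The same failure could also be switched on and off without re-editing the RA for each run. The following is only a sketch under stated assumptions: real_demote is a hypothetical stand-in for the original demote body, and the FAIL_DEMOTE_FLAG variable and its default path are hypothetical names, not part of the shipped Stateful RA.

```shell
#!/bin/sh
# Sketch only: this is NOT the shipped Stateful RA.
# OCF return codes as defined by the OCF resource agent API.
: "${OCF_SUCCESS:=0}"
: "${OCF_ERR_GENERIC:=1}"

# Hypothetical flag file: while it exists, every demote fails.
FAIL_DEMOTE_FLAG="${FAIL_DEMOTE_FLAG:-/tmp/stateful_fail_demote}"

# Hypothetical stand-in for the original body of stateful_demote.
real_demote() {
    return "$OCF_SUCCESS"
}

stateful_demote() {
    if [ -e "$FAIL_DEMOTE_FLAG" ]; then
        # Injected failure, equivalent to the unconditional return above
        return "$OCF_ERR_GENERIC"
    fi
    real_demote
}
```

With this shape, creating the flag file on the Master node before "crm node standby" would trigger the failure, and removing it would restore normal demotes.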


# crm node standby dl380g5c
# crm_mon -1 -n
============
Last updated: Thu Jan 10 18:27:04 2013
Stack: Heartbeat
Current DC: dl380g5d (3f6ec88d-ee47-4f63-bfeb-652b8dd96027) - partition with quorum
Version: 1.0.12-066152e
2 Nodes configured, unknown expected votes
1 Resources configured.
============

Node dl380g5c (21c624bd-c426-43dc-9665-bbfb92054bcd): standby
        dummy:0 (ocf::pacemaker:Stateful) Slave  (unmanaged) FAILED
Node dl380g5d (3f6ec88d-ee47-4f63-bfeb-652b8dd96027): online
        dummy:1 (ocf::pacemaker:Stateful) Master

Failed actions:
    dummy:0_demote_0 (node=dl380g5c, call=4, rc=1, status=complete): unknown error


In the crm_mon output above, dl380g5c is shown as "Slave", but it may still
be "Master" because the demote failed.
So dl380g5d should not be promoted, in order to prevent multiple Masters.
Pacemaker 1.1 seems to behave the same way as 1.0.12.
I'm not sure, but I think the behavior of Pacemaker 1.0.11 is correct
(dl380g5d cannot be promoted).
Please see the attached hb_report.
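
The situation described above (one instance marked FAILED while another node is shown as Master) can be spotted mechanically in saved crm_mon output. This is a minimal sketch, assuming the "crm_mon -1 -n" text layout shown in this mail; the function name summarize_ms_state is hypothetical, and it only reads text, it does not query the cluster.

```shell
#!/bin/sh
# Reads `crm_mon -1 -n` text on stdin (layout assumed as in this mail).
# Prints one line per resource instance marked FAILED, then the number of
# non-failed instances displayed as Master.
summarize_ms_state() {
    awk '
        /^Node / { node = $2 }                      # remember the current node
        /FAILED/ && node != "" {
            printf "failed-instance: %s on %s\n", $1, node
        }
        / Master/ && !/FAILED/ && node != "" { masters++ }
        END { printf "masters-shown: %d\n", masters }
    '
}
```

For the second crm_mon output in this mail it reports dummy:0 on dl380g5c as a failed instance and one Master shown. A failed instance together with a nonzero Master count means the real number of Masters is uncertain, since the displayed role of a failed demote may not match the actual one.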


Jan 10 18:27:01 dl380g5d pengine: [4297]: info: determine_online_status: Node dl380g5c is standby
Jan 10 18:27:01 dl380g5d pengine: [4297]: info: determine_online_status: Node dl380g5d is online
Jan 10 18:27:01 dl380g5d pengine: [4297]: notice: unpack_rsc_op: Operation dummy:0_monitor_0 found resource dummy:0 active in master mode on dl380g5c
Jan 10 18:27:01 dl380g5d pengine: [4297]: WARN: unpack_rsc_op: Processing failed op dummy:0_demote_0 on dl380g5c: unknown error (1)
Jan 10 18:27:01 dl380g5d pengine: [4297]: WARN: unpack_rsc_op: Forcing dummy:0 to stop after a failed demote action
Jan 10 18:27:01 dl380g5d pengine: [4297]: info: native_add_running: resource dummy:0 isnt managed
Jan 10 18:27:01 dl380g5d pengine: [4297]: notice: clone_print:  Master/Slave Set: stateful
Jan 10 18:27:01 dl380g5d pengine: [4297]: notice: native_print:      dummy:0 (ocf::pacemaker:Stateful): Slave dl380g5c (unmanaged) FAILED
Jan 10 18:27:01 dl380g5d pengine: [4297]: notice: short_print:      Slaves: [ dl380g5d ]
Jan 10 18:27:01 dl380g5d pengine: [4297]: info: get_failcount: stateful has failed 1 times on dl380g5c
Jan 10 18:27:01 dl380g5d pengine: [4297]: WARN: common_apply_stickiness: Forcing stateful away from dl380g5c after 1 failures (max=1)
Jan 10 18:27:01 dl380g5d pengine: [4297]: info: get_failcount: stateful has failed 1 times on dl380g5c
Jan 10 18:27:01 dl380g5d pengine: [4297]: WARN: common_apply_stickiness: Forcing stateful away from dl380g5c after 1 failures (max=1)
Jan 10 18:27:01 dl380g5d pengine: [4297]: info: native_color: Unmanaged resource dummy:0 allocated to 'nowhere': failed
Jan 10 18:27:01 dl380g5d pengine: [4297]: info: master_color: Promoting dummy:1 (Slave dl380g5d)
Jan 10 18:27:01 dl380g5d pengine: [4297]: info: master_color: stateful: Promoted 1 instances of a possible 1 to master
Jan 10 18:27:01 dl380g5d pengine: [4297]: notice: RecurringOp:  Start recurring monitor (10s) for dummy:1 on dl380g5d
Jan 10 18:27:01 dl380g5d pengine: [4297]: notice: RecurringOp:  Start recurring monitor (10s) for dummy:1 on dl380g5d
Jan 10 18:27:01 dl380g5d pengine: [4297]: notice: LogActions: Leave resource dummy:0 (Slave unmanaged)
Jan 10 18:27:01 dl380g5d pengine: [4297]: notice: LogActions: Promote dummy:1 (Slave -> Master dl380g5d)
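
The lines that matter in the excerpt above are the ones where the policy engine records the failed demote and then decides to promote the other instance. A small filter like the following pulls just those out of a longer log. It is a sketch only: the function name pengine_decisions is hypothetical, and the pattern assumes the log format shown in this mail.

```shell
#!/bin/sh
# Reads a Heartbeat/Pacemaker log on stdin and keeps only the pengine lines
# logged by unpack_rsc_op, master_color and LogActions (format assumed as
# in the excerpt above).
pengine_decisions() {
    grep -E 'pengine: .*(unpack_rsc_op|master_color|LogActions):'
}
```

It would be used as "pengine_decisions < /path/to/ha-log", where the log path is a placeholder that depends on the local logging setup.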



Best Regards,
Junko IKEDA

NTT DATA INTELLILINK CORPORATION
-------------- next part --------------
A non-text attachment was scrubbed...
Name: hb_report.tar.bz2
Type: application/x-bzip2
Size: 53251 bytes
Desc: not available
URL: <https://lists.clusterlabs.org/pipermail/pacemaker/attachments/20130110/e312a452/attachment-0003.bz2>

