[Pacemaker] The strange behavior of Master/Slave when it failed to demote

renayama19661014 at ybb.ne.jp renayama19661014 at ybb.ne.jp
Wed Jan 23 05:11:06 UTC 2013


Hi All,

I filed this problem in Bugzilla on behalf of Miss Ikeda.
 * http://bugs.clusterlabs.org/show_bug.cgi?id=5133

Best Regards,
Hideo Yamauchi.


--- On Thu, 2013/1/10, Junko IKEDA <tsukishima.ha at gmail.com> wrote:

> 
> 
> Hi,
> 
> I'm running the Stateful RA with Pacemaker 1.0.12, and found that something is wrong with its demote behavior.
> 
> This is my configuration.
> There are no stonith devices, and demote/stop are set to on-fail="block".
> 
> # crm configure show
> node $id="21c624bd-c426-43dc-9665-bbfb92054bcd" dl380g5c
> node $id="3f6ec88d-ee47-4f63-bfeb-652b8dd96027" dl380g5d
> primitive dummy ocf:pacemaker:Stateful \
>         op start interval="0s" timeout="100s" on-fail="restart" \
>         op monitor interval="10s" role="Master" timeout="100s" on-fail="restart" \
>         op monitor interval="20s" role="Slave" timeout="100s" on-fail="restart" \
>         op promote interval="0s" timeout="100s" on-fail="restart" \
>         op demote interval="0s" timeout="100s" on-fail="block" \
>         op stop interval="0s" timeout="100s" on-fail="block"
> ms stateful dummy
> property $id="cib-bootstrap-options" \
>         dc-version="1.0.12-066152e" \
>         cluster-infrastructure="Heartbeat" \
>         no-quorum-policy="ignore" \
>         stonith-enabled="false" \
>         startup-fencing="false" \
>         crmd-transition-delay="2s"
> rsc_defaults $id="rsc-options" \
>         resource-stickiness="INFINITY" \
>         migration-threshold="1"
> 
> 
> 
> 1) Initial status (dl380g5c=Master/dl380g5d=Slave)
> # crm_mon -1 -n
> 
> ============
> Last updated: Thu Jan 10 18:25:17 2013
> Stack: Heartbeat
> Current DC: dl380g5d (3f6ec88d-ee47-4f63-bfeb-652b8dd96027) - partition with quorum
> Version: 1.0.12-066152e
> 2 Nodes configured, unknown expected votes
> 1 Resources configured.
> ============
> 
> Node dl380g5c (21c624bd-c426-43dc-9665-bbfb92054bcd): online
>         dummy:0 (ocf::pacemaker:Stateful) Master
> Node dl380g5d (3f6ec88d-ee47-4f63-bfeb-652b8dd96027): online
>         dummy:1 (ocf::pacemaker:Stateful) Started
> 
> 
> 
> 2) Modify the Stateful RA to reproduce a demote failure ("demote NG"), then put the Master node into standby mode.
> 
> # vim /usr/lib/ocf/resource.d/pacemaker/Stateful
> stateful_demote() {
>     return $OCF_ERR_GENERIC    # added line: forces demote to fail, so the original code below is never reached
> 
>     stateful_check_state
>     if [ $? = 0 ]; then
>         # CRM Error - Should never happen
>         return $OCF_NOT_RUNNING
> 
> ...
> 
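A switchable variant of the same fault injection might look like the sketch below, so that the failure can be turned on and off without editing the agent each time. This is only an illustration: the flag file path and the function names demote_should_fail and stateful_demote_sketch are made up for this example and are not part of the shipped Stateful agent.

```shell
#!/bin/sh
# Hypothetical fault-injection guard (names and path are assumptions).
OCF_SUCCESS=0
OCF_ERR_GENERIC=1
FAIL_FLAG="${FAIL_FLAG:-/tmp/stateful_demote_fail}"   # assumed flag file path

# True when the flag file exists, i.e. when demote should be made to fail.
demote_should_fail() {
    [ -e "$FAIL_FLAG" ]
}

# Stand-in for stateful_demote: fails while the flag file exists.
stateful_demote_sketch() {
    if demote_should_fail; then
        return $OCF_ERR_GENERIC
    fi
    # The agent's real demote logic would run here.
    return $OCF_SUCCESS
}
```

With something like this in place, creating the flag file on the Master node before "crm node standby" would trigger the failure, and removing it would restore normal behavior.
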
> 
> # crm node standby dl380g5c
> # crm_mon -1 -n
> ============
> Last updated: Thu Jan 10 18:27:04 2013
> Stack: Heartbeat
> Current DC: dl380g5d (3f6ec88d-ee47-4f63-bfeb-652b8dd96027) - partition with quorum
> Version: 1.0.12-066152e
> 2 Nodes configured, unknown expected votes
> 1 Resources configured.
> ============
> 
> Node dl380g5c (21c624bd-c426-43dc-9665-bbfb92054bcd): standby
>         dummy:0 (ocf::pacemaker:Stateful) Slave  (unmanaged) FAILED
> Node dl380g5d (3f6ec88d-ee47-4f63-bfeb-652b8dd96027): online
>         dummy:1 (ocf::pacemaker:Stateful) Master
> 
> Failed actions:
>     dummy:0_demote_0 (node=dl380g5c, call=4, rc=1, status=complete): unknown error
> 
> 
> In the crm_mon output above, dl380g5c's status is shown as "Slave", but it might still be "Master" because the demote failed.
> So dl380g5d should be prohibited from promoting, to prevent multiple Masters.
> Pacemaker 1.1 seems to show the same behavior as 1.0.12.
> I'm not sure, but I think Pacemaker 1.0.11 behaves correctly (dl380g5d cannot promote).
> Please see the attached hb_report.
> 
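Since the displayed role cannot be trusted after a failed demote, it may help to list the nodes that reported one. The following is a rough sketch that assumes the "Failed actions:" line format shown in the crm_mon output above; the function name failed_demote_nodes is made up for this example.

```shell
#!/bin/sh
# Reads `crm_mon -1 -n` style text on stdin and prints the node name
# of every failed demote listed under "Failed actions:".
# Assumes lines of the form:
#   <rsc>_demote_<interval> (node=<name>, call=..., rc=..., status=...): ...
failed_demote_nodes() {
    sed -n 's/^ *[^ ]*_demote_[0-9]* (node=\([^,]*\),.*/\1/p'
}
```

Usage would be along the lines of "crm_mon -1 -n | failed_demote_nodes". Any node it prints may still be running the resource as Master, whatever role crm_mon shows for it.
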
> 
> Jan 10 18:27:01 dl380g5d pengine: [4297]: info: determine_online_status: Node dl380g5c is standby
> Jan 10 18:27:01 dl380g5d pengine: [4297]: info: determine_online_status: Node dl380g5d is online
> Jan 10 18:27:01 dl380g5d pengine: [4297]: notice: unpack_rsc_op: Operation dummy:0_monitor_0 found resource dummy:0 active in master mode on dl380g5c
> Jan 10 18:27:01 dl380g5d pengine: [4297]: WARN: unpack_rsc_op: Processing failed op dummy:0_demote_0 on dl380g5c: unknown error (1)
> Jan 10 18:27:01 dl380g5d pengine: [4297]: WARN: unpack_rsc_op: Forcing dummy:0 to stop after a failed demote action
> Jan 10 18:27:01 dl380g5d pengine: [4297]: info: native_add_running: resource dummy:0 isnt managed
> Jan 10 18:27:01 dl380g5d pengine: [4297]: notice: clone_print:  Master/Slave Set: stateful
> Jan 10 18:27:01 dl380g5d pengine: [4297]: notice: native_print:      dummy:0	(ocf::pacemaker:Stateful):	Slave dl380g5c (unmanaged) FAILED
> Jan 10 18:27:01 dl380g5d pengine: [4297]: notice: short_print:      Slaves: [ dl380g5d ]
> Jan 10 18:27:01 dl380g5d pengine: [4297]: info: get_failcount: stateful has failed 1 times on dl380g5c
> Jan 10 18:27:01 dl380g5d pengine: [4297]: WARN: common_apply_stickiness: Forcing stateful away from dl380g5c after 1 failures (max=1)
> Jan 10 18:27:01 dl380g5d pengine: [4297]: info: get_failcount: stateful has failed 1 times on dl380g5c
> Jan 10 18:27:01 dl380g5d pengine: [4297]: WARN: common_apply_stickiness: Forcing stateful away from dl380g5c after 1 failures (max=1)
> Jan 10 18:27:01 dl380g5d pengine: [4297]: info: native_color: Unmanaged resource dummy:0 allocated to 'nowhere': failed
> Jan 10 18:27:01 dl380g5d pengine: [4297]: info: master_color: Promoting dummy:1 (Slave dl380g5d)
> Jan 10 18:27:01 dl380g5d pengine: [4297]: info: master_color: stateful: Promoted 1 instances of a possible 1 to master
> Jan 10 18:27:01 dl380g5d pengine: [4297]: notice: RecurringOp:  Start recurring monitor (10s) for dummy:1 on dl380g5d
> Jan 10 18:27:01 dl380g5d pengine: [4297]: notice: RecurringOp:  Start recurring monitor (10s) for dummy:1 on dl380g5d
> Jan 10 18:27:01 dl380g5d pengine: [4297]: notice: LogActions: Leave   resource dummy:0	(Slave unmanaged)
> Jan 10 18:27:01 dl380g5d pengine: [4297]: notice: LogActions: Promote dummy:1	(Slave -> Master dl380g5d)
> 
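The pattern in the log above (a failed demote is processed, yet a promote is still scheduled) could be checked for mechanically. This is a sketch that assumes the message wording of the 1.0.12 log quoted here; other versions may word these messages differently, and the function name promote_after_failed_demote is made up for this example.

```shell
#!/bin/sh
# Reads pengine log lines on stdin.
# Exits 0 if the input contains both a processed failed demote and a
# scheduled Promote action, and exits 1 otherwise.
promote_after_failed_demote() {
    awk '/Processing failed op .*_demote_/ { failed = 1 }
         /LogActions: Promote /            { promoted = 1 }
         END { exit !(failed && promoted) }'
}
```

Feeding it the pengine lines of a single transition, for example with grep on the log file, would report whether that transition shows the problem described in this thread.
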
> 
> 
> Best Regards,
> Junko IKEDA
> 
> NTT DATA INTELLILINK CORPORATION



