[Pacemaker] prevent the resource's start if it has "stop NG" history on the other node

Wed Feb 29 04:08:25 EST 2012

Hi,

Some additional information:
(1) The resource is running on the DC.
(2) Pacemaker is shut down on the DC, and the resource's stop fails
    (stop NG), leaving it unmanaged.
(3) The other node becomes the DC.
(4) The resource starts on the new DC
    (although it still has unmanaged status on the old DC...).
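
One possible way to catch the failed stop while the node is still a cluster member would be to put the node into standby first (for example with "crm node standby"), wait for the transition to finish, and stop the cluster stack only if no resource is reported as "(unmanaged) FAILED". The sketch below is only an illustration under assumptions: the helper names and the STATUS_CMD override are hypothetical, and it assumes that "crm_mon -1" prints each resource on one line in the same form as the crm_simulate output quoted below.

```shell
#!/bin/sh
# Sketch only: helper names and STATUS_CMD are hypothetical, not Pacemaker features.

# Print the names of resources whose status line contains "(unmanaged) FAILED".
# Reads status text on stdin; assumes one resource per line, name in column 1.
list_unmanaged_failed() {
    awk '/\(unmanaged\) +FAILED/ { print $1 }'
}

# Run the given stop command only if no resource is unmanaged and failed.
# STATUS_CMD defaults to a one-shot crm_mon; override it to read saved output.
guarded_stop() {
    status=$(${STATUS_CMD:-crm_mon -1}) || {
        echo "cannot read cluster status, not stopping" >&2
        return 2
    }
    blocked=$(printf '%s\n' "$status" | list_unmanaged_failed)
    if [ -n "$blocked" ]; then
        echo "refusing to stop, unmanaged failed resources: $blocked" >&2
        return 1
    fi
    "$@"
}

# Example (not executed here), after the node has been put into standby:
#   guarded_stop service heartbeat stop
```

The check deliberately refuses to stop when the status cannot be read, so that a missing or failing status command does not silently allow the shutdown.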

See the other attached hb_report.

By the way, does this patch mean that a Pacemaker shutdown succeeds
even if there are some failed unmanaged resources?

High: PE: Bug lf#1959 - Fail unmanaged resources should not prevent
other services from shutting down
https://github.com/ClusterLabs/pacemaker/commit/07976fe5eb04c432f1d1c9aebb1b1587ba7f0bcf#pengine/graph.c

I don't know the details of lf#1959, and it would be better to set up
STONITH to handle a resource that becomes unmanaged after a failed "stop",
but I think a stop NG should not permit Pacemaker to shut itself down, just in case.
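
For reference, a STONITH-based variant of the configuration quoted below might look like the following fragment. It is only a sketch: it assumes that a working STONITH resource for the actual hardware is configured separately (not shown here, since it depends on the fencing device), and it changes only stonith-enabled and the on-fail setting of the stop operation.

```
property \
   no-quorum-policy="ignore" \
   stonith-enabled="true" \
   crmd-transition-delay="2s"

primitive dummy01 ocf:heartbeat:Dummy-stop-NG \
   op start   timeout="60s" interval="0s"  on-fail="restart" \
   op monitor timeout="60s" interval="7s"  on-fail="restart" \
   op stop    timeout="60s" interval="0s"  on-fail="fence"
```

With on-fail="fence", a failed stop would lead to the node being fenced instead of the resource being left unmanaged, so the resource could then be recovered on the other node without the risk of it still running on the old one.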

Thanks,
Junko

2012/2/29 Junko IKEDA <tsukishima.ha at gmail.com>:
> Hi,
>
> I'm running the following simple configuration with Pacemaker 1.1.6,
> and trying the test case "resource stop NG and shutdown Pacemaker".
>
> property \
>    no-quorum-policy="ignore" \
>    stonith-enabled="false" \
>    crmd-transition-delay="2s"
>
> rsc_defaults \
>    resource-stickiness="INFINITY" \
>    migration-threshold="1"
>
> primitive dummy01 ocf:heartbeat:Dummy-stop-NG \
>    op start   timeout="60s" interval="0s"  on-fail="restart" \
>    op monitor timeout="60s" interval="7s"  on-fail="restart" \
>    op stop    timeout="60s" interval="0s"  on-fail="block"
>
>
> "Dummy-stop-NG" RA just sends "stop NG" to Pacemaker.
>
> # diff -urNp Dummy Dummy-stop-NG
> --- Dummy       2011-06-30 17:43:37.000000000 +0900
> +++ Dummy-stop-NG       2012-02-28 19:11:12.850207767 +0900
> @@ -108,6 +108,8 @@ dummy_start() {
>  }
>
>  dummy_stop() {
> +    exit $OCF_ERR_GENERIC
> +
>     dummy_monitor
>     if [ $? =  $OCF_SUCCESS ]; then
>        rm ${OCF_RESKEY_state}
>
>
>
> Before the test, the resource is running on "bl460g6a".
>
> # crm_simulate -S -x pe-input-1.bz2
>
> Current cluster status:
> Online: [ bl460g6a bl460g6b ]
>
>  dummy01        (ocf::heartbeat:Dummy-stop-NG): Stopped
>
> Transition Summary:
> crm_simulate[14195]: 2012/02/29_15:46:57 notice: LogActions: Start dummy01    (bl460g6a)
>
> Executing cluster transition:
>  * Executing action 6: dummy01_monitor_0 on bl460g6b
>  * Executing action 4: dummy01_monitor_0 on bl460g6a
>  * Executing action 7: dummy01_start_0 on bl460g6a
>  * Executing action 8: dummy01_monitor_7000 on bl460g6a
>
> Revised cluster status:
> Online: [ bl460g6a bl460g6b ]
>
>  dummy01        (ocf::heartbeat:Dummy-stop-NG): Started bl460g6a
>
>
>
> Stop Pacemaker on "bl460g6a".
> # service heartbeat stop
>
> At first, Pacemaker tries to stop the resource and move it to "bl460g6b":
> # crm_simulate -S -x pe-input-2.bz2
>
> Current cluster status:
> Online: [ bl460g6a bl460g6b ]
>
>  dummy01        (ocf::heartbeat:Dummy-stop-NG): Started bl460g6a
>
> Transition Summary:
> crm_simulate[12195]: 2012/02/29_15:35:02 notice: LogActions: Move dummy01    (Started bl460g6a -> bl460g6b)
>
> Executing cluster transition:
>  * Executing action 6: dummy01_stop_0 on bl460g6a
>  * Executing action 7: dummy01_start_0 on bl460g6b
>  * Executing action 8: dummy01_monitor_7000 on bl460g6b
>
> Revised cluster status:
> Online: [ bl460g6a bl460g6b ]
>
>  dummy01        (ocf::heartbeat:Dummy-stop-NG): Started bl460g6b
>
>
>
> but the stop action fails, which means the resource goes into the unmanaged state.
> # crm_simulate -S -x pe-input-3.bz2
>
> Current cluster status:
> Online: [ bl460g6a bl460g6b ]
>
>  dummy01        (ocf::heartbeat:Dummy-stop-NG): Started bl460g6a (unmanaged) FAILED
>
> Transition Summary:
>
> Executing cluster transition:
>
> Revised cluster status:
> Online: [ bl460g6a bl460g6b ]
>
>  dummy01        (ocf::heartbeat:Dummy-stop-NG): Started bl460g6a (unmanaged) FAILED
>
>
>
> The Pacemaker shutdown on "bl460g6a" succeeds,
> so it seems that the following patch works well.
> https://github.com/ClusterLabs/pacemaker/commit/07976fe5eb04c432f1d1c9aebb1b1587ba7f0bcf#pengine/graph.c
>
> At this point, the resource on "bl460g6a" (where Pacemaker has already
> shut down) might still be running because it failed to stop.
> In fact, the resource did not start on "bl460g6b" after the stop NG and
> the shutdown of "bl460g6a", which is the expected behavior,
> but I was able to start it on "bl460g6b" with the crm command.
> This creates the potential for an unexpected active/active state.
> Is it possible to prevent its start in this situation?
> For example:
> (1) Dummy runs on node-a.
> (2) Pacemaker is shut down on node-a, and the stop of Dummy fails (stop NG).
> (3) Dummy cannot run on other nodes.
> (4) * The unmanaged status of Dummy is cleaned up after checking it
>     manually on node-a.
> (5) * Dummy is started on other nodes.
> This would be the safe way.
>
> See the attached hb_report.
>
> Thanks,
> Junko IKEDA
>
> NTT DATA INTELLILINK CORPORATION
-------------- next part --------------
A non-text attachment was scrubbed...
Name: hb_report_dc.tar.bz2
Type: application/x-bzip2
Size: 66061 bytes
Desc: not available
URL: <https://lists.clusterlabs.org/pipermail/pacemaker/attachments/20120229/357c3eec/attachment-0003.bz2>