[Pacemaker] prevent the resource's start if it has "stop NG" history on the other node

Wed Feb 29 21:33:30 EST 2012

On Wed, Feb 29, 2012 at 6:32 PM, Junko IKEDA <tsukishima.ha at gmail.com> wrote:
> Hi,
>
> I'm running the following simple configuration with Pacemaker 1.1.6
> and trying the test case "resource stop NG, then shut down Pacemaker".
>
> property \
>    no-quorum-policy="ignore" \
>    stonith-enabled="false" \
>    crmd-transition-delay="2s"
>
> rsc_defaults \
>    resource-stickiness="INFINITY" \
>    migration-threshold="1"
>
> primitive dummy01 ocf:heartbeat:Dummy-stop-NG \
>    op start   timeout="60s" interval="0s"  on-fail="restart" \
>    op monitor timeout="60s" interval="7s"  on-fail="restart" \
>    op stop    timeout="60s" interval="0s"  on-fail="block"
>
>
> The "Dummy-stop-NG" RA simply returns a stop failure ("stop NG") to Pacemaker.
>
> # diff -urNp Dummy Dummy-stop-NG
> --- Dummy       2011-06-30 17:43:37.000000000 +0900
> +++ Dummy-stop-NG       2012-02-28 19:11:12.850207767 +0900
> @@ -108,6 +108,8 @@ dummy_start() {
>  }
>
>  dummy_stop() {
> +    exit $OCF_ERR_GENERIC
> +
>     dummy_monitor
>     if [ $? =  $OCF_SUCCESS ]; then
>        rm ${OCF_RESKEY_state}
>
>
>
> Before the test, the resource is running on "bl460g6a".
>
> # crm_simulate -S -x pe-input-1.bz2
>
> Current cluster status:
> Online: [ bl460g6a bl460g6b ]
>
>  dummy01        (ocf::heartbeat:Dummy-stop-NG): Stopped
>
> Transition Summary:
> crm_simulate[14195]: 2012/02/29_15:46:57 notice: LogActions: Start dummy01    (bl460g6a)
>
> Executing cluster transition:
>  * Executing action 6: dummy01_monitor_0 on bl460g6b
>  * Executing action 4: dummy01_monitor_0 on bl460g6a
>  * Executing action 7: dummy01_start_0 on bl460g6a
>  * Executing action 8: dummy01_monitor_7000 on bl460g6a
>
> Revised cluster status:
> Online: [ bl460g6a bl460g6b ]
>
>  dummy01        (ocf::heartbeat:Dummy-stop-NG): Started bl460g6a
>
>
>
> Stop Pacemaker on "bl460g6a".
> # service heartbeat stop
>
> Pacemaker first tries to stop the resource and move it to "bl460g6b":
> # crm_simulate -S -x pe-input-2.bz2
>
> Current cluster status:
> Online: [ bl460g6a bl460g6b ]
>
>  dummy01        (ocf::heartbeat:Dummy-stop-NG): Started bl460g6a
>
> Transition Summary:
> crm_simulate[12195]: 2012/02/29_15:35:02 notice: LogActions: Move dummy01    (Started bl460g6a -> bl460g6b)
>
> Executing cluster transition:
>  * Executing action 6: dummy01_stop_0 on bl460g6a
>  * Executing action 7: dummy01_start_0 on bl460g6b
>  * Executing action 8: dummy01_monitor_7000 on bl460g6b
>
> Revised cluster status:
> Online: [ bl460g6a bl460g6b ]
>
>  dummy01        (ocf::heartbeat:Dummy-stop-NG): Started bl460g6b
>
>
>
> However, the stop fails, and the resource goes into the unmanaged state:
> # crm_simulate -S -x pe-input-3.bz2
>
> Current cluster status:
> Online: [ bl460g6a bl460g6b ]
>
>  dummy01        (ocf::heartbeat:Dummy-stop-NG): Started bl460g6a (unmanaged) FAILED
>
> Transition Summary:
>
> Executing cluster transition:
>
> Revised cluster status:
> Online: [ bl460g6a bl460g6b ]
>
>  dummy01        (ocf::heartbeat:Dummy-stop-NG): Started bl460g6a (unmanaged) FAILED
>
>
>
> The Pacemaker shutdown on "bl460g6a" then succeeds,
> so the following patch seems to work well:
> https://github.com/ClusterLabs/pacemaker/commit/07976fe5eb04c432f1d1c9aebb1b1587ba7f0bcf#pengine/graph.c
>
> At this point, the resource on "bl460g6a" (where Pacemaker has already shut down)
> might still be running, because its stop failed.

This is because we ignore the status section of any offline nodes when
stonith-enabled=false.
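
As a rough illustration of that rule, here is a toy model in Python. It is not Pacemaker's code: the names `NodeStatus`, `considered_history` and `failed_stop_visible` are made up for this sketch, and it assumes that the history of an offline node is still considered when stonith is enabled.

```python
from dataclasses import dataclass, field


@dataclass
class NodeStatus:
    """Hypothetical stand-in for one node's entry in the status section."""
    name: str
    online: bool
    # resource name -> result of the last stop operation ("ok" or "failed")
    last_stop: dict = field(default_factory=dict)


def considered_history(nodes, stonith_enabled):
    """Return the nodes whose recorded history the toy scheduler looks at.

    Assumption for this sketch: with stonith disabled, offline nodes are
    skipped entirely; with stonith enabled, every node is considered.
    """
    if stonith_enabled:
        return list(nodes)
    return [node for node in nodes if node.online]


def failed_stop_visible(resource, nodes, stonith_enabled):
    """True if a failed stop of `resource` appears in the considered history."""
    return any(
        node.last_stop.get(resource) == "failed"
        for node in considered_history(nodes, stonith_enabled)
    )
```

With "bl460g6a" offline and holding a failed stop for dummy01, `failed_stop_visible` returns False when `stonith_enabled` is False, which is why nothing in this model would stop a manual start on "bl460g6b".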

> In fact, the resource did not start on "bl460g6b" after the stop NG and
> the shutdown of "bl460g6a", which is the expected behavior,
> but I could start it on "bl460g6b" with the crm command.
> This could lead to an unintended active/active state.
> Is it possible to prevent its start in this situation?

Only by disabling the logic in
   https://github.com/ClusterLabs/pacemaker/commit/07976fe5eb04c432f1d1c9aebb1b1587ba7f0bcf#pengine/graph.c
when stonith is disabled.

> For example,
> (1) Dummy runs on node-a
> (2) Pacemaker is shut down on node-a, and the Dummy stop fails (stop NG)
> (3) Dummy cannot run on other nodes
> (4) * clean up the unmanaged status of Dummy after checking it manually
> on node-a
> (5) * start Dummy on other nodes
> This would be the safe way.
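
The steps proposed above can be sketched as a small state tracker. This is only a model of the requested behaviour under stated assumptions, not something Pacemaker implements: the class `StopFailureGuard` and its methods are hypothetical names invented for this example.

```python
class StopFailureGuard:
    """Toy model: block starts of a resource until failed stops are cleaned up."""

    def __init__(self):
        # resource name -> set of node names where a stop failed
        self._failed_stops = {}

    def record_stop_failure(self, resource, node):
        """Step (2): remember that `resource` failed to stop on `node`."""
        self._failed_stops.setdefault(resource, set()).add(node)

    def cleanup(self, resource, node):
        """Step (4): an operator confirms the state on `node` and clears it."""
        nodes = self._failed_stops.get(resource, set())
        nodes.discard(node)
        if not nodes:
            self._failed_stops.pop(resource, None)

    def can_start(self, resource):
        """Steps (3) and (5): allow a start only if no failed stop is pending."""
        return resource not in self._failed_stops
```

In this model, `can_start("dummy01")` stays False after `record_stop_failure("dummy01", "node-a")` and becomes True again only once `cleanup("dummy01", "node-a")` has been called, even if node-a has left the cluster in the meantime.
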
>
> See attached hb_report.
>
> Thanks,
> Junko IKEDA
>
> NTT DATA INTELLILINK CORPORATION