[ClusterLabs] Misunderstanding or bug in crm_simulate output

Ken Gaillot kgaillot at redhat.com
Thu Jan 18 16:54:33 UTC 2018


On Thu, 2018-01-18 at 16:15 +0100, Jehan-Guillaume de Rorthais wrote:
> Hi list,
> 
> I was explaining how to use crm_simulate to a colleague when he
> pointed out some unexpected, apparently buggy output to me.
> 
> Here are some simple steps to reproduce:
> 
>   $ pcs cluster setup --name usecase srv1 srv2 srv3
>   $ pcs cluster start --all
>   $ pcs property set stonith-enabled=false
>   $ pcs resource create dummy1 ocf:heartbeat:Dummy \
>     state=/tmp/dummy1.state                        \
>     op monitor interval=10s                        \
>     meta migration-threshold=3 resource-stickiness=1
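> 
> Before injecting failures, it may help to confirm where dummy1 is
> running (srv1 in the run shown below), e.g. with a one-shot status
> check:
> 
>   $ crm_mon -1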
> 
> Now, we inject 2 soft monitor errors, each triggering a local
> recovery (stop/start):
> 
>   $ crm_simulate -S -L -i dummy1_monitor_10@srv1=1 -O /tmp/step1.xml
>   $ crm_simulate -S -x /tmp/step1.xml -i dummy1_monitor_10@srv1=1 \
>     -O /tmp/step2.xml
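> 
> The injected failures end up as fail-count node attributes in the
> generated CIB files; a plain grep is enough to sanity-check that they
> were recorded (the exact nvpair layout depends on the Pacemaker
> version):
> 
>   $ grep fail-count /tmp/step2.xml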
> 
> 
> So far so good. A third soft monitor error pushes dummy1 off srv1,
> as expected. However, the final cluster status shows dummy1 as
> started on both srv1 and srv2!
> 
>   $ crm_simulate -S -x /tmp/step2.xml -i dummy1_monitor_10@srv1=1 \
>     -O /tmp/step3.xml
> 
>   Current cluster status:
>   Online: [ srv1 srv2 srv3 ]
> 
>    dummy1	(ocf::heartbeat:Dummy):	Started srv1
> 
>   Performing requested modifications
>    + Injecting dummy1_monitor_10@srv1=1 into the configuration
>    + Injecting attribute fail-count-dummy1=value++ into /node_state '1'
>    + Injecting attribute last-failure-dummy1=1516287891 into /node_state '1'
> 
>   Transition Summary:
>    * Recover    dummy1     ( srv1 -> srv2 )  
> 
>   Executing cluster transition:
>    * Cluster action:  clear_failcount for dummy1 on srv1
>    * Resource action: dummy1          stop on srv1
>    * Resource action: dummy1          cancel=10 on srv1
>    * Pseudo action:   all_stopped
>    * Resource action: dummy1          start on srv2
>    * Resource action: dummy1          monitor=10000 on srv2
> 
>   Revised cluster status:
>   Online: [ srv1 srv2 srv3 ]
> 
>    dummy1	(ocf::heartbeat:Dummy):	Started[ srv1 srv2 ]
> 
> I suppose this is a bug in crm_simulate? Why does it consider dummy1
> started on srv1 when the executed transition stopped it there?

It's definitely a bug, either in crm_simulate or the policy engine
itself. Can you attach step2.xml?
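
If the full file is awkward to attach, the part that matters most is
the dummy1 operation history in the status section; something along
these lines would pull it out (assuming xmllint is available; any XML
tool works):

  $ xmllint --xpath '//lrm_resource[@id="dummy1"]' /tmp/step2.xml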

> 
> Feeding this weird step3.xml result back in forces the cluster to
> stop dummy1 everywhere and start it on srv2 only:
> 
>   $ crm_simulate -S -x /tmp/step3.xml 
> 
>   Current cluster status:
>   Online: [ srv1 srv2 srv3 ]
> 
>    dummy1	(ocf::heartbeat:Dummy):	Started[ srv1 srv2 ]
> 
>   Transition Summary:
>    * Move       dummy1     ( srv1 -> srv2 )  
> 
>   Executing cluster transition:
>    * Resource action: dummy1          stop on srv2
>    * Resource action: dummy1          stop on srv1
>    * Pseudo action:   all_stopped
>    * Resource action: dummy1          start on srv2
>    * Resource action: dummy1          monitor=10000 on srv2
> 
>   Revised cluster status:
>   Online: [ srv1 srv2 srv3 ]
> 
>    dummy1	(ocf::heartbeat:Dummy):	Started srv2
> 
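> As an aside, on a live cluster (as opposed to these simulation
> files), the third failure leaves dummy1 banned from srv1 until its
> fail count is cleared, e.g. with:
> 
>   $ pcs resource cleanup dummy1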
> 
> 
> Thoughts?
> 
> _______________________________________________
> Users mailing list: Users at clusterlabs.org
> http://lists.clusterlabs.org/mailman/listinfo/users
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
-- 
Ken Gaillot <kgaillot at redhat.com>
