[ClusterLabs] Misunderstanding or bug in crm_simulate output

Wed Jan 24 18:42:56 EST 2018

On Fri, 2018-01-19 at 00:37 +0100, Jehan-Guillaume de Rorthais wrote:
> On Thu, 18 Jan 2018 10:54:33 -0600
> Ken Gaillot <kgaillot at redhat.com> wrote:
> 
> > On Thu, 2018-01-18 at 16:15 +0100, Jehan-Guillaume de Rorthais
> > wrote:
> > > Hi list,
> > > 
> > > I was explaining how to use crm_simulate to a colleague when he
> > > pointed out some unexpected, seemingly buggy output.
> > > 
> > > Here are some simple steps to reproduce:
> > > 
> > >   $ pcs cluster setup --name usecase srv1 srv2 srv3
> > >   $ pcs cluster start --all
> > >   $ pcs property set stonith-enabled=false
> > >   $ pcs resource create dummy1 ocf:heartbeat:Dummy \
> > >     state=/tmp/dummy1.state                        \
> > >     op monitor interval=10s                        \
> > >     meta migration-threshold=3 resource-stickiness=1
> > > 
> > > Now we inject two soft monitor errors, each triggering a local
> > > recovery (stop/start):
> > > 
> > >   $ crm_simulate -S -L -i dummy1_monitor_10@srv1=1 -O /tmp/step1.xml
> > >   $ crm_simulate -S -x /tmp/step1.xml -i dummy1_monitor_10@srv1=1 \
> > >     -O /tmp/step2.xml
> > > 
> > > 
> > > So far so good. A third soft monitor error pushes dummy1 off
> > > srv1, as expected. However, the final cluster status shows dummy1
> > > as started on both srv1 and srv2!
> > > 
> > >   $ crm_simulate -S -x /tmp/step2.xml -i dummy1_monitor_10@srv1=1 \
> > >     -O /tmp/step3.xml
> > > 
> > >   Current cluster status:
> > >   Online: [ srv1 srv2 srv3 ]
> > > 
> > >    dummy1	(ocf::heartbeat:Dummy):	Started srv1
> > > 
> > >   Performing requested modifications
> > >    + Injecting dummy1_monitor_10@srv1=1 into the configuration
> > >    + Injecting attribute fail-count-dummy1=value++ into /node_state '1'
> > >    + Injecting attribute last-failure-dummy1=1516287891 into /node_state '1'
> > > 
> > >   Transition Summary:
> > >    * Recover    dummy1     ( srv1 -> srv2 )  
> > > 
> > >   Executing cluster transition:
> > >    * Cluster action:  clear_failcount for dummy1 on srv1
> > >    * Resource action: dummy1          stop on srv1
> > >    * Resource action: dummy1          cancel=10 on srv1
> > >    * Pseudo action:   all_stopped
> > >    * Resource action: dummy1          start on srv2
> > >    * Resource action: dummy1          monitor=10000 on srv2
> > > 
> > >   Revised cluster status:
> > >   Online: [ srv1 srv2 srv3 ]
> > > 
> > >    dummy1	(ocf::heartbeat:Dummy):	Started[ srv1 srv2 ]
> > > 
> > > I suppose this is a bug in crm_simulate? Why does it consider
> > > dummy1 started on srv1 when the executed transition stopped it
> > > on srv1?
> > 
> > It's definitely a bug, either in crm_simulate or the policy engine
> > itself. Can you attach step2.xml?
> 
> Sure, please find step2.xml attached.

I can reproduce the issue with 1.1.16 but not 1.1.17 or later, so
whatever it was, it got fixed.
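
For anyone who wants to check their own version, here is a rough sketch of a filter that flags the symptom. It assumes the output layout quoted above (a "Revised cluster status:" header followed by one line per resource, with nodes listed as "Started[ a b ]" when there are several). The function name find_multi_started is made up for illustration and is not part of Pacemaker.

```shell
# Hypothetical helper (not part of Pacemaker): read crm_simulate output on
# stdin and print every resource that the "Revised cluster status" section
# reports as started on more than one node.
# Assumes the text layout shown in this thread; other versions may differ.
find_multi_started() {
    awk '
        # Only look at lines after the revised status header.
        /Revised cluster status:/ { revised = 1; next }
        revised && /Started *\[/ {
            name = $1
            nodes = $0
            # Keep only the node list between the square brackets.
            sub(/^.*Started *\[ */, "", nodes)
            sub(/ *\].*$/, "", nodes)
            n = split(nodes, parts, " ")
            if (n > 1) print name ": " nodes
        }
    '
}
```

It can be fed straight from the simulation, for example:

  $ crm_simulate -S -x /tmp/step2.xml -i dummy1_monitor_10@srv1=1 | find_multi_started

No output means no resource was reported as started on several nodes.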
-- 
Ken Gaillot <kgaillot at redhat.com>