[ClusterLabs] Misunderstanding or bug in crm_simulate output

Thu Jan 18 10:15:38 EST 2018

Hi list,

I was explaining how to use crm_simulate to a colleague when he pointed out some
unexpected output that looks like a bug.

Here are some simple steps to reproduce:

  $ pcs cluster setup --name usecase srv1 srv2 srv3
  $ pcs cluster start --all
  $ pcs property set stonith-enabled=false
  $ pcs resource create dummy1 ocf:heartbeat:Dummy \
    state=/tmp/dummy1.state                        \
    op monitor interval=10s                        \
    meta migration-threshold=3 resource-stickiness=1

Now we inject two soft monitor errors, each triggering a local recovery
(stop/start):

  $ crm_simulate -S -L -i dummy1_monitor_10@srv1=1 -O /tmp/step1.xml
  $ crm_simulate -S -x /tmp/step1.xml -i dummy1_monitor_10@srv1=1 \
    -O /tmp/step2.xml
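
For anyone who wants to replay the sequence, the chained calls can be generated
by a small loop. This is only a sketch: the function name print_inject_steps is
made up for illustration, the injection spec is written with "@"
(dummy1_monitor_10@srv1=1), and the function only prints the commands so they
can be reviewed before being run on a test cluster.

```shell
#!/bin/sh
# Sketch: print the chained crm_simulate commands for N injected failures.
# print_inject_steps is a hypothetical helper name; the /tmp/stepN.xml file
# names follow the steps shown in this message.
print_inject_steps() {
    count=$1
    op='dummy1_monitor_10@srv1=1'   # <resource>_<op>_<interval>@<node>=<rc>
    i=1
    while [ "$i" -le "$count" ]; do
        if [ "$i" -eq 1 ]; then
            src='-L'                            # first step reads the live CIB
        else
            src="-x /tmp/step$((i - 1)).xml"    # later steps read the previous output
        fi
        echo "crm_simulate -S $src -i $op -O /tmp/step$i.xml"
        i=$((i + 1))
    done
}

print_inject_steps 3
```

Each printed line takes the previous step's output file as its input, so the
failures accumulate from one simulated state to the next.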

So far so good. A third soft monitor error pushes dummy1 off srv1, as
expected. However, the final cluster status shows dummy1 as started on both
srv1 and srv2!

  $ crm_simulate -S -x /tmp/step2.xml -i dummy1_monitor_10@srv1=1 \
    -O /tmp/step3.xml

  Current cluster status:
  Online: [ srv1 srv2 srv3 ]

   dummy1	(ocf::heartbeat:Dummy):	Started srv1

  Performing requested modifications
   + Injecting dummy1_monitor_10@srv1=1 into the configuration
   + Injecting attribute fail-count-dummy1=value++ into /node_state '1'
   + Injecting attribute last-failure-dummy1=1516287891 into /node_state '1'

  Transition Summary:
   * Recover    dummy1     ( srv1 -> srv2 )  

  Executing cluster transition:
   * Cluster action:  clear_failcount for dummy1 on srv1
   * Resource action: dummy1          stop on srv1
   * Resource action: dummy1          cancel=10 on srv1
   * Pseudo action:   all_stopped
   * Resource action: dummy1          start on srv2
   * Resource action: dummy1          monitor=10000 on srv2

  Revised cluster status:
  Online: [ srv1 srv2 srv3 ]

   dummy1	(ocf::heartbeat:Dummy):	Started[ srv1 srv2 ]

I suppose this is a bug in crm_simulate? Why does it consider dummy1 started on
srv1 when the transition it just executed stopped it on srv1?
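
To catch this in scripted runs, the report can be scanned for resources listed
with the "Started[ ... ]" form in the revised status. A minimal sketch, assuming
the output layout pasted above; the function name multi_started is made up for
illustration, and it reads a crm_simulate report on standard input.

```shell
#!/bin/sh
# Sketch: print the names of resources that the "Revised cluster status"
# section reports as started on several nodes, i.e. "Started[ srv1 srv2 ]".
# multi_started is a hypothetical helper; it assumes the text layout shown
# in this message and ignores the "Current cluster status" section.
multi_started() {
    awk '
        /^[[:space:]]*Revised cluster status:/ { revised = 1; next }
        revised && /Started\[/ { print $1 }   # first field is the resource name
    '
}

# Example (on a test cluster):
#   crm_simulate -S -x /tmp/step3.xml | multi_started
```

An empty result means no resource was reported on more than one node after the
simulated transition.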

Feeding the step3.xml file from this odd result back into crm_simulate makes the
cluster stop dummy1 everywhere and start it on srv2 only:

  $ crm_simulate -S -x /tmp/step3.xml 

  Current cluster status:
  Online: [ srv1 srv2 srv3 ]

   dummy1	(ocf::heartbeat:Dummy):	Started[ srv1 srv2 ]

  Transition Summary:
   * Move       dummy1     ( srv1 -> srv2 )  

  Executing cluster transition:
   * Resource action: dummy1          stop on srv2
   * Resource action: dummy1          stop on srv1
   * Pseudo action:   all_stopped
   * Resource action: dummy1          start on srv2
   * Resource action: dummy1          monitor=10000 on srv2

  Revised cluster status:
  Online: [ srv1 srv2 srv3 ]

   dummy1	(ocf::heartbeat:Dummy):	Started srv2

Thoughts?