[ClusterLabs] Understanding advisory resource ordering

Fri Dec 20 07:37:55 EST 2019

Hi all,

TL;DR: Apparently, both the order of adding resources to the CIB and
the number of CPU cores on a pacemaker node might impact advisory
(optional) startorder constraints. This makes it difficult to configure
them for full cluster restarts, e.g. after a power outage, when the
number of resources affected exceeds twice the number of cores. I'm
looking for advice of how to correctly configure this scenario. What
seems to help is setting the parameter node-action-limit to some high
value. The resources are started in the correct order, but some start
sequentially that would be allowed to start in parallel.

Some background and how to reproduce:

We have a two-node DRBD, qemu/kvm, pacemaker cluster setup, based on
Ubuntu 16.04 with the stock Pacemaker 1.1.14-2ubuntu1.6. We struggle
with advisory (optional) resource ordering:

a) We successfully have been using mandatory, symmetrical resource
ordering for ages now, basically sequencing the DRBD promote and
qemu/kvm startup. No issues here.

b) On clusters with high load, we intend to use asymmetric advisory
resource ordering for failover and cluster startup after power outages.
This works perfectly in synthetic test cases, starting the resources
simultaneously via the crm shell, but fails on a full pacemaker
restart. Even when start decisions appear to occur at the same point
in time, we see real startup in a seemingly random order. While some
advisory order constraints appear to be effective, others are not.

After lots of logfile analyzing and test setups, it seems that the
order of resource definitions in the raw CIB xml is significant. It can
be reproduced with a simple setup on a single pacemaker instance, using
default dummy resource agents. We've used a tiny virtual machine for
this simple test (2 cores, 4g RAM)).

   crm configure << 'EOF'
   property cib-bootstrap-options: \
       stonith-enabled=false no-quorum-policy=ignore
   primitive VM-Step1 Delay params startdelay=5 stopdelay=2 mondelay=0
   primitive VM-Step2a Delay params startdelay=5 stopdelay=2 mondelay=0
   primitive VM-Step2b Delay params startdelay=5 stopdelay=2 mondelay=0
   primitive VM-Step3 Delay params startdelay=5 stopdelay=2 mondelay=0
   primitive VM-Step0 Delay params startdelay=5 stopdelay=2 mondelay=0
   order vm-startorder Optional: \
       ( VM-Step0 ) ( VM-Step1 ) ( VM-Step2a VM-Step2b ) ( VM-Step3 ) \
       symmetrical=false
   EOF

To easily watch the order of resources starting, using "watch" is very
convenient:

   watch -n 0,5 crm status

Notably, "crm configure show" prints a sorted list with "Step0" as the
first primitive, while "crm status" maintains the original order of
adding the resources, with "Step0" as the last resource. In this simple
setup, both

   crm resource restart "*"
   systemctl restart pacemaker

correctly start resources according to the advisory vm-startorder
constraint. Now adding dummy DRBD resources, including their
master-slave-clone, colocation and mandatory order constraints:

   for res in Step1 Step2a Step2b Step3 Step0 ; do
   cat << EOF
   primitive DRBD-$res ocf:pacemaker:Stateful
   ms ms-$res DRBD-$res \
       meta clone-max=2 clone-node-max=1 \
       master-max=1 master-node-max=1 notify=true
   colocation coloc.$res-vm-with-drbd.Master inf: VM-$res ms-$res:Master
   order order.$res-vm-after-drbd ms-$res:promote VM-$res:start
   EOF
   done | crm configure

Again, crm resource restart "*" sequences the startup according to
the vm-startorder constraint. But with systemctl restart pacemaker,
VM-Step0 starts as the *last* resource. Sometimes, the remaining
parts of the startorder constraint seem to be effective (VM-Step2a
and VM-Step2b do start simultaneously), while on other instances,
Step3 and Step0 start concurrently.

All in all, this might be a big race condition, as the DRBD resources
need to be started prior to starting the associated VM. My guess would
have been that, at least in this simple test case, all promotes would
succeed in much less than one second, so effectively start all VMs
simultaneously.

Looking at the logfiles of the "crm resource restart" case, we see that
all DRBD start actions are started in transition 12, and all DRBD
promote and VM start actions follow in transition 13.

Dec 17 11:32:15 [27511] hanode1    pengine:   notice: LogActions:	Start 
  DRBD-Step1:0	(hanode1)
Dec 17 11:32:15 [27511] hanode1    pengine:   notice: LogActions:	Start 
  DRBD-Step2a:0	(hanode1)
Dec 17 11:32:15 [27511] hanode1    pengine:   notice: LogActions:	Start 
  DRBD-Step2b:0	(hanode1)
Dec 17 11:32:15 [27511] hanode1    pengine:   notice: LogActions:	Start 
  DRBD-Step3:0	(hanode1)
Dec 17 11:32:15 [27511] hanode1    pengine:   notice: LogActions:	Start 
  DRBD-Step0:0	(hanode1)
Dec 17 11:32:15 [27511] hanode1    pengine:   notice: 
process_pe_message:	Calculated Transition 12: 
/var/lib/pacemaker/pengine/pe-input-193.bz2
Dec 17 11:32:15 [27511] hanode1    pengine:   notice: LogActions:	Start 
  VM-Step1	(hanode1)
Dec 17 11:32:15 [27511] hanode1    pengine:   notice: LogActions:	Start 
  VM-Step2a	(hanode1)
Dec 17 11:32:15 [27511] hanode1    pengine:   notice: LogActions:	Start 
  VM-Step2b	(hanode1)
Dec 17 11:32:15 [27511] hanode1    pengine:   notice: LogActions:	Start 
  VM-Step3	(hanode1)
Dec 17 11:32:15 [27511] hanode1    pengine:   notice: LogActions: 
Promote DRBD-Step1:0	(Slave -> Master hanode1)
Dec 17 11:32:15 [27511] hanode1    pengine:   notice: LogActions: 
Promote DRBD-Step2a:0	(Slave -> Master hanode1)
Dec 17 11:32:15 [27511] hanode1    pengine:   notice: LogActions: 
Promote DRBD-Step2b:0	(Slave -> Master hanode1)
Dec 17 11:32:15 [27511] hanode1    pengine:   notice: LogActions: 
Promote DRBD-Step3:0	(Slave -> Master hanode1)
Dec 17 11:32:15 [27511] hanode1    pengine:   notice: LogActions:	Start 
  DRBD-Step0:0	(hanode1)
Dec 17 11:32:15 [27511] hanode1    pengine:   notice: 
process_pe_message:	Calculated Transition 13: 
/var/lib/pacemaker/pengine/pe-input-194.bz2
Dec 17 11:32:15 [27511] hanode1    pengine:   notice: LogActions:	Start 
  VM-Step1	(hanode1)
Dec 17 11:32:15 [27511] hanode1    pengine:   notice: LogActions:	Start 
  VM-Step2a	(hanode1)
Dec 17 11:32:15 [27511] hanode1    pengine:   notice: LogActions:	Start 
  VM-Step2b	(hanode1)
Dec 17 11:32:15 [27511] hanode1    pengine:   notice: LogActions:	Start 
  VM-Step3	(hanode1)
Dec 17 11:32:15 [27511] hanode1    pengine:   notice: LogActions:	Start 
  VM-Step0	(hanode1)
Dec 17 11:32:15 [27511] hanode1    pengine:   notice: LogActions: 
Promote DRBD-Step3:0	(Slave -> Master hanode1)
Dec 17 11:32:15 [27511] hanode1    pengine:   notice: LogActions: 
Promote DRBD-Step0:0	(Slave -> Master hanode1)
Dec 17 11:32:15 [27511] hanode1    pengine:   notice: 
process_pe_message:	Calculated Transition 14: 
/var/lib/pacemaker/pengine/pe-input-195.bz2
Dec 17 11:32:15 [27511] hanode1    pengine:   notice: LogActions:	Start 
  VM-Step1	(hanode1)
Dec 17 11:32:15 [27511] hanode1    pengine:   notice: LogActions:	Start 
  VM-Step2a	(hanode1)
Dec 17 11:32:15 [27511] hanode1    pengine:   notice: LogActions:	Start 
  VM-Step2b	(hanode1)
Dec 17 11:32:15 [27511] hanode1    pengine:   notice: LogActions:	Start 
  VM-Step3	(hanode1)
Dec 17 11:32:15 [27511] hanode1    pengine:   notice: LogActions:	Start 
  VM-Step0	(hanode1)
Dec 17 11:32:15 [27511] hanode1    pengine:   notice: 
process_pe_message:	Calculated Transition 15: 
/var/lib/pacemaker/pengine/pe-input-196.bz2

But in the "systemctl restart pacemaker" case, the DRBD start actions
are all in transition 0. Only VM-Step1 and VM-Step2a follow in
transition 1, with the other VM transitions following in subsequent
transitions. The VM resources appear in the transitions in the same
order as they were added to the CIB.

Dec 17 11:31:19 [27511] hanode1    pengine:   notice: LogActions:	Start 
  DRBD-Step1:0	(hanode1)
Dec 17 11:31:19 [27511] hanode1    pengine:   notice: LogActions:	Start 
  DRBD-Step2a:0	(hanode1)
Dec 17 11:31:19 [27511] hanode1    pengine:   notice: LogActions:	Start 
  DRBD-Step2b:0	(hanode1)
Dec 17 11:31:19 [27511] hanode1    pengine:   notice: LogActions:	Start 
  DRBD-Step3:0	(hanode1)
Dec 17 11:31:19 [27511] hanode1    pengine:   notice: LogActions:	Start 
  DRBD-Step0:0	(hanode1)
Dec 17 11:31:19 [27511] hanode1    pengine:   notice: 
process_pe_message:	Calculated Transition 0: 
/var/lib/pacemaker/pengine/pe-input-181.bz2
Dec 17 11:31:19 [27511] hanode1    pengine:   notice: LogActions:	Start 
  VM-Step1	(hanode1)
Dec 17 11:31:19 [27511] hanode1    pengine:   notice: LogActions:	Start 
  VM-Step2a	(hanode1)
Dec 17 11:31:19 [27511] hanode1    pengine:   notice: LogActions: 
Promote DRBD-Step1:0	(Slave -> Master hanode1)
Dec 17 11:31:19 [27511] hanode1    pengine:   notice: LogActions: 
Promote DRBD-Step2a:0	(Slave -> Master hanode1)
Dec 17 11:31:19 [27511] hanode1    pengine:   notice: LogActions:	Start 
  DRBD-Step2b:0	(hanode1)
Dec 17 11:31:19 [27511] hanode1    pengine:   notice: LogActions:	Start 
  DRBD-Step3:0	(hanode1)
Dec 17 11:31:19 [27511] hanode1    pengine:   notice: LogActions:	Start 
  DRBD-Step0:0	(hanode1)
Dec 17 11:31:19 [27511] hanode1    pengine:   notice: 
process_pe_message:	Calculated Transition 1: 
/var/lib/pacemaker/pengine/pe-input-182.bz2
Dec 17 11:31:19 [27511] hanode1    pengine:   notice: LogActions:	Start 
  VM-Step1	(hanode1)
Dec 17 11:31:19 [27511] hanode1    pengine:   notice: LogActions:	Start 
  VM-Step2a	(hanode1)
Dec 17 11:31:19 [27511] hanode1    pengine:   notice: LogActions:	Start 
  VM-Step2b	(hanode1)
Dec 17 11:31:19 [27511] hanode1    pengine:   notice: LogActions: 
Promote DRBD-Step2a:0	(Slave -> Master hanode1)
Dec 17 11:31:19 [27511] hanode1    pengine:   notice: LogActions: 
Promote DRBD-Step2b:0	(Slave -> Master hanode1)
Dec 17 11:31:19 [27511] hanode1    pengine:   notice: LogActions:	Start 
  DRBD-Step3:0	(hanode1)
Dec 17 11:31:19 [27511] hanode1    pengine:   notice: LogActions:	Start 
  DRBD-Step0:0	(hanode1)
Dec 17 11:31:19 [27511] hanode1    pengine:   notice: 
process_pe_message:	Calculated Transition 2: 
/var/lib/pacemaker/pengine/pe-input-183.bz2
Dec 17 11:31:24 [27511] hanode1    pengine:   notice: LogActions:	Start 
  VM-Step2a	(hanode1)
Dec 17 11:31:24 [27511] hanode1    pengine:   notice: LogActions:	Start 
  VM-Step2b	(hanode1)
Dec 17 11:31:24 [27511] hanode1    pengine:   notice: LogActions: 
Promote DRBD-Step2b:0	(Slave -> Master hanode1)
Dec 17 11:31:24 [27511] hanode1    pengine:   notice: LogActions:	Start 
  DRBD-Step3:0	(hanode1)
Dec 17 11:31:24 [27511] hanode1    pengine:   notice: LogActions:	Start 
  DRBD-Step0:0	(hanode1)
Dec 17 11:31:24 [27511] hanode1    pengine:   notice: 
process_pe_message:	Calculated Transition 3: 
/var/lib/pacemaker/pengine/pe-input-184.bz2
Dec 17 11:31:29 [27511] hanode1    pengine:   notice: LogActions:	Start 
  VM-Step2b	(hanode1)
Dec 17 11:31:29 [27511] hanode1    pengine:   notice: LogActions:	Start 
  VM-Step3	(hanode1)
Dec 17 11:31:29 [27511] hanode1    pengine:   notice: LogActions: 
Promote DRBD-Step2b:0	(Slave -> Master hanode1)
Dec 17 11:31:29 [27511] hanode1    pengine:   notice: LogActions: 
Promote DRBD-Step3:0	(Slave -> Master hanode1)
Dec 17 11:31:29 [27511] hanode1    pengine:   notice: LogActions:	Start 
  DRBD-Step0:0	(hanode1)
Dec 17 11:31:29 [27511] hanode1    pengine:   notice: 
process_pe_message:	Calculated Transition 4: 
/var/lib/pacemaker/pengine/pe-input-185.bz2
Dec 17 11:31:29 [27511] hanode1    pengine:   notice: LogActions:	Start 
  VM-Step2b	(hanode1)
Dec 17 11:31:29 [27511] hanode1    pengine:   notice: LogActions:	Start 
  VM-Step3	(hanode1)
Dec 17 11:31:29 [27511] hanode1    pengine:   notice: LogActions:	Start 
  VM-Step0	(hanode1)
Dec 17 11:31:29 [27511] hanode1    pengine:   notice: LogActions: 
Promote DRBD-Step0:0	(Slave -> Master hanode1)
Dec 17 11:31:29 [27511] hanode1    pengine:   notice: 
process_pe_message:	Calculated Transition 5: 
/var/lib/pacemaker/pengine/pe-input-186.bz2
Dec 17 11:31:34 [27511] hanode1    pengine:   notice: LogActions:	Start 
  VM-Step3	(hanode1)
Dec 17 11:31:34 [27511] hanode1    pengine:   notice: LogActions:	Start 
  VM-Step0	(hanode1)
Dec 17 11:31:34 [27511] hanode1    pengine:   notice: 
process_pe_message:	Calculated Transition 6: 
/var/lib/pacemaker/pengine/pe-input-187.bz2

Double-checking with pacemaker 1.1.18-2ubuntu1.19.04.1 on Ubuntu 19.04
showed a different behaviour, which seems to be triggered by the number
of CPUs (was 2 in the first case, now 4), not the pacemaker version:
Doubling the number of resources again shows the same pacemaker restart
behaviour.

One more try: using the doubled amount of resources on the older Ubuntu
system (release 16.04 with only 2 cores), even the crm res restart "*"
operation breaks the advisory startorder constraint.

Looking for a way to optimize the setup in order to serialize the
restart operation I was searching for means of convincing the policy
engine to trigger all start events in the same transition. Changing the
batch-limit did not show any effect in my tests. I then realized that
crmd has an option "node-action-limit" which defaults to twice the
number of cores, according to the manpage. So increasing it to some
randomly high value (and restarting pacemaker to make sure the value is
actually in effect):

   crm configure property node-action-limit=100
   systemctl restart pacemaker

Now the startorder constraints are all fulfilled. This is also
reflected by the logfile with all VM start events in one transition:

Dec 18 15:07:08 [10676] hanode1    pengine:   notice: LogActions: 
Start   DRBD-Step1:0    (hanode1)
Dec 18 15:07:08 [10676] hanode1    pengine:   notice: LogActions: 
Start   DRBD-Step2a:0   (hanode1)
Dec 18 15:07:08 [10676] hanode1    pengine:   notice: LogActions: 
Start   DRBD-Step2b:0   (hanode1)
Dec 18 15:07:08 [10676] hanode1    pengine:   notice: LogActions: 
Start   DRBD-Step3:0    (hanode1)
Dec 18 15:07:08 [10676] hanode1    pengine:   notice: LogActions: 
Start   DRBD-Step4:0    (hanode1)
Dec 18 15:07:08 [10676] hanode1    pengine:   notice: LogActions: 
Start   DRBD-Step5:0    (hanode1)
Dec 18 15:07:08 [10676] hanode1    pengine:   notice: LogActions: 
Start   DRBD-Step6a:0   (hanode1)
Dec 18 15:07:08 [10676] hanode1    pengine:   notice: LogActions: 
Start   DRBD-Step6b:0   (hanode1)
Dec 18 15:07:08 [10676] hanode1    pengine:   notice: LogActions: 
Start   DRBD-Step7:0    (hanode1)
Dec 18 15:07:08 [10676] hanode1    pengine:   notice: LogActions: 
Start   DRBD-Step0:0    (hanode1)
Dec 18 15:07:08 [10676] hanode1    pengine:   notice: 
process_pe_message:   Calculated Transition 0: 
/var/lib/pacemaker/pengine/pe-input-697.bz2
Dec 18 15:07:08 [10676] hanode1    pengine:   notice: LogActions: 
Start   VM-Step1        (hanode1)
Dec 18 15:07:08 [10676] hanode1    pengine:   notice: LogActions: 
Start   VM-Step2a       (hanode1)
Dec 18 15:07:08 [10676] hanode1    pengine:   notice: LogActions: 
Start   VM-Step2b       (hanode1)
Dec 18 15:07:08 [10676] hanode1    pengine:   notice: LogActions: 
Start   VM-Step3        (hanode1)
Dec 18 15:07:08 [10676] hanode1    pengine:   notice: LogActions: 
Start   VM-Step4        (hanode1)
Dec 18 15:07:08 [10676] hanode1    pengine:   notice: LogActions: 
Start   VM-Step5        (hanode1)
Dec 18 15:07:08 [10676] hanode1    pengine:   notice: LogActions: 
Start   VM-Step6a       (hanode1)
Dec 18 15:07:08 [10676] hanode1    pengine:   notice: LogActions: 
Start   VM-Step6b       (hanode1)
Dec 18 15:07:08 [10676] hanode1    pengine:   notice: LogActions: 
Start   VM-Step7        (hanode1)
Dec 18 15:07:08 [10676] hanode1    pengine:   notice: LogActions: 
Start   VM-Step0        (hanode1)
Dec 18 15:07:08 [10676] hanode1    pengine:   notice: LogActions: 
Promote DRBD-Step1:0    (Slave -> Master hanode1)
Dec 18 15:07:08 [10676] hanode1    pengine:   notice: LogActions: 
Promote DRBD-Step2a:0   (Slave -> Master hanode1)
Dec 18 15:07:08 [10676] hanode1    pengine:   notice: LogActions: 
Promote DRBD-Step2b:0   (Slave -> Master hanode1)
Dec 18 15:07:08 [10676] hanode1    pengine:   notice: LogActions: 
Promote DRBD-Step3:0    (Slave -> Master hanode1)
Dec 18 15:07:08 [10676] hanode1    pengine:   notice: LogActions: 
Promote DRBD-Step4:0    (Slave -> Master hanode1)
Dec 18 15:07:08 [10676] hanode1    pengine:   notice: LogActions: 
Promote DRBD-Step5:0    (Slave -> Master hanode1)
Dec 18 15:07:08 [10676] hanode1    pengine:   notice: LogActions: 
Promote DRBD-Step6a:0   (Slave -> Master hanode1)
Dec 18 15:07:08 [10676] hanode1    pengine:   notice: LogActions: 
Promote DRBD-Step6b:0   (Slave -> Master hanode1)
Dec 18 15:07:08 [10676] hanode1    pengine:   notice: LogActions: 
Promote DRBD-Step7:0    (Slave -> Master hanode1)
Dec 18 15:07:08 [10676] hanode1    pengine:   notice: LogActions: 
Promote DRBD-Step0:0    (Slave -> Master hanode1)
Dec 18 15:07:08 [10676] hanode1    pengine:   notice: 
process_pe_message:   Calculated Transition 1: 
/var/lib/pacemaker/pengine/pe-input-698.bz2
Dec 18 15:07:08 [10676] hanode1    pengine:   notice: LogActions: 
Start   VM-Step1        (hanode1)
Dec 18 15:07:08 [10676] hanode1    pengine:   notice: LogActions: 
Start   VM-Step2a       (hanode1)
Dec 18 15:07:08 [10676] hanode1    pengine:   notice: LogActions: 
Start   VM-Step2b       (hanode1)
Dec 18 15:07:08 [10676] hanode1    pengine:   notice: LogActions: 
Start   VM-Step3        (hanode1)
Dec 18 15:07:08 [10676] hanode1    pengine:   notice: LogActions: 
Start   VM-Step4        (hanode1)
Dec 18 15:07:08 [10676] hanode1    pengine:   notice: LogActions: 
Start   VM-Step5        (hanode1)
Dec 18 15:07:08 [10676] hanode1    pengine:   notice: LogActions: 
Start   VM-Step6a       (hanode1)
Dec 18 15:07:08 [10676] hanode1    pengine:   notice: LogActions: 
Start   VM-Step6b       (hanode1)
Dec 18 15:07:08 [10676] hanode1    pengine:   notice: LogActions: 
Start   VM-Step7        (hanode1)
Dec 18 15:07:08 [10676] hanode1    pengine:   notice: LogActions: 
Start   VM-Step0        (hanode1)
Dec 18 15:07:08 [10676] hanode1    pengine:   notice: 
process_pe_message:   Calculated Transition 2: 
/var/lib/pacemaker/pengine/pe-input-699.bz2

In a real world setup, we would propably have to calculate the limit or
use an insanely big number. Now some questions remain:

* Is this the only / suggested way to have advisory startorder
   constraints work as expected, when the number of resources affected
   exceeds twice the number of cores?
* I would guess this has an impact on throttling. Maybe setting the
   batch-limit to something like twice or four times the number of
   cores would roughly compensate?

Back to the real-world test environment

In our real-world test environment, increasing the node-action-limit
made the startorder constraint basically effective. After some steps,
the order is still correct, but we see the startup of some resources
serialized that should start simultaneously. The first 19 VM resources
showed exactly the expected order, including parallelisms, but of the
remaining 16 VM resources no two would start concurrently, they were all
serialized.

After pacemaker restart, we have Transition 0 with the DRBD actions,
followed 4s later with Transition 1 including all VM actions with
correct ordering. 32s later, there is Transition 2 with fewer actions,
but no new ones. Looking at the dot file generated by crm_simulate, the
order should still be defined correctly. Any idea what could cause this
effect? Which part of the logfiles should I look at? Unfortunately, I
was not able to create a simple setup to reproduce this.

Achim Leitner