[Pacemaker] can we update an attribute with cmpxchg "atomic compare and exchange" semantics?

Wed Sep 10 05:50:58 EDT 2014

Hi Andrew (and others).

For a certain use case (yes, I'm talking about DRBD "peer-fencing" on
loss of replication link), it would be nice to be able to say:

  update some_attribute=some_attribute+1 where some_attribute >= 0

  delete some_attribute where some_attribute=0

Ok, that's not the classic cmpxchg(), more of an atomic_add();
or similar enough. With hopefully just a single cib roundrip.

Let me rephrase:
Update attribute "this_is_pink" (for node-X with ID attr-ID):

  fail if said attr-ID exists elsewhere (not as the intended attribute
  at the intended place in the xml tree)
	(this comes for free already, I think)

  if it does not exist at all, assume it was present with current value 0

  if the current (or assumed current) value is >= 0, add 1

  if the current value is < 0, fail

  (optionally: return new value? old value?)

My intended use case scenario is this:

  Two DRBD nodes, several DRBD resources,
  at least a few of them in "dual-primary".

  Replication link breaks.

  Fence-peer handlers are triggered individually for each resource on
  both nodes, and try to concurrently modify the cib (place fencing
  constraints).

With the current implementation of crm-fence-peer.sh, it is likely that
some DRBD resources "win" on one node, some "win" on the other node.
The respective losers will have their IO blocked.

Which means that most likely on both nodes some DRBD will stay blocked,
some monitor operation will soon fail, some stop operation (to recover
from the monitor fail) will soon fail, and the recovery of that will be
node-level fencing of the affected node.

In short: both nodes will be hard-reset
because of a replication link failure.

If I would instead use a single attribute (with a pre-determined ID) for all
instances of the fence-peer handler, the first to come would "chose" the
victim node, all others would just add their count.
There will be only one loser, and more importantly: one survivor.

Once the replication link is re-established,
DRBD resynchronization will bring the former loser up-to-date,
and the respective after-resync handlers will decrease that "breakage
count". Once the breakage count hits zero, it can and should be deleted.

Presence of the "breakage count" attribute with value > 0 would mean
"this node must not be promoted", which would be a static constraint
to be added to all DRBD resources.

Does that make sense?

(I have more insane proposals, in case we have multiple (more than 2)
 Primaries during normal operation, but I'm not yet able to write them
 down without being seriously confused by myself...)

I could open-code it with shell and cibadmin, btw.
I did a proof-of-concept once that does
  a. cibadmin -Q
  b. some calculations,
     then prepares the update statement xml based on cib content seen,
     *including* the cib generation counters
  c. cibadmin -R (or -C, -M, -D, as appropriate)
     this will fail if the cib was modified in a relevant way since a,
     because of the included generation counters
  d. repeat as necessary

But that is beyond ugly.
And probably fragile.
And would often fail for all the wrong reasons, just because some status
code has changed and bumped the cib generation counters.

What would be needed to add such functionality?
Where would it go?
cibadmin? cib? crm_attribute? possibly also attrd?

Thanks,
	Lars