[ClusterLabs] Issues found in Pacemaker 1.1.18, fixes in 1.1 branch

Mon Dec 11 22:10:35 EST 2017

FYI:

A couple of regressions have been found in the recent Pacemaker 1.1.18
release.

Fixes for these, plus one that completes a partial fix in 1.1.18, are in
the master branch and have been backported to the 1.1 branch for ease
of patching. Anyone compiling or packaging 1.1.18 is advised to include
all the commits from the 1.1 branch.

The fixed issues are:

* 1.1.18 improved scalability by eliminating redundant node attribute
write-outs. This proved to be too aggressive in one case: when a
cluster is configured with only IP addresses (no node names) in
corosync.conf, attribute write-outs can be incorrectly delayed. In the
worst case, a node cannot shut down because its shutdown attribute is
never written.

* 1.1.18 overhauled unfencing in order to support it on remote nodes.
(Unfencing is for devices such as fence_scsi that require a fenced node
to be explicitly re-admitted to the cluster.) The overhaul wrongly
assumed that the fence devices themselves could operate before
unfencing happened. As a result, a cluster with unfencing could see
unnecessary fence device monitoring failures (these failures do not
affect the cluster's ability to fence or unfence).

* 1.1.18 implemented ordering constraints for the new bundle resource
type. This had a corner case that could lead to an invalid transition.
As part of this fix, we have also addressed an issue discussed in an
earlier thread on this list ("pcmk_remote evaluation"),
so the cluster will always prefer the newest Pacemaker Remote
connection to a remote node, even if an older (dead) connection has not
yet timed out.
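For the first item above, the affected layout is a corosync.conf
nodelist that lists only addresses. The fragment below is a minimal
sketch of the alternative, in which each node entry also carries a
name. The addresses and node names are hypothetical, and giving nodes
names is an assumption about how to avoid the IP-only case rather than
a workaround stated in this announcement. The recommended fix remains
applying the commits from the 1.1 branch.

```
nodelist {
    node {
        # hypothetical address and name; an IP-only entry omits "name"
        ring0_addr: 192.168.122.11
        name: node1
        nodeid: 1
    }
    node {
        ring0_addr: 192.168.122.12
        name: node2
        nodeid: 2
    }
}
```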
-- 
Ken Gaillot <kgaillot at redhat.com>