[ClusterLabs] RFC: allowing soft recovery attempts before ignore/block/etc.

Thu Sep 22 12:08:28 EDT 2016

On 09/22/2016 10:43 AM, Jan Pokorný wrote:
> On 21/09/16 10:51 +1000, Andrew Beekhof wrote:
>> On Wed, Sep 21, 2016 at 6:25 AM, Ken Gaillot <kgaillot at redhat.com> wrote:
>>> Our first proposed approach would add a new hard-fail-threshold
>>> operation property. If specified, the cluster would first try restarting
>>> the resource on the same node,
>>
>>
>> Well, just as now, it would be _allowed_ to start on the same node, but
>> this is not guaranteed.
> 
> Yeah, I should attend doublethink classes to understand "the same
> node" term better:
> 
> https://github.com/ClusterLabs/pacemaker/pull/1146/commits/3b3fc1fd8f2c95d8ab757711cf096cf231f27941

"Same node" is really a shorthand to hand-wave some details, because
that's what will typically happen.

The exact behavior is: "If the fail-count on this node reaches <N>, ban
this node from running the resource."

That's not the same as *requiring* the resource to restart on the same
node before <N> is reached. As in any situation, Pacemaker will
re-evaluate the current state of the cluster, and choose the best node
to try starting the resource on.
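
As a rough illustration, here is a toy model in Python. It is not
Pacemaker's actual scheduler; the names allowed_nodes and pick_node and
the score table are made up for this sketch. The threshold only removes
a node from consideration once its fail-count reaches N. Below N, the
node simply stays in the pool and competes like any other:

```python
def allowed_nodes(nodes, fail_counts, threshold):
    """Return the nodes not yet banned by the fail-count threshold."""
    return [n for n in nodes if fail_counts.get(n, 0) < threshold]


def pick_node(nodes, fail_counts, threshold, scores):
    """Pick the highest-scoring node that is still allowed, or None."""
    candidates = allowed_nodes(nodes, fail_counts, threshold)
    if not candidates:
        return None
    return max(candidates, key=lambda n: scores.get(n, 0))


# Hypothetical two-node cluster where node1 is preferred.
nodes = ["node1", "node2"]
scores = {"node1": 100, "node2": 50}
threshold = 3

# Below the threshold, node1 is still allowed and still preferred.
after_one_failure = pick_node(nodes, {"node1": 1}, threshold, scores)

# At the threshold, node1 is banned and the resource goes elsewhere.
after_three_failures = pick_node(nodes, {"node1": 3}, threshold, scores)
```

Nothing in this model pins the resource to node1 before the threshold
is hit; node1 wins only because nothing else changed its standing.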

For example, suppose the failed resource with on-fail=restart is
colocated with another resource with on-fail=standby that also failed.
The whole node will then be put in standby, and the original resource
will of course move away. It will be restarted, but the start will
happen on another node.
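
That scenario can be sketched the same way. Again, this is a toy model
in Python with hypothetical names (rsc_a, rsc_b, eligible), not how
Pacemaker implements it. The point is that the node is never banned for
the restarted resource, yet the restart still lands elsewhere:

```python
def eligible(nodes, standby, banned):
    """Nodes that can run the resource: not in standby and not banned."""
    return [n for n in nodes if n not in standby and n not in banned]


nodes = ["node1", "node2"]
standby = set()
banned = set()  # fail-count is below the threshold, so nothing is banned

# Both colocated resources failed on node1 (hypothetical resource names).
failed_on = "node1"
on_fail = {"rsc_a": "restart", "rsc_b": "standby"}

# rsc_b's on-fail=standby puts the whole node in standby.
for policy in on_fail.values():
    if policy == "standby":
        standby.add(failed_on)

# rsc_a is restarted, but node1 is no longer a candidate.
restart_candidates = eligible(nodes, standby, banned)
```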

There are endless such scenarios, so "try restarting on the same node"
is not really accurate. To be accurate, I should have said something
like "try restarting without banning the node with the failure".