[ClusterLabs] Pacemaker auto restarts disabled groups

Mon Nov 12 12:09:29 EST 2018

On 09/11/18 13:24 +0000, Ian Underhill wrote:
> Yep all my pcs commands run on a live cluster. The design needs
> resources to respond in specific ways before moving on to other
> shutdown requests.
> 
> So it seems that these pcs commands, run on different nodes at
> the same time, are the root cause of this issue. Anything that
> changes the live cib at the same time seems to cause pacemaker to
> just skip\throw away actions that have been requested.

Indeed, that's the tax for a soft split-brain on the administrative
side.  It is not really solvable until pacemaker's database plus
transitions (or some higher level such as pcs, though exclusivity
at that level is not enforceable in any way, so it could always be
circumvented) become, in lock-step, a cluster-wide serializing,
ACID-guaranteeing bottleneck, with all the rechecks like "am I still
quorate" mixed in.  Pacemaker took the opposite design path: it
favors fast, dynamic adaptation, at the cost of requiring
"one-headed" control (or, more softly, control that does not collide
on the eventual decisions) so as not to compromise deterministic,
timely outcomes.

Sadly, people are not buying that and apply various antithetical
approaches, such as parallelized Ansible (where the concept of
idempotence got lost somewhere in the naive assumption of a static
[vs. very dynamic] system without peer-shared state [vs. shared
state manipulated from whichever end-point]):

https://lists.clusterlabs.org/pipermail/users/2018-June/015133.html

and are hence walking on the thin ice of "split-brain" management
that may backfire at some point (far less tangible than a real split
brain from the perspective of shared R/W resources, but still with
some risks).

Anyway, anyone is welcome to implement said cluster-wide,
transaction-like "critical section" semantics (with provable
preconditions, and possibly bypassable as mentioned) on their own in
order to make the outcomes effectively deterministic.  Or, perhaps
more easily, to avoid administrative split-brains somewhere higher in
the control logic, if at all possible.  Or to learn to live with this
downside of the current arrangement and not act as if transaction
semantics and end-to-end reliability were ever guaranteed (again,
looking at naive cluster-wide parallelized automation solutions).

> I have to admit this behaviour is very hard to work with. Though in
> a simple system using a shadow cib would avoid these issues, that
> would suggest a central point of control anyway.
> 
> Luckily I have been able to redesign my approach so that all the
> commands that affect the live cib (on cluster shutdown\startup) are
> run from a single node within the cluster (and added --wait to
> commands where possible).
> 
> This approach removes all these issues, and things behave as expected.

Glad to hear it was surmountable in this case!
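
For the archives, a minimal sketch of that single-node arrangement,
assuming every administrative script really is funneled through one
node and that flock(1) from util-linux is available there.  The lock
file path and the function names are made up for illustration, and the
lock only serializes callers on that one node; it is in no way the
cluster-wide critical section discussed above.

```shell
#!/bin/sh
# Hypothetical lock file; any path writable by the admin user will do.
LOCK_FILE=${LOCK_FILE:-/var/lock/cluster-admin.lock}
# Overridable so the script can be dry-run against a stub.
PCS=${PCS:-pcs}

# Run one pcs command while holding an exclusive lock, so that scripts
# on THIS node never change the live CIB concurrently.
serialized_pcs() {
    (
        flock -x 9 || exit 1
        "$PCS" "$@"
    ) 9>"$LOCK_FILE"
}

# Disable each group in turn; --wait makes pcs block until the cluster
# has finished acting on the change before the next one is issued.
disable_groups_in_order() {
    for group in "$@"; do
        serialized_pcs resource disable "$group" --wait || return 1
    done
}
```

Calling `disable_groups_in_order groupA groupB` (placeholder group
names) then issues one disable at a time rather than racing them.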

-- Jan

> Date: Thu, 08 Nov 2018 10:58:52 -0600
> From: Ken Gaillot
> Message-ID: <1541696332.5197.3.camel at redhat.com>
>
> On Thu, 2018-11-08 at 12:14 +0000, Ian Underhill wrote:
>> It seems this issue has been raised before, but has gone quiet, with
>> no solution
>>
>> https://lists.clusterlabs.org/pipermail/users/2017-October/006544.html
>
> In that case, something appeared to be explicitly re-enabling the
> disabled resources. You can search your logs for "target-role" to see
> whether that's happening.
>
>> I know my resource agents successfully return the correct status to
>> the start\stop\monitor requests
>>
>> On Thu, Nov 8, 2018 at 11:40 AM Ian Underhill <ianpunderhill at gmail.com>
>> wrote:
>>> Sometimes I'm seeing that a resource group that is in the process of
>>> being disabled is automatically restarted by pacemaker.
>>>
>>> When issuing pcs disable commands to disable different resource
>>> groups at the same time (on different nodes, at the group level)
>>> the result is that sometimes the resource is stopped and restarted
>>> straight away. I'm using a balanced placement strategy.
>
> The first thing that comes to mind is that if you're running pcs on the
> live cluster, it won't actually be at the same time, there will be a
> small amount of time between each disable. The cluster could well
> decide to rebalance and thus restart other resource groups that haven't
> yet been disabled.
>
> A way around that would be to run pcs on a file instead and push that
> to the live cluster:
>
>  pcs cluster cib whatever.xml
>  pcs -f whatever.xml ...whatever command you want...
>  ...
>  pcs cluster cib-push whatever.xml --config
>
> That would make all the disabling happen at the same time.
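
The file-based steps quoted above could be wrapped roughly as follows.
This is only a sketch: the function name is hypothetical, and
`resource disable` is filled in as the "whatever command" because that
is what this thread is about.  Note that the push can still race with
anyone else changing the live CIB between the snapshot and the push.

```shell
#!/bin/sh
# Overridable so the script can be dry-run against a stub.
PCS=${PCS:-pcs}

# Disable all listed groups in ONE CIB update instead of one per group.
batch_disable() {
    cib_file=$(mktemp) || return 1
    status=0

    # Snapshot the live CIB into a scratch file.
    "$PCS" cluster cib "$cib_file" || status=1

    # Apply every disable to the file only; the cluster sees nothing yet.
    for group in "$@"; do
        [ "$status" -eq 0 ] || break
        "$PCS" -f "$cib_file" resource disable "$group" || status=1
    done

    # Push the configuration section back in a single step.
    if [ "$status" -eq 0 ]; then
        "$PCS" cluster cib-push "$cib_file" --config || status=1
    fi

    rm -f "$cib_file"
    return "$status"
}
```

Usage would be along the lines of `batch_disable groupA groupB`, with
the group names being placeholders.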
>
>>>
>>> Looking into the daemon log, pacemaker is aborting transitions due
>>> to a config change of the target-role meta attribute?
>>>
>>> Transition 2838 (Complete=25, Pending=0, Fired=0, Skipped=3,
>>> Incomplete=10, Source=/var/lib/pacemaker/pengine/pe-input-704.bz2):
>>> Stopped
>>>
>>> could somebody explain Complete/Pending/Fired/Skipped/Incomplete
>>> and is there a way of displaying Skipped actions?
>
> It's almost never useful to end users, and barely more useful even to
> developers. If you pass -VVVVVV to crm_simulate, you could get more
> info, but trust me you don't want to do that. ;-)
>
> Each transition is a set of actions needed to get to the desired state.
> "Complete" are actions that were initiated and a result was received.
> "Pending" are actions that were initiated but the result hasn't come
> back yet. "Skipped" is for certain failure situations, and for when a
> transition is aborted and an action that would be scheduled is a lower
> priority than the abort (which is probably what happened here, nothing
> significant). "Incomplete" is for actions that haven't been initiated
> yet.
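
To pull those counters out of a log for a quick overview, something
like the following may help.  It is a sketch that assumes the summary
is logged on a single line with the fields in the same order as in the
sample quoted above; the function name is made up for illustration.

```shell
#!/bin/sh
# Read log lines on stdin and print one compact line per transition
# summary: "<transition id> complete=N pending=N fired=N skipped=N incomplete=N".
# Lines that do not look like a transition summary are ignored.
summarize_transitions() {
    sed -n 's/.*Transition \([0-9][0-9]*\) (Complete=\([0-9]*\), Pending=\([0-9]*\), Fired=\([0-9]*\), Skipped=\([0-9]*\), Incomplete=\([0-9]*\),.*/\1 complete=\2 pending=\3 fired=\4 skipped=\5 incomplete=\6/p'
}
```

Fed the sample line from this thread on stdin, it prints
`2838 complete=25 pending=0 fired=0 skipped=3 incomplete=10`.
Non-zero skipped or incomplete counts point at transitions that were
aborted part-way, as described in the quoted explanation.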
>
>
>>> I've used crm_simulate --xml-file XXXX --run to see the actions, and
>>> I see this extra start request