[ClusterLabs] Antw: Re: How Pacemaker reacts to fast changes of the same parameter in configuration

Thu Nov 10 09:17:25 UTC 2016

On 11/10/2016 08:27 AM, Ulrich Windl wrote:
>>>> Klaus Wenninger <kwenning at redhat.com> wrote on 09.11.2016 at 17:42 in
> message <80c65564-b299-e504-4c6c-afd0ff86e178 at redhat.com>:
>> On 11/09/2016 05:30 PM, Kostiantyn Ponomarenko wrote:
>>> When one problem seems to be solved, another one appears.
>>> Now my script looks this way:
>>>
>>>     crm --wait configure rsc_defaults resource-stickiness=50
>>>     crm configure rsc_defaults resource-stickiness=150
>>>
>>> While I am now sure that transitions caused by the first command
>>> won't be aborted, I see another possible problem here.
>>> With a minimum load in the cluster it took 22 sec for this script to
>>> finish.
>>> I see a weakness here.
>>> If the node on which this script is called goes down for any reason,
>>> then "resource-stickiness" is not set back to its original value,
>>> which is very bad.
> I don't quite understand: You want your resources to move to their preferred location after some problem. When the node goes down with the lower stickiness, there is no problem while the other node is down; when it comes up, resources might be moved, but isn't that what you wanted?

I guess this is about the general problem with features like 'move',
which go so much against how Pacemaker works.
They are implemented inside the high-level tooling.
They temporarily modify the CIB, and if something happens that makes
the controlling high-level tool go away, the change stays as is. Or the
CIB even stays modified and the user has to know that he has to do a
manual cleanup.
So we could actually derive a general discussion from this: how to handle
these issues so that artefacts are less likely to persist after
some administrative action.
At the moment, special tagging of the constraints that are automatically
created to trigger a move is one approach.
But when would you issue an automated cleanup? Is there anything
implemented in the high-level tooling? pcsd, I guess, would be a candidate;
for crmsh I don't know of a persistent instance that could take care of
that ...

If we say we won't implement these features in the core of pacemaker
I definitely agree. But is there anything we could do to make it easier
for high-level-tools?
I'm thinking of some mechanism that makes the constraints somehow
magically disappear or become disabled when they have achieved what they
were intended to, if the connection to some administrative shell is
lost, or ...
I could imagine a dependency on some token given to a shell, something
like a suicide timeout, ...
Maybe the usual habit when configuring a switch/router can trigger
some ideas: schedule a reboot in x minutes; make a non-persistent config
change; check that everything is fine afterwards; make it persistent;
cancel the timed reboot.
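
A partial sketch of the "administrative shell goes away" case, reusing
only the two crm commands quoted in this thread: a shell trap restores the
original value when the script exits or receives a hangup/interrupt. The
function and variable names are made up for illustration. This does not
help if the node itself dies, because then nothing is left to run the trap.

```shell
#!/bin/sh
# Values taken from the script quoted in this thread.
ORIGINAL_STICKINESS=150
TEMPORARY_STICKINESS=50

# Put resource-stickiness back to its original value.
restore_stickiness() {
    crm configure rsc_defaults resource-stickiness="$ORIGINAL_STICKINESS"
}

# Hypothetical helper: the body runs in a subshell, so the traps below
# only live for the duration of this one call.
failback_once() (
    # Restore on any exit from the subshell, normal or not.
    trap restore_stickiness EXIT
    # Turn common signals into an exit so the EXIT trap still runs.
    trap 'exit 1' HUP INT TERM
    # Lower the stickiness and wait for the cluster to settle.
    crm --wait configure rsc_defaults resource-stickiness="$TEMPORARY_STICKINESS"
)
```

Calling failback_once would then take the place of the two plain crm
lines in the fail-back script.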

>
>>> So, now I am thinking of how to solve this problem. I would appreciate
>>> any thoughts about this.
>>>
>>> Is there a way to ask Pacemaker to do these commands sequentially so
>>> there is no need to wait in the script?
>>> If it is possible, then I think that my concern from above goes away.
>>>
>>> Another thing which comes to my mind is to use time-based rules.
>>> This way, when I need to do a manual fail-back, I simply set (or
>>> update) a time-based rule from the script.
>>> And the rule will basically say: set "resource-stickiness" to 50
>>> right now and expire in 10 min.
>>> This looks good at first glance, but there is no reliable way to
>>> pick a minimum sufficient time for it; at least none that I am aware of.
>>> And the thing is - it is important to me that "resource-stickiness" is
>>> set back to its original value as soon as possible.
>>>
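
For the time-based rule idea, the CIB side could look roughly like the
output of the function below. It follows the rsc_defaults-with-rules
pattern described in Pacemaker Explained; the function name and all id
values are hypothetical, and 50/150 are the values from this thread. As far
as I know, in Pacemaker versions of this era time-based rules are only
re-evaluated when the cluster-recheck-interval timer fires, so the expiry
is not exact. The sketch also does not answer what a sufficient minimum
time would be.

```shell
#!/bin/sh
# Print an rsc_defaults section in which the low stickiness wins only
# until END_TIME (an ISO 8601 timestamp passed as the first argument).
# Assumption: all id values below are invented for this example.
failback_window_xml() {
    end_time=$1
    cat <<EOF
<rsc_defaults>
  <meta_attributes id="failback-window" score="2">
    <rule id="failback-window-rule" score="0">
      <date_expression id="failback-window-end" operation="lt" end="$end_time"/>
    </rule>
    <nvpair id="failback-window-stickiness" name="resource-stickiness" value="50"/>
  </meta_attributes>
  <meta_attributes id="normal-stickiness" score="1">
    <nvpair id="normal-stickiness-value" name="resource-stickiness" value="150"/>
  </meta_attributes>
</rsc_defaults>
EOF
}
```

For example, failback_window_xml 2016-11-10T09:30:00Z prints a section
whose first attribute set stops applying at that time, after which the
second set (150) is used again. The result would still have to be loaded
into the CIB, e.g. with cibadmin; check its man page for the options in
your version.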
>>> Those are my thoughts. As I said, I appreciate any ideas here.
>> I have never tried --wait with crmsh, but I would guess that the delay
>> you are observing is really the time your resources take to stop and
>> start somewhere else.
>>
>> Actually you would need the reduced stickiness just during the stop
>> phase, right?
>>
>> So as there is no command like "wait till all stops are done", you could
>> still run 'crm_simulate -Ls' and check that it doesn't want to stop
>> anything anymore.
>> That way you save the time the starts would take.
>> Unfortunately you have to repeat that, and thus put additional load on
>> Pacemaker, possibly slowing things down if your poll cycle is too short.
>>
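
A rough sketch of that polling idea. It assumes that 'crm_simulate -Ls'
prints a "Transition Summary:" header followed by one " * <Action> ..."
line per pending action; the exact layout differs between Pacemaker
versions, so check the output on your cluster first. The function names
and the default interval/attempt values are made up.

```shell
#!/bin/sh
# Count pending actions listed after "Transition Summary:".
# Optional argument: a regex to restrict the count, e.g. 'Stop|Move'
# (action names are an assumption about the output format).
pending_actions() {
    pattern=${1:-.}
    crm_simulate -Ls | awk -v pat="$pattern" '
        /^Transition Summary:/ { seen = 1; next }
        seen && /^[ \t]*\*/ && $0 ~ pat { n++ }
        END { print n + 0 }'
}

# Poll until no matching action is pending or the attempts run out.
# Arguments: interval in seconds (default 2), attempts (default 30),
# optional action regex passed on to pending_actions.
wait_for_empty_transition() {
    interval=${1:-2}
    attempts=${2:-30}
    while [ "$attempts" -gt 0 ]; do
        [ "$(pending_actions "${3:-.}")" -eq 0 ] && return 0
        attempts=$((attempts - 1))
        sleep "$interval"
    done
    return 1
}
```

Between the two crm configure calls one could then run something like
wait_for_empty_transition 5 60 'Stop|Move' and only restore the
stickiness once it returns 0. A longer interval keeps the extra load on
Pacemaker down.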
>>>
>>> Thank you,
>>> Kostia
>>>
>>> On Tue, Nov 8, 2016 at 10:19 PM, Dejan Muhamedagic
>>> <dejanmm at fastmail.fm> wrote:
>>>
>>>     On Tue, Nov 08, 2016 at 12:54:10PM +0100, Klaus Wenninger wrote:
>>>     > On 11/08/2016 11:40 AM, Kostiantyn Ponomarenko wrote:
>>>     > > Hi,
>>>     > >
>>>     > > I need a way to do a manual fail-back on demand.
>>>     > > To be clear, I don't want it to be ON/OFF; I want it to be more
>>>     > > like "one shot".
>>>     > > So far the most reasonable way I have found to do it is to set
>>>     > > "resource stickiness" to a different value, and then set it back
>>>     > > to what it was.
>>>     > > To do that I created a simple script with two lines:
>>>     > >
>>>     > >     crm configure rsc_defaults resource-stickiness=50
>>>     > >     crm configure rsc_defaults resource-stickiness=150
>>>     > >
>>>     > > There are no timeouts before setting the original value back.
>>>     > > If I call this script, I get what I want: Pacemaker moves
>>>     > > resources to their preferred locations, and "resource stickiness"
>>>     > > is set back to its original value.
>>>     > >
>>>     > > Although it works, I still have a few concerns about this approach.
>>>     > > Will I get the same behavior under a big load with delays on the
>>>     > > systems in the cluster (which is truly possible and a normal case
>>>     > > in my environment)?
>>>     > > How does Pacemaker treat a fast change of this parameter?
>>>     > > I am worried that if "resource stickiness" is set back to its
>>>     > > original value too fast, then no fail-back will happen. Is that
>>>     > > possible, or shouldn't I worry about it?
>>>     >
>>>     > AFAIK pengine is interrupted when calculating a more complicated
>>>     > transition, and if the situation has changed, a transition that is
>>>     > just being executed is aborted if the input from pengine changed.
>>>     > So I would definitely worry!
>>>     > What you could do is issue 'crm_simulate -Ls' in between and grep
>>>     > for an empty transition.
>>>     > There might be more elegant ways, but that should be safe.
>>>
>>>     crmsh has an option (-w) to wait for the PE to settle after
>>>     committing configuration changes.
>>>
>>>     Thanks,
>>>
>>>     Dejan
>>>     >
>>>     > > Thank you,
>>>     > > Kostia
>>>     > >
>>>     > >
>>>     > > _______________________________________________
>>>     > > Users mailing list: Users at clusterlabs.org
>>>     > > http://clusterlabs.org/mailman/listinfo/users
>>>     > >
>>>     > > Project Home: http://www.clusterlabs.org
>>>     > > Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>>>     > > Bugs: http://bugs.clusterlabs.org