[ClusterLabs] Antw: Re: Antw: [EXT] Coming in 2.1.3: node health monitoring improvements

Thu Apr 14 01:57:29 EDT 2022

Ken,

thanks for the explanations! Maybe it would be best (next time) if you
present the documentation for a new feature first (as a base for discussion),
and _then_ implement it.
I know: People first implement it, and later, if they have time or feel like
it, they'll document it.
However, as I found out for myself, sometimes documentation is really useful
when you review your code some time later and wonder: "What should have been
the purpose of all that?" ;-)

Regards,
Ulrich

>>> Ken Gaillot <kgaillot at redhat.com> wrote on 13.04.2022 at 15:59 in
message
<3ad20a26a4623d2e7ff11eb0bdf822faae1a5114.camel at redhat.com>:
> On Wed, 2022-04-13 at 08:22 +0200, Ulrich Windl wrote:
>> > > > Ken Gaillot <kgaillot at redhat.com> wrote on 12.04.2022 at 17:22 in
>> message
>> <33f4147d0f6a3e46581aaa46a4eca81dfa59ce15.camel at redhat.com>:
>> > Hi all,
>> > 
>> > I'm hoping to have the first release candidate for 2.1.3 ready next
>> > week.
>> > 
>> > Pacemaker has long had a feature to monitor node health (CPU usage,
>> > SMART drive errors, etc.) and move resources off degraded nodes:
>> > 
>> > 
>> > https://clusterlabs.org/pacemaker/doc/2.1/Pacemaker_Explained/singlehtml/index.html#tracking-node-health
>> 
>> Great, I wanted to ask a question on it anyway:
>> Is the node health attribute stored in the CIB, or is it transient
>> (i.e., reset when the node is restarted)?
> 
> They can be either, although transient makes more sense. As long as the
> name starts with "#health" it will be treated as a health attribute.
> 
>> 
>> Some comments on the docs:
>> 
>> "yellow" state: could also mean node is becoming healthy (coming from
>> red),
>> right?
> 
> True, I'll make a note to update that
> 
>> 
>> The "Node Health Strategy" could benefit from a better explanation.
>> E.g.: "Assign the value of ..." Assign to whom/what?
> 
> The wording could definitely be improved.
> 
> In this case, the idea is that "red", "yellow", and "green" are just
> convenient names for particular integer scores. The actual values used
> depend on the strategy, hence "assign ... to red" and so forth.
> 
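To make the "names for integer scores" idea concrete, here is a minimal Python sketch of the mapping. It is an illustration only, not Pacemaker code: the function name `color_scores` is made up, `math.inf` merely stands in for Pacemaker's INFINITY score, and the per-strategy values reflect my reading of the "Tracking Node Health" documentation (the "none" and "custom" strategies are left out).

```python
import math

NEG_INF = -math.inf  # stand-in for Pacemaker's -INFINITY score (assumption)


def color_scores(strategy, red=NEG_INF, yellow=0, green=0):
    """Return the score each color name stands for under a strategy.

    For "progressive", the red/yellow/green arguments play the role of the
    node-health-red, node-health-yellow and node-health-green options.
    """
    if strategy == "migrate-on-red":
        return {"red": NEG_INF, "yellow": 0, "green": 0}
    if strategy == "only-green":
        return {"red": NEG_INF, "yellow": NEG_INF, "green": 0}
    if strategy == "progressive":
        return {"red": red, "yellow": yellow, "green": green}
    raise ValueError(f"strategy not covered by this sketch: {strategy}")


# The same color name maps to different scores depending on the strategy.
for name in ("migrate-on-red", "only-green"):
    print(name, color_scores(name))
print("progressive", color_scores("progressive", red=-100, yellow=-10))
```

So "assign ... to red" just means: this is the number that the word "red" is replaced with before scores are calculated.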
>> It's very hard to find out what "progressive" really does.
>> 
>> I think a configuration example with a sample scenario (node health
>> changes) would be very helpful.
> 
> Yes, progressive and custom are confusing without examples. I'll add it
> to the to-do list ...
> 
> The idea behind progressive is that you might want to give a negative
> but not infinite preference to yellow and/or red. With the other
> strategies, any red attribute will cause all resources to move off.
> With progressive, you could set red to some number (say -100) and that
> score would be used just as if you had configured a location constraint
> with that score. If you had stickiness higher than that, that would
> keep any existing resources running there, but prevent any new
> resources from being moved to the node.
> 
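The interaction between a finite red score and stickiness can be sketched with a toy model. This is a deliberate simplification, not how the scheduler is implemented: it assumes a node's score for a resource is just its health score plus the stickiness if the resource already runs there, and it ignores all other constraints. The function name `place` and the node names are hypothetical.

```python
def place(current_node, node_health, stickiness):
    """Pick the node with the highest score for one resource.

    current_node is the node the resource runs on now, or None for a
    resource that is not running anywhere yet.
    """
    scores = {}
    for node, health in node_health.items():
        # Stickiness only counts for the node the resource is already on.
        bonus = stickiness if node == current_node else 0
        scores[node] = health + bonus
    return max(scores, key=scores.get)


# node1 has gone red with the progressive strategy and red set to -100.
node_health = {"node1": -100, "node2": 0}

print(place("node1", node_health, stickiness=200))  # existing resource
print(place(None, node_health, stickiness=200))     # new resource
print(place("node1", node_health, stickiness=50))   # weaker stickiness
```

With stickiness 200 the existing resource stays on node1 (-100 + 200 beats 0), while a new resource prefers node2. With stickiness 50 the existing resource would move as well, since -100 + 50 is below node2's 0.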
>> 
>> > The 2.1.3 release will add a couple of features to make this more
>> > useful.
>> > 
>> > First, you can now exempt particular resources from health-related
>> > bans, using the new "allow-unhealthy-nodes" resource
>> > meta-attribute.
>> 
>> If that's a resource attribute, then the name is poorly chosen
>> (IMHO).
>> In times like these I'd almost suggest to name it
>> "immune-against-node-health=red" or so (OK, just a joke).
> 
> I always agonize over the names :)
> 
> What I really wanted was to use the existing "requires" meta-attribute. 
> It currently can be set to nothing, quorum, fencing, or unfencing, to
> determine what conditions have to be in place for the resource to run
> (the default of fencing means that the cluster partition must have
> quorum and any unclean nodes must have been successfully fenced).
> 
> It would have been nice to have requires="fencing,health" mean that the
> resource can only run on a healthy node (as defined by the configured
> strategy). Unfortunately that would not have been backward compatible
> with existing explicit configurations.
> 
>> 
>> 
>> > This is particularly helpful for the health monitoring agents
>> > themselves. Without the new option, health agents get moved off
>> 
>> Specifically if the health state can improve again.
>> 
>> > degraded nodes, which means the cluster can't detect if the degraded
>> > condition goes away. Users had to manually clear the health attributes
>> > to allow resources to move back to the node. Now, you can set
>> > allow-unhealthy-nodes=true on your health agent resources, so they can
>> > continue detecting changes in the health status.
>> > 
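A small Python sketch of the effect of the exemption, under stated assumptions: it models only the case where an unhealthy node gets a -INFINITY health score (a ban), represents meta-attributes as a plain dict of strings, and uses a made-up function name `eligible_nodes`. It is not Pacemaker's actual logic.

```python
import math

NEG_INF = -math.inf  # stand-in for Pacemaker's -INFINITY score (assumption)


def eligible_nodes(meta, node_health):
    """Return the nodes a resource may run on, given its meta-attributes.

    A node with a -INFINITY health score is banned unless the resource
    sets allow-unhealthy-nodes=true.
    """
    exempt = meta.get("allow-unhealthy-nodes", "false") == "true"
    return sorted(
        node for node, score in node_health.items()
        if exempt or score != NEG_INF
    )


node_health = {"node1": NEG_INF, "node2": 0}    # node1 is red
health_agent = {"allow-unhealthy-nodes": "true"}
database = {}                                   # option not set

print(eligible_nodes(health_agent, node_health))
print(eligible_nodes(database, node_health))
```

The health agent keeps node1 in its list and can therefore notice when the node recovers, while the ordinary resource is limited to node2.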
>> > Second, crm_mon will indicate when a node's health is yellow or
>> > red,
>> > like:
>> > 
>> >     * Node List:
>> >         * Node node1: online (health is RED)
>> 
>> For compatibility I'd prefer a new option to display those, or at
>> least a new
>> item; maybe like:
>> ----
>> Node Health:
>>   * Node: h16: green
>>   ...
>> ----
>> 
>> or
>> 
>> ---
>> Node Attributes:
>>   * Node h16: green
>> ---
> 
> You can already list all attributes (including health attributes) with
> the -A / --show-node-attributes option.
> 
>> 
>> > Previously, you would see that the node is not running any
>> > resources,
>> > but not know why, unless you thought to check every node health
>> > attribute.
>> 
>> That's definitely a bad thing for any artificial intelligence not to
>> be able to explain itself ;-)
>> 
>> Regards,
>> Ulrich
> 
> -- 
> Ken Gaillot <kgaillot at redhat.com>