[ClusterLabs] Antw: [EXT] Coming in 2.1.3: node health monitoring improvements

Ken Gaillot kgaillot at redhat.com
Wed Apr 13 09:59:55 EDT 2022


On Wed, 2022-04-13 at 08:22 +0200, Ulrich Windl wrote:
> > > > Ken Gaillot <kgaillot at redhat.com> wrote on 12.04.2022 at 17:22 in
> message <33f4147d0f6a3e46581aaa46a4eca81dfa59ce15.camel at redhat.com>:
> > Hi all,
> > 
> > I'm hoping to have the first release candidate for 2.1.3 ready next
> > week.
> > 
> > Pacemaker has long had a feature to monitor node health (CPU usage,
> > SMART drive errors, etc.) and move resources off degraded nodes:
> > 
> > https://clusterlabs.org/pacemaker/doc/2.1/Pacemaker_Explained/singlehtml/index.html#tracking-node-health
> 
> Great, I wanted to ask a question on it anyway:
> Is the node health attribute stored in the CIB, or is it transient
> (i.e., reset when the node is restarted)?

It can be either, although transient makes more sense. As long as the
name starts with "#health", it will be treated as a health attribute.

> 
> Some comments on the docs:
> 
> "yellow" state: could also mean node is becoming healthy (coming from
> red), right?

True, I'll make a note to update that.

> 
> The "Node Health Strategy" could benefit from a better explanation.
> E.g.: "Assign the value of ..." Assign to whom/what?

The wording could definitely be improved.

In this case, the idea is that "red", "yellow", and "green" are just
convenient names for particular integer scores. The actual values used
depend on the strategy, hence "assign ... to red" and so forth.
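
As a rough illustration of that mapping, here is a Python sketch (not
Pacemaker code). The fixed values for migrate-on-red and only-green
follow my reading of Pacemaker Explained; the progressive values are
assumed to come from the node-health-red, node-health-yellow, and
node-health-green cluster options. The function name is made up for the
example, and the custom strategy is left out because there the attribute
values are themselves scores.

```python
INFINITY = 1_000_000  # Pacemaker treats a score of 1000000 as infinity


def health_scores(strategy, red=0, yellow=0, green=0):
    """Return the integer score that each color name stands for.

    red/yellow/green are only consulted for the progressive strategy.
    """
    if strategy == "none":
        # Health attributes are ignored, so no color affects placement
        return {"red": 0, "yellow": 0, "green": 0}
    if strategy == "migrate-on-red":
        return {"red": -INFINITY, "yellow": 0, "green": 0}
    if strategy == "only-green":
        return {"red": -INFINITY, "yellow": -INFINITY, "green": 0}
    if strategy == "progressive":
        return {"red": red, "yellow": yellow, "green": green}
    raise ValueError(f"strategy not covered by this sketch: {strategy}")
```

So "assign ... to red" simply means choosing which number the name "red"
resolves to under the configured strategy.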

> It's very hard to find out what "progressive" really does.
> 
> I think a configuration example with a sample scenario (node health
> changes) would be very helpful.

Yes, progressive and custom are confusing without examples. I'll add
that to the to-do list.

The idea behind progressive is that you might want to give a negative
but not infinite preference to yellow and/or red. With the other
strategies, any red attribute will cause all resources to move off the
node. With progressive, you could set red to some number (say -100), and
that score would be used just as if you had configured a location
constraint with that score. If you had stickiness that outweighed that
score (more than 100 here), that would keep any existing resources
running there, but prevent any new resources from being moved to the
node.
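
A worked version of that scenario, as a deliberately simplified Python
sketch: it models only a per-node health score plus stickiness for the
node a resource is already on, and assumes a node whose total is
negative cannot run the resource. The helper name and node names are
hypothetical, and the real scheduler weighs many more factors.

```python
def place(health_scores, current_node=None, stickiness=0):
    """Pick a node for one resource from per-node health scores."""
    totals = {
        # Stickiness only counts toward the node the resource is on now
        node: score + (stickiness if node == current_node else 0)
        for node, score in health_scores.items()
    }
    # Nodes with a negative total are not allowed to run the resource
    eligible = {node: total for node, total in totals.items() if total >= 0}
    if not eligible:
        return None
    return max(eligible, key=eligible.get)


# node1 is red under progressive with red set to -100; node2 is green
health = {"node1": -100, "node2": 0}

stays = place(health, current_node="node1", stickiness=200)  # -100 + 200 = 100
moves = place(health, current_node="node1", stickiness=50)   # -100 + 50 = -50
new_resource = place(health)                                 # no stickiness yet
```

Here `stays` is node1, while `moves` and `new_resource` are both node2:
stickiness of 200 outweighs the -100, stickiness of 50 does not, and a
resource that is not running anywhere has no stickiness to offset the
health score.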

> 
> > The 2.1.3 release will add a couple of features to make this more
> > useful.
> > 
> > First, you can now exempt particular resources from health-related
> > bans, using the new "allow-unhealthy-nodes" resource
> > meta-attribute.
> 
> If that's a resource attribute, then the name is poorly chosen
> (IMHO).
> In times like these I'd almost suggest naming it
> "immune-against-node-health=red" or so (OK, just a joke).

I always agonize over the names :)

What I really wanted was to use the existing "requires" meta-attribute. 
It currently can be set to nothing, quorum, fencing, or unfencing, to
determine what conditions have to be in place for the resource to run
(the default of fencing means that the cluster partition must have
quorum and any unclean nodes must have been successfully fenced).

It would have been nice to have requires="fencing,health" mean that the
resource can only run on a healthy node (as defined by the configured
strategy). Unfortunately that would not have been backward compatible
with existing explicit configurations.

> 
> 
> > This is particularly helpful for the health monitoring agents
> > themselves. Without the new option, health agents get moved off
> 
> Specifically if the health state can improve again.
> 
> > degraded nodes, which means the cluster can't detect if the degraded
> > condition goes away. Users had to manually clear the health attributes
> > to allow resources to move back to the node. Now, you can set
> > allow-unhealthy-nodes=true on your health agent resources, so they can
> > continue detecting changes in the health status.
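
The effect of the new option can be pictured with a small Python sketch.
This is a simplified model, not the scheduler's actual logic: the
function name is hypothetical, the meta-attributes are assumed to be
available as a plain dictionary of strings, and only the literal value
"true" is recognized.

```python
def effective_health_score(node_health_score, resource_meta):
    """Health score that applies to one resource on one node."""
    # The option is assumed to default to false when it is not set
    exempt = resource_meta.get("allow-unhealthy-nodes", "false") == "true"
    # An exempt resource is placed as if the node's health were neutral
    return 0 if exempt else node_health_score


health_agent = {"allow-unhealthy-nodes": "true"}
ordinary_resource = {}

agent_score = effective_health_score(-1_000_000, health_agent)
ordinary_score = effective_health_score(-1_000_000, ordinary_resource)
```

With the node at red, `ordinary_score` keeps the ban while `agent_score`
is 0, so the health agent can stay on the node and notice when the
condition clears.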
> > 
> > Second, crm_mon will indicate when a node's health is yellow or
> > red, like:
> > 
> >     * Node List:
> >         * Node node1: online (health is RED)
> 
> For compatibility I'd prefer a new option to display those, or at
> least a new item; maybe like:
> ----
> Node Health:
>   * Node: h16: green
>   ...
> ----
> 
> or
> 
> ---
> Node Attributes:
>   * Node h16: green
> ---

You can already list all attributes (including health attributes) with
the -A / --show-node-attributes option.

> 
> > Previously, you would see that the node is not running any
> > resources, but not know why, unless you thought to check every node
> > health attribute.
> 
> That's definitely a bad thing for any artificial intelligence not to
> be able to explain itself ;-)
> 
> Regards,
> Ulrich

-- 
Ken Gaillot <kgaillot at redhat.com>


