[ClusterLabs] Antw: Re: Pacemaker 2.0.3-rc3 now available

Mon Nov 18 11:45:25 EST 2019

On Fri, 2019-11-15 at 14:35 +0100, Jehan-Guillaume de Rorthais wrote:
> On Thu, 14 Nov 2019 11:09:57 -0600
> Ken Gaillot <kgaillot at redhat.com> wrote:
> 
> > On Thu, 2019-11-14 at 15:22 +0100, Ulrich Windl wrote:
> > > > > > Jehan-Guillaume de Rorthais <jgdr at dalibo.com> schrieb am
> > > > > > 14.11.2019 um  
> > > 
> > > 15:17 in
> > > Nachricht <20191114151719.6cbf4e38 at firost>:  
> > > > On Wed, 13 Nov 2019 17:30:31 ‑0600
> > > > Ken Gaillot <kgaillot at redhat.com> wrote:
> > > > ...  
> > > > > A longstanding pain point in the logs has been improved.
> > > > > Whenever
> > > > > the
> > > > > scheduler processes resource history, it logs a warning for
> > > > > any
> > > > > failures it finds, regardless of whether they are new or old,
> > > > > which can
> > > > > confuse anyone reading the logs. Now, the log will contain
> > > > > the
> > > > > time of
> > > > > the failure, so it's obvious whether you're seeing the same
> > > > > event
> > > > > or
> > > > > not. The log will also contain the exit reason if one was
> > > > > provided by
> > > > > the resource agent, for easier troubleshooting.  
> > > > 
> > > > I've been hurt by this in the past and I was wondering what was
> > > > the
> > > > point of
> > > > warning again and again in the logs for past failures during
> > > > scheduling? 
> > > > What this information brings to the administrator?  
> > 
> > The controller will log an event just once, when it happens.
> > 
> > The scheduler, on the other hand, uses the entire recorded resource
> > history to determine the current resource state. Old failures (that
> > haven't been cleaned) must be taken into account.
> 
> OK, I wasn't aware of this. If you have a few minutes, I would be
> interested to
> know why the full history is needed and not just find the latest
> entry from
> there. Or maybe there's some comments in the source code that already
> cover this question?

The full *recorded* history consists of the most recent operation that
affects the state (like start/stop/promote/demote), the most recent
failed operation, and the most recent results of any recurring
monitors.

For example there may be a failed monitor, but whether the resource is
considered failed or not would depend on whether there was a more
recent successful stop or start. Even if the failed monitor has been
superseded, it needs to stay in the history for display purposes until
the user has cleaned it up.

> > Every run of the scheduler is completely independent, so it doesn't
> > know about any earlier runs or what they logged. Think of it like
> > Frosty the Snowman saying "Happy Birthday!" every time his hat is
> > put
> > on.
> 
> I don't have this ref :)

I figured not everybody would, but it was too fun to pass up :)

The snowman comes to life every time his magic hat is put on, but to
him each time feels like he's being born for the first time, so he says
"Happy Birthday!"

https://www.youtube.com/watch?v=1PbWTEYoN8o

> > As far as each run is concerned, it is the first time it's seen the
> > history. This is what allows the DC role to move from node to node,
> > and
> > the scheduler to be run as a simulation using a saved CIB file.
> > 
> > We could change the wording further if necessary. The previous
> > version
> > would log something like:
> > 
> > warning: Processing failed monitor of my-rsc on node1: not running
> > 
> > and this latest change will log it like:
> > 
> > warning: Unexpected result (not running: No process state file
> > found)
> > was recorded for monitor of my-rsc on node1 at Nov 12 19:19:02 2019
> 
> /result/state/ ?

It's the result of a resource agent action, so it could be for example
a timeout or a permissions issue.

> > I wanted to be explicit about the message being about processing
> > resource history that may or may not be the first time it's been
> > processed and logged, but everything I came up with seemed too long
> > for
> > a log line. Another possibility might be something like:
> > 
> > warning: Using my-rsc history to determine its current state on
> > node1:
> > Unexpected result (not running: No process state file found) was
> > recorded for monitor at Nov 12 19:19:02 2019
> 
> I better like the first one.
> 
> However, it feels like implementation details exposed to the world,
> isn't it? How useful is this information for the end user? What the
> user can do
> with this information? There's noting to fix and this is not actually
> an error
> of the current running process.
> 
> I still fail to understand why the scheduler doesn't process the
> history
> silently, whatever it finds there, then warn for something really
> important if
> the final result is not expected...

>From the scheduler's point of view, it's all relevant information that
goes into the decision making. Even an old failure can cause new
actions, for example if quorum was not held at the time but has now
been reached, or if there is a failure-timeout that just expired. So
any failure history is important to understanding whatever the
scheduler says needs to be done.

Also, the scheduler is run on the DC, which is not necessarily the node
that executed the action. So it's useful for troubleshooting to present
a picture of the whole cluster on the DC, rather than just what's the
situation on the local node.

I could see an argument for lowering it from warning to notice, but
it's a balance between what's most useful during normal operation and
what's most useful during troubleshooting.

> Regards,
> 
-- 
Ken Gaillot <kgaillot at redhat.com>