[ClusterLabs] Antw: Re: Pacemaker 2.0.3-rc3 now available

Fri Nov 15 08:35:45 EST 2019

On Thu, 14 Nov 2019 11:09:57 -0600
Ken Gaillot <kgaillot at redhat.com> wrote:

> On Thu, 2019-11-14 at 15:22 +0100, Ulrich Windl wrote:
> > Jehan-Guillaume de Rorthais <jgdr at dalibo.com> wrote on
> > 14.11.2019 at 15:17 in message <20191114151719.6cbf4e38 at firost>:
> > > On Wed, 13 Nov 2019 17:30:31 -0600
> > > Ken Gaillot <kgaillot at redhat.com> wrote:
> > > ...  
> > > > A longstanding pain point in the logs has been improved. Whenever
> > > > the scheduler processes resource history, it logs a warning for
> > > > any failures it finds, regardless of whether they are new or old,
> > > > which can confuse anyone reading the logs. Now, the log will
> > > > contain the time of the failure, so it's obvious whether you're
> > > > seeing the same event or not. The log will also contain the exit
> > > > reason if one was provided by the resource agent, for easier
> > > > troubleshooting.
> > > 
> > > I've been hurt by this in the past, and I was wondering what the
> > > point is of warning again and again in the logs about past failures
> > > during scheduling. What does this information bring to the
> > > administrator?
> 
> The controller will log an event just once, when it happens.
> 
> The scheduler, on the other hand, uses the entire recorded resource
> history to determine the current resource state. Old failures (that
> haven't been cleaned) must be taken into account.

OK, I wasn't aware of this. If you have a few minutes, I would be interested to
know why the full history is needed rather than just the latest entry in it.
Or maybe there are some comments in the source code that already cover this
question?

> Every run of the scheduler is completely independent, so it doesn't
> know about any earlier runs or what they logged. Think of it like
> Frosty the Snowman saying "Happy Birthday!" every time his hat is put
> on.

I don't get this reference :)

> As far as each run is concerned, it is the first time it's seen the
> history. This is what allows the DC role to move from node to node, and
> the scheduler to be run as a simulation using a saved CIB file.
> 
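
To check my understanding of the explanation above, here is a toy Python
sketch of a stateless run. It is not Pacemaker code: HistoryEntry,
scheduler_run and the "latest record wins" rule are names and assumptions
made up purely for illustration. Only the message wording is taken from the
proposed log line in this thread.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class HistoryEntry:
    """One recorded operation result (hypothetical structure)."""
    resource: str
    node: str
    operation: str
    result: str          # e.g. "ok" or "not running"
    exit_reason: str
    recorded_at: datetime


def scheduler_run(history):
    """One independent run: it keeps no memory of any earlier run.

    Returns the derived state per (resource, node) and the warnings this
    run would log. Assumption: the most recent record decides the state.
    """
    state = {}
    warnings = []
    for entry in sorted(history, key=lambda e: e.recorded_at):
        key = (entry.resource, entry.node)
        if entry.result == "ok":
            state[key] = "running"
            continue
        state[key] = "failed"
        when = entry.recorded_at.strftime("%b %d %H:%M:%S %Y")
        warnings.append(
            f"Unexpected result ({entry.result}: {entry.exit_reason}) "
            f"was recorded for {entry.operation} of {entry.resource} "
            f"on {entry.node} at {when}"
        )
    return state, warnings
```

Calling scheduler_run() twice on the same saved history yields the same
warnings twice, since nothing is carried over between runs. The timestamp in
the message is then the only way for a reader to tell that both lines are
about the same old event.
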
> We could change the wording further if necessary. The previous version
> would log something like:
> 
> warning: Processing failed monitor of my-rsc on node1: not running
> 
> and this latest change will log it like:
> 
> warning: Unexpected result (not running: No process state file found)
> was recorded for monitor of my-rsc on node1 at Nov 12 19:19:02 2019

s/result/state/ ?

> I wanted to be explicit about the message being about processing
> resource history that may or may not be the first time it's been
> processed and logged, but everything I came up with seemed too long for
> a log line. Another possibility might be something like:
> 
> warning: Using my-rsc history to determine its current state on node1:
> Unexpected result (not running: No process state file found) was
> recorded for monitor at Nov 12 19:19:02 2019

I like the first one better.

However, it feels like implementation details exposed to the world,
doesn't it? How useful is this information for the end user? What can the user
do with this information? There's nothing to fix, and this is not actually an
error of the currently running process.

I still fail to understand why the scheduler doesn't process the history
silently, whatever it finds there, then warn for something really important if
the final result is not expected...
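
To make the idea concrete, here is roughly what I have in mind, as a toy
Python sketch. Everything in it is hypothetical (final_states, warnings_for,
the tuple layout), and it deliberately ignores whatever reason the real
scheduler has for needing the full history.

```python
def final_states(history):
    """Replay (resource, node, result) records in order, logging nothing."""
    states = {}
    for resource, node, result in history:
        # Later records overwrite earlier ones for the same resource/node.
        states[(resource, node)] = result
    return states


def warnings_for(history, expected):
    """Warn only where the final state differs from the expected one."""
    messages = []
    for key, state in final_states(history).items():
        if key in expected and state != expected[key]:
            resource, node = key
            messages.append(
                f"warning: {resource} on {node} is '{state}', "
                f"expected '{expected[key]}'"
            )
    return messages
```

With a history holding an old "not running" record followed by a "running"
one, and "running" as the expected state, warnings_for() returns an empty
list: the old failure was processed but never logged.
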

Regards,