[ClusterLabs] Antw: Re: Pacemaker 2.0.3-rc3 now available

Thu Nov 14 12:09:57 EST 2019

On Thu, 2019-11-14 at 15:22 +0100, Ulrich Windl wrote:
> > > > Jehan-Guillaume de Rorthais <jgdr at dalibo.com> wrote on
> > > > 14.11.2019 at 15:17 in message <20191114151719.6cbf4e38 at firost>:
> > On Wed, 13 Nov 2019 17:30:31 -0600
> > Ken Gaillot <kgaillot at redhat.com> wrote:
> > ...
> > > A longstanding pain point in the logs has been improved. Whenever the
> > > scheduler processes resource history, it logs a warning for any
> > > failures it finds, regardless of whether they are new or old, which can
> > > confuse anyone reading the logs. Now, the log will contain the time of
> > > the failure, so it's obvious whether you're seeing the same event or
> > > not. The log will also contain the exit reason if one was provided by
> > > the resource agent, for easier troubleshooting.
> > 
> > I've been hurt by this in the past, and I was wondering what the point
> > is of warning again and again in the logs about past failures during
> > scheduling. What does this information bring to the administrator?

The controller will log an event just once, when it happens.

The scheduler, on the other hand, uses the entire recorded resource
history to determine the current resource state. Old failures (that
haven't been cleaned) must be taken into account.

Every run of the scheduler is completely independent, so it doesn't
know about any earlier runs or what they logged. Think of it like
Frosty the Snowman saying "Happy Birthday!" every time his hat is put
on. As far as each run is concerned, it is the first time it's seen the
history. This is what allows the DC role to move from node to node, and
the scheduler to be run as a simulation using a saved CIB file.
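
As a rough illustration of that statelessness (a simplified sketch, not
Pacemaker's actual code), the following treats a scheduler run as a pure
function of the recorded history. The names HistoryEntry and
run_scheduler are hypothetical, and the real scheduler considers far
more than a result string. The warning text reuses the older wording
quoted in this message.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HistoryEntry:
    """One recorded operation result (hypothetical, heavily simplified)."""
    resource: str
    node: str
    operation: str
    result: str  # e.g. "ok" or "not running"


def run_scheduler(history):
    """Derive failed resources from the full history.

    No state is kept between calls, so every call re-reads the whole
    history and produces a warning for every failure still recorded.
    """
    failed = set()
    warnings = []
    for entry in history:
        if entry.result != "ok":
            failed.add((entry.resource, entry.node))
            warnings.append(
                f"warning: Processing failed {entry.operation} of "
                f"{entry.resource} on {entry.node}: {entry.result}"
            )
    return failed, warnings
```

Calling run_scheduler twice on the same history returns the same
warnings both times, which is why an uncleaned failure shows up in the
logs on every run.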

We could change the wording further if necessary. The previous version
would log something like:

warning: Processing failed monitor of my-rsc on node1: not running

and this latest change will log it like:

warning: Unexpected result (not running: No process state file found)
was recorded for monitor of my-rsc on node1 at Nov 12 19:19:02 2019

I wanted the message to make explicit that it comes from processing
resource history, which may or may not be the first time that history
has been processed and logged, but everything I came up with seemed too
long for a log line. Another possibility might be something like:

warning: Using my-rsc history to determine its current state on node1:
Unexpected result (not running: No process state file found) was
recorded for monitor at Nov 12 19:19:02 2019
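
To make the difference from the old message concrete, here is a small
sketch of how the new wording could be assembled from a history entry.
The function describe_failure and its parameters are hypothetical, and
the behavior when no exit reason is available is my assumption rather
than something taken from the Pacemaker source.

```python
from datetime import datetime


def describe_failure(operation, resource, node, result, when, exit_reason=None):
    """Build a warning in the style of the new wording (illustrative only)."""
    # Assumption: with no exit reason, only the result is shown.
    detail = f"{result}: {exit_reason}" if exit_reason else result
    stamp = when.strftime("%b %d %H:%M:%S %Y")
    return (
        f"warning: Unexpected result ({detail}) was recorded for "
        f"{operation} of {resource} on {node} at {stamp}"
    )


message = describe_failure(
    "monitor", "my-rsc", "node1", "not running",
    datetime(2019, 11, 12, 19, 19, 2),
    exit_reason="No process state file found",
)
```

Because the timestamp is that of the recorded failure rather than of the
scheduler run, two log lines carrying the same time can be recognized as
the same event.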

> > In my humble opinion, any entry in the log file should be about
> > something happening at the time the message appears. And it should
> > appear only once, not be repeated again and again for no apparent
> > reason. At least, most of the time. Am I missing something?
> > 
> > I'm sure these historical failure warnings raised by the scheduler have
> > already been raised in the past by either the lrm or crm process in most
> > cases, haven't they?
> > 
> > Unless I'm unaware of something else, the scheduler might warn about the
> > current unexpected status of a resource, not about all of the past ones.
> > 
> > Could you shed some light on this mystery from the user's point of view?
> 
> Hi!
> 
> I can agree that the current pacemaker of SLES12 logs so much while
> virtually doing nothing that it's very hard to find out when pacemaker
> actually does something. And if it does something, it seems it's
> announcing the same thing at least three times before actually really
> doing anything.
> 
> Regards,
> Ulrich

Part of the difficulty arises from Pacemaker's design of using multiple
independent daemons to handle different aspects of cluster management.
A single failure event might get logged by the executor (lrmd),
controller (crmd), and scheduler (pengine), but in different contexts
(potentially on different nodes).

Improving the logs is a major focus of new releases, and we're always
looking for specific suggestions as to which messages need the most
attention. There's been a lot of progress between 1.1.14 and 2.0.3, but
it takes a while for that to land in distributions.
-- 
Ken Gaillot <kgaillot at redhat.com>