[ClusterLabs] What's a "transition", BTW?

Mon Jan 18 13:29:55 EST 2021

On Fri, 2021-01-15 at 11:40 +0100, Ulrich Windl wrote:
> Hi!
> 
> With a cluster recheck interval, I see periodic log messages like
> this:
> Jan 15 11:05:50 h19 pacemaker-controld[4804]:  notice: State
> transition S_TRANSITION_ENGINE -> S_IDLE
> Jan 15 11:15:50 h19 pacemaker-controld[4804]:  notice: State
> transition S_IDLE -> S_POLICY_ENGINE

The "transition" terminology is a little confusing. In the two messages
above, the word is used only in its ordinary sense: the controller's
state changed.

The controller uses a finite state machine to keep track of what it's
doing now and next. Going from "transition engine" to "idle" means it
finished whatever needed to be done in that transition (in the more
technical Pacemaker sense). Going from "idle" to "policy engine" means
it is ready to re-invoke the scheduler to re-check whether anything
needs to be done.
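
To make the state-machine idea concrete, here is a minimal Python
sketch of one pass through the three states named in the log messages.
It is only an illustration, not Pacemaker's implementation: the real
controller has more states and inputs than shown, and the NEXT_STATE
table and run_cycle() helper are names made up for this example.

```python
from enum import Enum


class State(Enum):
    # Only the three controller states that appear in the quoted logs.
    S_IDLE = "S_IDLE"
    S_POLICY_ENGINE = "S_POLICY_ENGINE"
    S_TRANSITION_ENGINE = "S_TRANSITION_ENGINE"


# Hypothetical, simplified table: one successor per state.
NEXT_STATE = {
    State.S_IDLE: State.S_POLICY_ENGINE,               # invoke the scheduler
    State.S_POLICY_ENGINE: State.S_TRANSITION_ENGINE,  # execute the result
    State.S_TRANSITION_ENGINE: State.S_IDLE,           # nothing left to do
}


def run_cycle(start=State.S_IDLE):
    """Walk the table once around, returning log-style messages."""
    messages = []
    state = start
    while True:
        nxt = NEXT_STATE[state]
        messages.append(f"State transition {state.name} -> {nxt.name}")
        state = nxt
        if state is start:
            break
    return messages


if __name__ == "__main__":
    for message in run_cycle():
        print(message)
```

Note that "State transition" in these messages refers to the state
machine moving between states, which is the ordinary sense of the word.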

> Jan 15 11:15:50 h19 pacemaker-schedulerd[4803]:  notice: Watchdog
> will be used via SBD if fencing is required and stonith-watchdog-
> timeout is nonzero
> Jan 15 11:15:50 h19 pacemaker-schedulerd[4803]:  notice: Calculated
> transition 596, saving inputs in /var/lib/pacemaker/pengine/pe-input-
> 41.bz2
> Jan 15 11:15:50 h19 pacemaker-controld[4804]:  notice: Processing
> graph 596 (ref=pe_calc-dc-1610705750-978) derived from
> /var/lib/pacemaker/pengine/pe-input-41.bz2
> Jan 15 11:15:50 h19 pacemaker-controld[4804]:  notice: Transition 596
> (Complete=3, Pending=0, Fired=0, Skipped=0, Incomplete=0,
> Source=/var/lib/pacemaker/pengine/pe-input-41.bz2): Complete
> 
> The "transition" number increases each time, while there is visible
> no action to be performed. So what's in such a "transition"? Couldn't
> the cluster skip those lines if there's nothing to do?
> 
> Regards,
> Ulrich

In the technical Pacemaker sense, a "transition" is what you called an
"action plan" in a different post: the set of all actions needed to
bring the cluster to the desired state (as defined by the
configuration), given everything known about the cluster at the moment
(represented by the complete CIB, including configuration and status).

The controller starts a new transition whenever something interesting
happens (like a resource monitor failure), when a transition action
returns an unexpected result (like a start failing instead of
succeeding), and periodically (according to cluster-recheck-interval).

In any of these cases, it's possible there's nothing to do, so the
transition has no actions. It's still a record that the cluster checked
whether anything needed to be done and decided that nothing did. I have
considered lowering the log message to info level in that case, though;
that probably makes sense.
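
Until then, anyone who wants to filter these messages on their own
could parse the controller's summary line. The following Python sketch
assumes the line has exactly the shape shown in the quoted log (with
the wrapped lines rejoined); the regular expression and the
parse_transition_summary() name are made up for this example and are
not part of Pacemaker.

```python
import re

# Matches the "Transition N (...): Result" summary shown in the quoted
# log. "State transition S_... -> S_..." lines do not match, because
# the pattern needs a capital "T" followed by a number.
TRANSITION_RE = re.compile(
    r"Transition (?P<num>\d+) \((?P<fields>[^)]*)\): (?P<result>\w+)"
)


def parse_transition_summary(line):
    """Return the summary's fields as a dict, or None if no match."""
    match = TRANSITION_RE.search(line)
    if match is None:
        return None
    counters = {}
    source = None
    for field in match.group("fields").split(", "):
        key, _, value = field.partition("=")
        if key == "Source":
            source = value          # path of the saved scheduler input
        else:
            counters[key] = int(value)
    return {
        "number": int(match.group("num")),
        "counters": counters,
        "source": source,
        "result": match.group("result"),
    }


if __name__ == "__main__":
    line = (
        "Jan 15 11:15:50 h19 pacemaker-controld[4804]:  notice: "
        "Transition 596 (Complete=3, Pending=0, Fired=0, Skipped=0, "
        "Incomplete=0, "
        "Source=/var/lib/pacemaker/pengine/pe-input-41.bz2): Complete"
    )
    print(parse_transition_summary(line))
```

Which counter values should count as "nothing to do" is left to the
reader; the sketch only extracts the numbers.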
-- 
Ken Gaillot <kgaillot at redhat.com>