[ClusterLabs] Antw: [EXT] Re: A bug? (SLES15 SP2 with "crm resource refresh")
Ken Gaillot
kgaillot at redhat.com
Mon Jan 11 09:46:10 EST 2021
On Mon, 2021-01-11 at 08:25 +0100, Ulrich Windl wrote:
> > > > Ken Gaillot <kgaillot at redhat.com> wrote on 2021-01-08 at 17:38 in
> message <662b69bff331fae41771cf8833e819c2d5b18044.camel at redhat.com>:
> > On Fri, 2021‑01‑08 at 11:46 +0100, Ulrich Windl wrote:
> > > Hi!
> > >
> > > Trying to reproduce a problem that had occurred in the past after a
> > > "crm resource refresh" ("reprobe"), I noticed something on the DC
> > > that looks odd to me:
> > >
> > > Jan 08 11:13:21 h16 pacemaker-controld[4478]: notice: Forcing the status of all resources to be redetected
> > > Jan 08 11:13:21 h16 pacemaker-controld[4478]: warning: new_event_notification (4478-26817-13): Broken pipe (32)
> >
> > As an aside, the "Broken pipe" means the client disconnected before
> > getting all results back from the controller. It's not really a
> > problem. There has been some discussion about changing "Broken pipe"
> > to something like "Other side disconnected".
>
> Ken,
>
> thanks once again for explaining. I'm still confused... See below.
>
> >
> > > ### We had that before, already...
> > >
> > > Jan 08 11:13:21 h16 pacemaker-controld[4478]: notice: State transition S_IDLE -> S_POLICY_ENGINE
> > > Jan 08 11:13:21 h16 pacemaker-schedulerd[4477]: notice: Watchdog will be used via SBD if fencing is required and stonith-watchdog-timeout is nonzero
> > > Jan 08 11:13:21 h16 pacemaker-schedulerd[4477]: notice:  * Start      prm_stonith_sbd     ( h16 )
> > > Jan 08 11:13:21 h16 pacemaker-schedulerd[4477]: notice:  * Start      prm_DLM:0           ( h18 )
> > > Jan 08 11:13:21 h16 pacemaker-schedulerd[4477]: notice:  * Start      prm_DLM:1           ( h19 )
> > > Jan 08 11:13:21 h16 pacemaker-schedulerd[4477]: notice:  * Start      prm_DLM:2           ( h16 )
> > > ...
> > >
> > > ## So basically an announcement to START everything that's running
> > > (everything is running); shouldn't that be "monitoring" (probe)
> > > instead?
> >
> > Pacemaker schedules all actions that could be needed to bring the
> > cluster to the desired state (per the configuration). However, later
> > actions depend on earlier actions getting certain results, and
> > everything will be recalculated if they don't.
> >
> > For clean-ups, Pacemaker schedules probes, and assumes they will all
> > return "not running", so it schedules starts to occur after them.
> > The logs above are indicating that.
>
> I thought a refresh / re-probe is there to compare the current state
> via probes to the last stored state, and THEN make any corrections
> needed.
> Assuming nothing is running is as wrong as assuming everything is
> running, IMHO.
Nope, a clean-up or refresh erases all history for a resource or
resources, which is what indicates to the scheduler that a new probe is
needed.
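To make that concrete, here is a tiny Python sketch of the idea (the
names are made up for illustration; this is not Pacemaker's code):

    def needs_probe(node_history, resource_id):
        """node_history maps resource id -> list of recorded operations."""
        return not node_history.get(resource_id)   # no history -> probe it

    node_history = {"prm_DLM": [("start", 0)]}
    print(needs_probe(node_history, "prm_DLM"))    # False: state is known
    node_history.pop("prm_DLM")                    # what a refresh does
    print(needs_probe(node_history, "prm_DLM"))    # True: a probe is scheduled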
> >
> > However:
> >
> > >
> > > Jan 08 11:13:21 h16 pacemaker-controld[4478]: notice: Initiating monitor operation prm_stonith_sbd_monitor_0 on h19
> > > Jan 08 11:13:21 h16 pacemaker-controld[4478]: notice: Initiating monitor operation prm_stonith_sbd_monitor_0 on h18
> > > Jan 08 11:13:21 h16 pacemaker-controld[4478]: notice: Initiating monitor operation prm_stonith_sbd_monitor_0 locally on h16
> > > ...
> > > ### So _probes_ are started,
> > >
> > > Jan 08 11:13:21 h16 pacemaker-controld[4478]: notice: Transition 139 aborted by operation prm_testVG_testLV_activate_monitor_0 'modify' on h16: Event failed
> > > Jan 08 11:13:21 h16 pacemaker-controld[4478]: notice: Transition 139 action 7 (prm_testVG_testLV_activate_monitor_0 on h16): expected 'not running' but got 'ok'
> > > Jan 08 11:13:21 h16 pacemaker-controld[4478]: notice: Transition 139 action 19 (prm_testVG_testLV_activate_monitor_0 on h18): expected 'not running' but got 'ok'
> > > Jan 08 11:13:21 h16 pacemaker-controld[4478]: notice: Transition 139 action 31 (prm_testVG_testLV_activate_monitor_0 on h19): expected 'not running' but got 'ok'
> > > ...
> > > ### That's odd, because the clone WAS running on each node.
> > > (Similar results were reported for other clones)
> >
> > The probes don't return "not running", they return "ok" since the
> > resources are actually running. That prevents the starts from
> > actually happening, and everything is recalculated with the new
> > probe results.
>
> With the comment made earlier, the differences observed (and thus the
> messages being logged) would be much smaller if only real differences
> had to be considered.
>
> >
> > Pacemaker doesn't expect the probes to return "ok" because the
> > clean-up cleared all information that would lead it to do so. In
> > other words, Pacemaker doesn't remember any state from before the
> > clean-up. That's
>
> I did not do a cleanup; I did re-probe.
The only difference is that clean-up will erase the history of a
resource only if there's a failure, while reprobe will erase history
regardless.
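Expressed as a simplified Python sketch (not the actual code path, and
treating any non-zero return code as a failure for brevity, which
glosses over probe results):

    def should_erase(history, refresh):
        """history: list of (operation, rc) tuples; refresh is True for a reprobe."""
        had_failure = any(rc != 0 for _op, rc in history)
        return refresh or had_failure

    print(should_erase([("monitor", 0), ("start", 0)], refresh=False))  # False
    print(should_erase([("monitor", 0), ("start", 0)], refresh=True))   # True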
> Maybe the implementation of re-probe is just that of "cleanup
> everything", but I'd consider that quite wrong (while easy to
> implement, probably).
> If I "cleanup", the resource state is lost or unknown and has to be
> probed to get the state. But when I reprobe, I have an assumed state
> and a real state. A probe is needed to compare these for differences.
> Only if there ARE differences should actions be scheduled.
The problem is that the current state is not a probe result alone -- it
is the cumulative effect of all operations that have run on the
resource.
For example, if a resource history is just one probe with a result of
"not running", then the resource state is inactive. But if the resource
history is a probe with a result of "not running" plus a successful
start, then the resource state is active.
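As a minimal Python sketch of that calculation (a simplification of the
idea, not the scheduler's actual code; rc 0 is "ok" and rc 7 is "not
running", as in the OCF return codes):

    def current_state(history):
        """history: ordered list of (operation, rc) tuples for one resource."""
        state = "unknown"                  # no history at all -> probe needed
        for op, rc in history:
            if op == "monitor" and rc == 7:
                state = "inactive"
            elif op in ("monitor", "start") and rc == 0:
                state = "active"
            elif op == "stop" and rc == 0:
                state = "inactive"
        return state

    print(current_state([("monitor", 7)]))                 # inactive
    print(current_state([("monitor", 7), ("start", 0)]))   # active
    print(current_state([]))                               # unknown, as after a refresh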
Thus, simply re-probing wouldn't be sufficient -- we'd have to reprobe
*and* erase all previous history at the same time as recording the new
probe result. That would be possible, but it isn't how it's done now;
instead we erase the history first and let the probe be scheduled the
same as if the node had just started.
To summarize, we could get the result you have in mind by (1) allowing
resource history entries to take a new "obsolete" attribute, (2) having
cleanup/refresh mark history as obsolete instead of removing it, (3)
having the scheduler schedule a new probe when a resource has only
obsolete history (with an expected result equivalent to the resource
state calculated from the entire obsolete history); and (4) erasing the
obsolete history when the new probe result is recorded (to keep the CIB
from growing indefinitely). (Plus some timing issues to consider.)
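Purely as a hypothetical sketch of what step (3) might look like
(reusing the current_state() helper from the earlier sketch; none of
this is existing Pacemaker code):

    def expected_probe_result(history):
        """history: list of (operation, rc, obsolete) tuples for one resource."""
        live = [(op, rc) for op, rc, obsolete in history if not obsolete]
        if live:
            return None                      # current history -> no new probe needed
        if not history:
            return "not running"             # erased outright: today's behaviour
        return current_state([(op, rc) for op, rc, _ in history])

    ops = [("monitor", 7, True), ("start", 0, True)]   # history marked obsolete
    print(expected_probe_result(ops))                  # "active", not "not running"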
> I've seen that re-probe caused problems in the past; now I'm beginning
> to understand why.
>
> > why it needs the probes, to find out what the current state actually
> > is.
> >
> > To change it to be more "human", we'd have to change clean-up to
> > mark existing state as obsolete rather than remove it entirely. Then
> > Pacemaker could see that new probes are needed, but use the last
> > known result as the expected result. That could save some
> > recalculation and make the logs easier to follow, but it would
> > complicate the resource history handling and wouldn't change the end
> > result.
>
> The end result maybe not, but I'm not convinced about needless
> operations.
>
> >
> > > Jan 08 11:13:43 h16 pacemaker-controld[4478]: notice: Transition 140 (Complete=34, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-79.bz2): Complete
> > > Jan 08 11:13:43 h16 pacemaker-controld[4478]: notice: State transition S_TRANSITION_ENGINE -> S_IDLE
> > > ### So in the end nothing was actually started, but those messages
> > > are quite confusing.
>
> I think pacemaker is "too noisy" log-wise: locating problems is thus
> very much like seeking a needle in a haystack.
>
> ...
> Regards,
> Ulrich
--
Ken Gaillot <kgaillot at redhat.com>