[ClusterLabs] Antw: Re: Antw: [EXT] Re: A bug? (SLES15 SP2 with "crm resource refresh")

Ken Gaillot kgaillot at redhat.com
Mon Jan 11 10:45:36 EST 2021


On Mon, 2021-01-11 at 16:31 +0100, Ulrich Windl wrote:
> > > > Ken Gaillot <kgaillot at redhat.com> wrote on 11.01.2021 at
> > > > 15:46 in
> 
> Message
> <3df79a20eb4440357759cca4fe5b0e0729e47085.camel at redhat.com>:
> > On Mon, 2021-01-11 at 08:25 +0100, Ulrich Windl wrote:
> > > > > > Ken Gaillot <kgaillot at redhat.com> wrote on 08.01.2021 at
> > > > > > 17:38 in
> > > 
> > > Message
> > > <662b69bff331fae41771cf8833e819c2d5b18044.camel at redhat.com>:
> > > > > On Fri, 2021-01-08 at 11:46 +0100, Ulrich Windl wrote:
> > > > > Hi!
> > > > > 
> > > > > Trying to reproduce a problem that had occurred in the past
> > > > > after a "crm resource refresh" ("reprobe"), I noticed
> > > > > something on the DC that looks odd to me:
> > > > > 
> > > > > Jan 08 11:13:21 h16 pacemaker-controld[4478]:  notice: Forcing the status of all resources to be redetected
> > > > > Jan 08 11:13:21 h16 pacemaker-controld[4478]:  warning: new_event_notification (4478-26817-13): Broken pipe (32)
> > > > 
> > > > As an aside, the "Broken pipe" means the client disconnected
> > > > before
> > > > getting all results back from the controller. It's not really a
> > > > problem. There has been some discussion about changing "Broken
> > > > pipe" to
> > > > something like "Other side disconnected".
> > > 
> > > Ken,
> > > 
> > > thanks once again for explaining. I'm still confused... See below.
> > > 
> > > > 
> > > > > ### We had that before, already...
> > > > > 
> > > > > Jan 08 11:13:21 h16 pacemaker-controld[4478]:  notice: State transition S_IDLE -> S_POLICY_ENGINE
> > > > > Jan 08 11:13:21 h16 pacemaker-schedulerd[4477]:  notice: Watchdog will be used via SBD if fencing is required and stonith-watchdog-timeout is nonzero
> > > > > Jan 08 11:13:21 h16 pacemaker-schedulerd[4477]:  notice:  * Start      prm_stonith_sbd      ( h16 )
> > > > > Jan 08 11:13:21 h16 pacemaker-schedulerd[4477]:  notice:  * Start      prm_DLM:0            ( h18 )
> > > > > Jan 08 11:13:21 h16 pacemaker-schedulerd[4477]:  notice:  * Start      prm_DLM:1            ( h19 )
> > > > > Jan 08 11:13:21 h16 pacemaker-schedulerd[4477]:  notice:  * Start      prm_DLM:2            ( h16 )
> > > > > ...
> > > > > 
> > > > > ## So basically an announcement to START everything that's
> > > > > running (everything is running); shouldn't that be
> > > > > "monitoring" (probe) instead?
> > > > 
> > > > Pacemaker schedules all actions that could be needed to bring
> > > > the cluster to the desired state (per the configuration).
> > > > However, later actions depend on earlier actions getting
> > > > certain results, and everything will be recalculated if they
> > > > don't.
> > > > 
> > > > For clean-ups, Pacemaker schedules probes, and assumes they
> > > > will all return "not running", so it schedules starts to occur
> > > > after them. The logs above are indicating that.
> > > 
> > > I thought a refresh / re-probe is there to compare the current
> > > state via probes to the last stored state, and THEN make any
> > > corrections needed. Assuming nothing is running is as wrong as
> > > assuming everything is running, IMHO.
> > 
> > Nope, a clean-up or refresh erases all history for a resource or
> > resources, which is what indicates to the scheduler that a new
> > probe is
> > needed.
> > 
> > > > 
> > > > However:
> > > > 
> > > > > 
> > > > > Jan 08 11:13:21 h16 pacemaker-controld[4478]:  notice: Initiating monitor operation prm_stonith_sbd_monitor_0 on h19
> > > > > Jan 08 11:13:21 h16 pacemaker-controld[4478]:  notice: Initiating monitor operation prm_stonith_sbd_monitor_0 on h18
> > > > > Jan 08 11:13:21 h16 pacemaker-controld[4478]:  notice: Initiating monitor operation prm_stonith_sbd_monitor_0 locally on h16
> > > > > ...
> > > > > ### So _probes_ are started,
> > > > > 
> > > > > Jan 08 11:13:21 h16 pacemaker-controld[4478]:  notice: Transition 139 aborted by operation prm_testVG_testLV_activate_monitor_0 'modify' on h16: Event failed
> > > > > Jan 08 11:13:21 h16 pacemaker-controld[4478]:  notice: Transition 139 action 7 (prm_testVG_testLV_activate_monitor_0 on h16): expected 'not running' but got 'ok'
> > > > > Jan 08 11:13:21 h16 pacemaker-controld[4478]:  notice: Transition 139 action 19 (prm_testVG_testLV_activate_monitor_0 on h18): expected 'not running' but got 'ok'
> > > > > Jan 08 11:13:21 h16 pacemaker-controld[4478]:  notice: Transition 139 action 31 (prm_testVG_testLV_activate_monitor_0 on h19): expected 'not running' but got 'ok'
> > > > > ...
> > > > > ### That's odd, because the clone WAS running on each node.
> > > > > (Similar results were reported for other clones)
> > > > 
> > > > The probes don't return "not running", they return "ok" since
> > > > the
> > > > resources are actually running. That prevents the starts from
> > > > actually
> > > > happening, and everything is recalculated with the new probe
> > > > results.
> > > 
> > > With the comment made earlier, the differences observed (and
> > > thus the messages being logged) would be much smaller if only
> > > real differences had to be considered.
> > > 
> > > > 
> > > > Pacemaker doesn't expect the probes to return "ok" because the
> > > > clean-up cleared all information that would lead it to do so.
> > > > In other words, Pacemaker doesn't remember any state from
> > > > before the clean-up. That's
> > > 
> > > I did not do a cleanup; I did a re-probe.
> > 
> > The only difference is that clean-up will erase the history of a
> > resource only if there's a failure, while reprobe will erase
> > history
> > regardless.
> > 
> > > Maybe the implementation of re-probe is just that of "cleanup
> > > everything", but I'd consider that to be quite wrong (while easy
> > > to implement, probably). If I "cleanup", the resource state is
> > > lost or unknown and has to be probed to get the state. But when
> > > I reprobe, I have an assumed state and a real state. A probe is
> > > needed to compare these for differences. Only if there ARE
> > > differences should actions be scheduled.
> > 
> > The problem is that the current state is not a probe result alone
> > -- it
> > is the cumulative effect of all operations that have run on the
> > resource.
> > 
> > For example, if a resource history is just one probe with a result
> > of
> > "not running", then the resource state is inactive. But if the
> > resource
> > history is a probe with a result of "not running" plus a successful
> > start, then the resource state is active.
> 
> Yes, but a "start" would only be necessary if the resource should be
> running,
> but isn't. For just refreshing the "as is" status it's not necessary.

That's a different aspect -- I mean that if the *previous* history is a
probe plus a start, then the resource's calculated state is active. My
point is that the entire previous history, not just the previous probe
result, is needed to calculate the expected state (in the sense you're
using it) of the next probe.
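
To make that concrete, here is a rough sketch of the kind of history
the scheduler reads from the CIB status section (heavily trimmed,
most attributes omitted, resource name borrowed from your logs). A
probe (interval-0 monitor) that returned rc-code 7 ("not running")
followed by a start that returned rc-code 0 ("ok") is what adds up
to "this resource is active":

  <lrm_resource id="prm_DLM" ...>
    <!-- probe: rc-code 7 = not running -->
    <lrm_rsc_op operation="monitor" interval="0" rc-code="7" ... />
    <!-- later start: rc-code 0 = ok, so the calculated state is active -->
    <lrm_rsc_op operation="start" rc-code="0" ... />
  </lrm_resource>

A clean-up or reprobe removes that whole block, which is exactly
what tells the scheduler that a fresh probe is needed.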

I believe you're referring to the scheduler scheduling both a probe and
a start when there is no history. That's intentional -- a transition is
*all* steps needed to bring the cluster from the current state to the
desired (i.e. configured) state. Whether each step is *actually* done
depends on whether the earlier steps proceeded successfully.

Basically we're avoiding the need to calculate a new transition after
every action result. Each transition is self-complete as long as each
action within it has the assumed result. If it doesn't, no big deal, we
calculate a new one in that case.
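
You can see that in the logs you quoted: transition 139 was computed
with the assumed "not running" probe results plus the conditional
starts, the probes actually came back "ok", so transition 139 was
aborted, and transition 140 was computed from the real state and
completed without starting anything.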

> > Thus, simply re-probing wouldn't be sufficient -- we'd have to
> > reprobe
> > *and* erase all previous history at the same time as recording the
> > new
> > probe result. Which would be possible but isn't how it's done now,
> > instead we erase the history first and let the probe be scheduled
> > the
> > same as if the node had just started.
> 
> I still feel a reprobe should compare the current status with the
> expected
> status, and only if it's different update the state in CIB.

Right, but currently the scheduler can only schedule a probe if there
is no history, and it can't know the expected state (in the sense
you're using it) without the history. To do what you're suggesting,
we'd need to reimplement it along the lines below, to both keep the
history and schedule a probe.
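
Just to sketch the "obsolete" idea described below (purely
hypothetical -- no such attribute exists today), the refresh would
keep the old history entries but flag them, something like:

  <lrm_rsc_op operation="start" rc-code="0" obsolete="true" ... />

and a resource with only flagged entries would get a probe, with the
expected result calculated from those entries.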

> And at the end if there was any status change, call the CRM to
> perform any
> actions derived from the status update.
> > 
> > To summarize, we could get the result you have in mind by (1)
> > allowing
> > resource history entries to take a new "obsolete" attribute, (2)
> > having
> > cleanup/refresh mark history as obsolete instead of removing it,
> > (3)
> > having the scheduler schedule a new probe when a resource has only
> > obsolete history (with an expected result equivalent to the
> > resource
> > state calculated from the entire obsolete history); and (4) erasing
> > the
> > obsolete history when the new probe result is recorded (to keep the
> > CIB
> > from growing indefinitely). (Plus some timing issues to consider.)
> 
> Wouldn't a temporary local status variable do also?

No, the scheduler is stateless. All information that the scheduler
needs must be contained within the CIB.

The main advantages of that approach are (1) the scheduler can crash
and respawn without causing any problems; (2) the DC can be changed to
another node at any time without causing any problems; and (3) saved
CIBs can be replayed for debugging and testing purposes with the
identical result as a live cluster.
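
For example, the pe-input file from your logs can be replayed
offline with something like (exact options may vary by version):

  crm_simulate --simulate --xml-file /var/lib/pacemaker/pengine/pe-input-79.bz2

which recalculates the same transition the DC did, using nothing but
that saved CIB.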

> Regards,
> Ulrich
> 
> > 
> > > I've seen that re-probe caused problems in the past; now I'm
> > > beginning to understand why.
> > > 
> > > > why it needs the probes, to find out what the current state
> > > > actually
> > > > is.
> > > > 
> > > > To change it to be more "human", we'd have to change clean-up
> > > > to mark existing state as obsolete rather than remove it
> > > > entirely. Then Pacemaker could see that new probes are needed,
> > > > but use the last known result as the expected result. That
> > > > could save some recalculation and make the logs easier to
> > > > follow, but it would complicate the resource history handling
> > > > and wouldn't change the end result.
> > > 
> > > The end result maybe not, but I'm not convinced about needless
> > > operations.
> > > 
> > > > 
> > > > > Jan 08 11:13:43 h16 pacemaker-controld[4478]:  notice: Transition 140 (Complete=34, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-79.bz2): Complete
> > > > > Jan 08 11:13:43 h16 pacemaker-controld[4478]:  notice: State transition S_TRANSITION_ENGINE -> S_IDLE
> > > > > ### So in the end nothing was actually started, but those
> > > > > messages
> > > > > are quite confusing.
> > > 
> > > I think pacemaker is "too noisy" log-wise: locating problems
> > > thus becomes very much like seeking a needle in a haystack.
> > > 
> > > ...
> > > Regards,
> > > Ulrich

-- 
Ken Gaillot <kgaillot at redhat.com>


