[ClusterLabs] Antw: [EXT] Re: A bug? (SLES15 SP2 with "crm resource refresh")

Ulrich Windl Ulrich.Windl at rz.uni-regensburg.de
Mon Jan 11 02:25:54 EST 2021


>>> Ken Gaillot <kgaillot at redhat.com> wrote on 08.01.2021 at 17:38 in
message
<662b69bff331fae41771cf8833e819c2d5b18044.camel at redhat.com>:
> On Fri, 2021‑01‑08 at 11:46 +0100, Ulrich Windl wrote:
>> Hi!
>> 
>> Trying to reproduce a problem that had occurred in the past after a
>> "crm resource refresh" ("reprobe"), I noticed something on the
>> DC  that looks odd to me:
>> 
>> Jan 08 11:13:21 h16 pacemaker‑controld[4478]:  notice: Forcing the
>> status of all resources to be redetected
>> Jan 08 11:13:21 h16 pacemaker‑controld[4478]:  warning:
>> new_event_notification (4478‑26817‑13): Broken pipe (32)
> 
> As an aside, the "Broken pipe" means the client disconnected before
> getting all results back from the controller. It's not really a
> problem. There has been some discussion about changing "Broken pipe" to
> something like "Other side disconnected".

Ken,

thanks once again for explaining. I'm still confused... See below.

> 
>> ### We had that before, already...
>> 
>> Jan 08 11:13:21 h16 pacemaker‑controld[4478]:  notice: State
>> transition S_IDLE ‑> S_POLICY_ENGINE
>> Jan 08 11:13:21 h16 pacemaker‑schedulerd[4477]:  notice: Watchdog
>> will be used via SBD if fencing is required and stonith‑watchdog‑
>> timeout is nonzero
>> Jan 08 11:13:21 h16 pacemaker‑schedulerd[4477]:  notice:  *
>> Start      prm_stonith_sbd                      (             h16 )
>> Jan 08 11:13:21 h16 pacemaker‑schedulerd[4477]:  notice:  *
>> Start      prm_DLM:0                            (             h18 )
>> Jan 08 11:13:21 h16 pacemaker‑schedulerd[4477]:  notice:  *
>> Start      prm_DLM:1                            (             h19 )
>> Jan 08 11:13:21 h16 pacemaker‑schedulerd[4477]:  notice:  *
>> Start      prm_DLM:2                            (             h16 )
>> ...
>> 
>> ## So basically an announcement to START everything that's running
>> (everything is running); shouldn't that be "monitoring" (probe)
>> instead?
> 
> Pacemaker schedules all actions that could be needed to bring the
> cluster to the desired state (per the configuration). However later
> actions depend on earlier actions getting certain results, and
> everything will be recalculated if they don't.
> 
> For clean‑ups, Pacemaker schedules probes, and assumes they will all
> return "not running", so it schedules starts to occur after them. The
> logs above are indicating that.

I thought a refresh / re-probe is there to compare the current state (via
probes) to the last stored state, and THEN make any corrections needed.
Assuming nothing is running is as wrong as assuming everything is running,
IMHO.

> 
> However:
> 
>> 
>> Jan 08 11:13:21 h16 pacemaker‑controld[4478]:  notice: Initiating
>> monitor operation prm_stonith_sbd_monitor_0 on h19
>> Jan 08 11:13:21 h16 pacemaker‑controld[4478]:  notice: Initiating
>> monitor operation prm_stonith_sbd_monitor_0 on h18
>> Jan 08 11:13:21 h16 pacemaker‑controld[4478]:  notice: Initiating
>> monitor operation prm_stonith_sbd_monitor_0 locally on h16
>> ...
>> ### So _probes_ are started,
>> 
>> Jan 08 11:13:21 h16 pacemaker‑controld[4478]:  notice: Transition 139
>> aborted by operation prm_testVG_testLV_activate_monitor_0 'modify' on
>> h16: Event failed
>> Jan 08 11:13:21 h16 pacemaker‑controld[4478]:  notice: Transition 139
>> action 7 (prm_testVG_testLV_activate_monitor_0 on h16): expected 'not
>> running' but got 'ok'
>> Jan 08 11:13:21 h16 pacemaker‑controld[4478]:  notice: Transition 139
>> action 19 (prm_testVG_testLV_activate_monitor_0 on h18): expected
>> 'not running' but got 'ok'
>> Jan 08 11:13:21 h16 pacemaker‑controld[4478]:  notice: Transition 139
>> action 31 (prm_testVG_testLV_activate_monitor_0 on h19): expected
>> 'not running' but got 'ok'
>> ...
>> ### That's odd, because the clone WAS running on each node. (Similar 
>> results were reported for other clones)
> 
> The probes don't return "not running", they return "ok" since the
> resources are actually running. That prevents the starts from actually
> happening, and everything is recalculated with the new probe results.

Given my comment above, the differences observed (and thus the messages
being logged) would be far fewer if only real differences had to be
considered.
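
As a rough illustration of the behavior discussed above, here is a toy Python
model (not Pacemaker code; every function and variable name in it is made up
for this sketch). It assumes a refresh empties the recorded history, so each
probe is expected to return "not running" and a start is planned behind it.
A probe that returns "ok" instead aborts the transition and forces a
recalculation.

```python
# Toy model only: this is NOT how Pacemaker is implemented.
# All names below are hypothetical and exist only for illustration.
NOT_RUNNING = "not running"
OK = "ok"


def plan_transition(resources, history):
    """Plan probes for unknown resources and starts for stopped ones."""
    probes = [r for r in resources if r not in history]
    # A resource without history is assumed to be stopped.
    starts = [r for r in resources
              if history.get(r, NOT_RUNNING) == NOT_RUNNING]
    return probes, starts


def refresh(resources, actual_state):
    """Simulate a refresh: forget all history, then probe and recalculate."""
    state = dict(actual_state)   # work on a copy of the real state
    history = {}                 # emptied by the refresh
    log = []
    transitions = 0
    while True:
        probes, starts = plan_transition(resources, history)
        if not probes and not starts:
            return transitions, log
        transitions += 1
        aborted = False
        for rsc in probes:
            result = state[rsc]
            history[rsc] = result
            if result != NOT_RUNNING:
                aborted = True
                log.append(f"{rsc}: expected '{NOT_RUNNING}' but got '{result}'")
        if aborted:
            continue             # planned starts are dropped; recalculate
        for rsc in starts:
            state[rsc] = OK
            history[rsc] = OK
            log.append(f"{rsc}: started")


if __name__ == "__main__":
    running = {"prm_DLM": OK, "prm_testVG_testLV_activate": OK}
    count, messages = refresh(list(running), running)
    print(count)
    for line in messages:
        print(line)
```

In this model, with every resource already running, each probe produces one
"expected ... but got ..." message, the planned starts are never executed, and
the recalculated plan is empty.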

> 
> Pacemaker doesn't expect the probes to return "ok" because the clean‑up 
> cleared all information that would lead it to do so. In other words,
> Pacemaker doesn't remember any state from before the clean‑up. That's

I did not do a cleanup; I did a re-probe.
Maybe the implementation of re-probe is just that of "cleanup everything", but
I'd consider that quite wrong (though probably easy to implement).
If I "cleanup", the resource state is lost or unknown and has to be probed to
get the state. But when I re-probe, I have an assumed state and a real state. A
probe is needed to compare these for differences. Only if there ARE
differences should actions be scheduled.

I've seen that re-probe caused problems in the past; now I'm beginning to
understand why.

> why it needs the probes, to find out what the current state actually
> is.
> 
> To change it to be more "human", we'd have to change clean‑up to mark
> existing state as obsolete rather than remove it entirely. Then
> Pacemaker could see that new probes are needed, but use the last known
> result as the expected result. That could save some recalculation and
> make the logs easier to follow, but it would complicate the resource
> history handling and wouldn't change the end result.

Maybe not the end result, but I'm not convinced about the needless operations.
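
To make the trade-off concrete, the following toy Python sketch counts how many
probe results would come back "unexpected" under the two approaches discussed
above: forgetting all history (every probe expects "not running") versus
keeping the last known result as the expectation. This is only a model of the
idea; the function and parameter names are invented here, and the
"keep stale history" variant is the hypothetical change Ken describes, not
existing Pacemaker behavior.

```python
# Toy comparison only; names are hypothetical, not Pacemaker APIs.
NOT_RUNNING = "not running"
OK = "ok"


def count_unexpected(last_known, actual_state, keep_stale_history):
    """Count probes whose result differs from the scheduler's expectation."""
    unexpected = 0
    for rsc, actual in actual_state.items():
        if keep_stale_history:
            # Hypothetical variant: expect whatever was last recorded.
            expected = last_known.get(rsc, NOT_RUNNING)
        else:
            # History was cleared, so everything is assumed stopped.
            expected = NOT_RUNNING
        if actual != expected:
            unexpected += 1
    return unexpected


if __name__ == "__main__":
    last_known = {"h16": OK, "h18": OK, "h19": OK}
    actual = {"h16": OK, "h18": OK, "h19": OK}
    print(count_unexpected(last_known, actual, keep_stale_history=False))
    print(count_unexpected(last_known, actual, keep_stale_history=True))
```

When nothing has changed since the last recorded state, the cleared-history
variant reports a mismatch for every running instance, while the stale-history
variant reports none. Both variants still run the same number of probes; only
the number of surprises, and hence the log messages and recalculation,
differs.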

> 
>> Jan 08 11:13:43 h16 pacemaker‑controld[4478]:  notice: Transition 140
>> (Complete=34, Pending=0, Fired=0, Skipped=0, Incomplete=0,
>> Source=/var/lib/pacemaker/pengine/pe‑input‑79.bz2): Complete
>> Jan 08 11:13:43 h16 pacemaker‑controld[4478]:  notice: State
>> transition S_TRANSITION_ENGINE ‑> S_IDLE
>> ### So in the end nothing was actually started, but those messages
>> are quite confusing.

I think Pacemaker is "too noisy" log-wise: locating problems is thus very much
like looking for a needle in a haystack.

...
Regards,
Ulrich


