[ClusterLabs] Antw: Antw: Re: Q: "crmd[7281]: warning: new_event_notification (7281-97955-15): Broken pipe (32)" as response to resource cleanup

Ken Gaillot kgaillot at redhat.com
Tue Aug 13 11:48:26 EDT 2019


On Tue, 2019-08-13 at 11:06 +0200, Ulrich Windl wrote:
> Hi,
> 
> an update:
> After setting a failure-timeout for the resource, the stale monitor
> failure was removed automatically at the next cluster recheck (it seems).
> Still I wonder why a resource cleanup didn't do that (bug?).

Possibly ... also possibly fixed already, as there have been a few
clean-up related fixes in the past few versions. I'm not sure what's
been backported in the build you have.
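
(As an aside, for anyone searching the archives: failure-timeout is a resource
meta-attribute, and expired failures are cleared at the next
cluster-recheck-interval. A rough sketch of setting it, using crm_resource long
options and the resource name from this thread; please verify the option names
against your crm_resource(8):

  # let recorded failures of prm_nfs_server expire after 10 minutes
  crm_resource --resource prm_nfs_server --meta \
      --set-parameter failure-timeout --parameter-value 10min

The same can be done in the resource's meta attributes with the configuration
tools, e.g. "meta failure-timeout=10min" on the primitive in crmsh.)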

> 
> Regards,
> Ulrich
> 
> 
> > > > "Ulrich Windl" <Ulrich.Windl at rz.uni-regensburg.de> schrieb am
> > > > 13.08.2019
> 
> um
> 10:07 in Nachricht <5D526FB0020000A100032D82 at gwsmtp.uni-regensburg.de
> >:
> > > > > Ken Gaillot <kgaillot at redhat.com> wrote on 13.08.2019 at 01:03
> > > > > in message
> > > > > <ec355e98371d22d6c83e4927f65fdfa711266c21.camel at redhat.com>:
> > > On Mon, 2019‑08‑12 at 17:46 +0200, Ulrich Windl wrote:
> > > > Hi!
> > > > 
> > > > I just noticed that a "crm resource cleanup <rsc>" caused some
> > > > unexpected behavior and the syslog message:
> > > > crmd[7281]:  warning: new_event_notification (7281‑97955‑15):
> > > > Broken
> > > > pipe (32)
> > > > 
> > > > It's SLES14 SP4 last updated Sept. 2018 (up since then,
> > > > pacemaker‑
> > > > 1.1.19+20180928.0d2680780‑1.8.x86_64).
> > > > 
> > > > The cleanup was due to a failed monitor. As an unexpected
> > > > consequence
> > > > of this cleanup, CRM seemed to restart the complete resource
> > > > (and
> > > > dependencies), even though it was running.
> > > 
> > > I assume the monitor failure was old, and recovery had already
> > > completed? If not, recovery might have been initiated before the
> > > clean‑
> > > up was recorded.
> > > 
> > > > I noticed that a manual "crm_resource -C -r <rsc> -N <node>" command
> > > > has the same effect (multiple resources are "Cleaned up", and
> > > > resources are restarted seemingly before the "probe" is done).
> > > 
> > > Can you verify whether the probes were done? The DC should log a
> > > message when each <rsc>_monitor_0 result comes in.
> > 
> > So here's a rough sketch of events:
> > 17:10:23 crmd[7281]:   notice: State transition S_IDLE ->
> > S_POLICY_ENGINE
> > ...no probes yet...
> > 17:10:24 pengine[7280]:  warning: Processing failed monitor of
> > prm_nfs_server on rksaph11: not running
> > ...lots of starts/restarts...
> > 17:10:24 pengine[7280]:   notice:  * Restart    prm_nfs_server  
> > ...
> > 17:10:24 crmd[7281]:   notice: Processing graph 6628
> > (ref=pe_calc-dc-1565622624-7313) derived from
> > /var/lib/pacemaker/pengine/pe-input-1810.bz2
> > ...monitors are being called...
> > 17:10:24 crmd[7281]:   notice: Result of probe operation for
> > prm_nfs_vg on
> > h11: 0 (ok)
> > ...the above was the first probe result...
> > 17:10:24 crmd[7281]:  warning: Action 33 (prm_nfs_vg_monitor_0) on h11
> > failed (target: 7 vs. rc: 0): Error
> > ...not surprising to me: the resource was running; I don't know why the
> > cluster wanted to start it...

That's normal; it's how pacemaker detects active resources after
clean-up. It schedules a probe and a start on the assumption that the
probe will find the resource not running; if it is running, the probe
result triggers a new transition in which the start is no longer needed.

That message will be improved in the next version (the change is already
in the master branch), for example:

notice: Transition 10 action 33 (prm_nfs_vg_monitor_0 on h11): expected
'not running' but got 'ok'
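
If you want to see exactly what the scheduler decided in that transition, you
can replay the pe-input file named in the log with crm_simulate. A minimal
sketch, using the file path from the log above (adjust for your installation):

  # replay the saved scheduler input and show the resulting actions/status
  crm_simulate --simulate --xml-file /var/lib/pacemaker/pengine/pe-input-1810.bz2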

> > 17:10:24 crmd[7281]:   notice: Transition 6629 (Complete=9,
> > Pending=0,
> > Fired=0, Skipped=0, Incomplete=0,
> > Source=/var/lib/pacemaker/pengine/pe-input-1811.bz2): Complete
> > 17:10:24 crmd[7281]:   notice: State transition S_TRANSITION_ENGINE ->
> > S_IDLE
> > 
> > The really bad thing after this is that the "cleaned up" resource still
> > has a failed status dated in the past (last-rc-change='Mon Aug 12
> > 04:52:23 2019'), even though it is "running".
> > 
> > I tend to believe that the cluster is in a bad state, or the
> > software has a
> > problem cleaning the status of the monitor.

It does sound like a clean-up bug, but I'm not aware of any current
issues. I suspect it's already fixed.
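
If you want to double-check what the cluster still has recorded after a
clean-up, a rough sketch (resource and node names taken from this thread;
the options should exist in 1.1.19, but please verify against the man pages):

  # one-shot status including per-resource fail counts
  crm_mon -1 --failcounts
  # query the fail count stored for this resource on this node
  crm_failcount --query --resource prm_nfs_server --node h11

If those are clean but the last_failure entry below is still present in the
CIB status section, that points at the clean-up not wiping the operation
history.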

> > The CIB status for the resource looks like this:
> >   <lrm_resource id="prm_nfs_server" type="nfsserver" class="ocf" provider="heartbeat">
> >     <lrm_rsc_op id="prm_nfs_server_last_0" operation_key="prm_nfs_server_start_0"
> >       operation="start" crm-debug-origin="do_update_resource" crm_feature_set="3.0.14"
> >       transition-key="67:6583:0:d941efc1-de73-4ee4-b593-f65be9e90726"
> >       transition-magic="0:0;67:6583:0:d941efc1-de73-4ee4-b593-f65be9e90726"
> >       exit-reason="" on_node="h11" call-id="799" rc-code="0" op-status="0" interval="0"
> >       last-run="1565582351" last-rc-change="1565582351" exec-time="708" queue-time="0"
> >       op-digest="73311a0ef4ba8e9f1f97e05e989f6348"/>
> >     <lrm_rsc_op id="prm_nfs_server_monitor_60000" operation_key="prm_nfs_server_monitor_60000"
> >       operation="monitor" crm-debug-origin="do_update_resource" crm_feature_set="3.0.14"
> >       transition-key="68:6583:0:d941efc1-de73-4ee4-b593-f65be9e90726"
> >       transition-magic="0:0;68:6583:0:d941efc1-de73-4ee4-b593-f65be9e90726"
> >       exit-reason="" on_node="h11" call-id="800" rc-code="0" op-status="0" interval="60000"
> >       last-rc-change="1565582351" exec-time="499" queue-time="0"
> >       op-digest="9d8aa17b2a741c8328d7896459733e56"/>
> >     <lrm_rsc_op id="prm_nfs_server_last_failure_0" operation_key="prm_nfs_server_monitor_60000"
> >       operation="monitor" crm-debug-origin="do_update_resource" crm_feature_set="3.0.14"
> >       transition-key="4:6568:0:d941efc1-de73-4ee4-b593-f65be9e90726"
> >       transition-magic="0:7;4:6568:0:d941efc1-de73-4ee4-b593-f65be9e90726"
> >       exit-reason="" on_node="h11" call-id="738" rc-code="7" op-status="0" interval="60000"
> >       last-rc-change="1565578343" exec-time="0" queue-time="0"
> >       op-digest="9d8aa17b2a741c8328d7896459733e56"/>
> >   </lrm_resource>
> > 
> > 
> > Regards,
> > Ulrich
> > 
> > > 
> > > > Actually the manual says that when cleaning up a single primitive,
> > > > the whole group is cleaned up, unless --force is used. Well, I don't
> > > > like this default, as I expect any status change from a probe would
> > > > propagate to the group anyway...
> > > 
> > > In 1.1, clean-up always wipes the history of the affected resources,
> > > regardless of whether the history is for success or failure. That
> > > means all the cleaned resources will be reprobed. In 2.0, clean-up by
> > > default wipes the history only if there's a failed action (--refresh/-R
> > > is required to get the 1.1 behavior). That lessens the impact of the
> > > "default to whole group" behavior.
> > > 
> > > I think the original idea was that a group indicates that the
> > > resources
> > > are closely related, so changing the status of one member might
> > > affect
> > > what status the others report.
> > > 
> > > > Regards,
> > > > Ulrich
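
To make the 1.1 vs. 2.0 difference above concrete, a rough sketch of the
commands involved, using the resource and node names from this thread (the
per-resource --refresh described here is the 2.0 behavior):

  # 2.0: wipe only the failed-action history (whole group unless --force)
  crm_resource --cleanup --resource prm_nfs_server --node h11
  # 2.0: wipe all history and reprobe, i.e. the old 1.1 clean-up behavior
  crm_resource --refresh --resource prm_nfs_server --node h11
  # limit the clean-up to this one group member instead of the whole group
  crm_resource --cleanup --resource prm_nfs_server --node h11 --force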
-- 
Ken Gaillot <kgaillot at redhat.com>


