[ClusterLabs] Antw: Antw: Re: Q: "crmd[7281]: warning: new_event_notification (7281-97955-15): Broken pipe (32)" as response to resource cleanup

Ulrich Windl Ulrich.Windl at rz.uni-regensburg.de
Tue Aug 13 05:06:25 EDT 2019


Hi,

an update:
After setting a failure-timeout for the resource, the stale monitor failure
was removed automatically at the next cluster recheck (it seems).
Still I wonder why a resource cleanup didn't do that (a bug?).
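
For reference, a failure-timeout is set as resource meta data; a minimal
sketch (the resource name prm_nfs_server is from this thread, the 10min
value is illustrative, not what was actually configured):

```shell
# Sketch: let stale failures expire automatically at the next cluster
# recheck. The 10min value is an illustrative choice.
crm_resource --resource prm_nfs_server --meta \
    --set-parameter failure-timeout --parameter-value 10min

# crmsh equivalent:
# crm resource meta prm_nfs_server set failure-timeout 10min
```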

Regards,
Ulrich


>>> "Ulrich Windl" <Ulrich.Windl at rz.uni-regensburg.de> wrote on 13.08.2019 at
10:07 in message <5D526FB0020000A100032D82 at gwsmtp.uni-regensburg.de>:
>>>> Ken Gaillot <kgaillot at redhat.com> wrote on 13.08.2019 at 01:03 in message
> <ec355e98371d22d6c83e4927f65fdfa711266c21.camel at redhat.com>:
>> On Mon, 2019-08-12 at 17:46 +0200, Ulrich Windl wrote:
>>> Hi!
>>> 
>>> I just noticed that a "crm resource cleanup <rsc>" caused some
>>> unexpected behavior and the syslog message:
>>> crmd[7281]:  warning: new_event_notification (7281-97955-15): Broken
>>> pipe (32)
>>> 
>>> It's SLES12 SP4, last updated Sept. 2018 (up since then,
>>> pacemaker-1.1.19+20180928.0d2680780-1.8.x86_64).
>>> 
>>> The cleanup was due to a failed monitor. As an unexpected consequence
>>> of this cleanup, CRM seemed to restart the complete resource (and
>>> dependencies), even though it was running.
>> 
>> I assume the monitor failure was old, and recovery had already
>> completed? If not, recovery might have been initiated before the
>> clean-up was recorded.
>> 
>>> I noticed that a manual "crm_resource -C -r <rsc> -N <node>" command
>>> has the same effect (multiple resources are "Cleaned up", and resources
>>> are restarted, seemingly before the "probe" is done).
>> 
>> Can you verify whether the probes were done? The DC should log a
>> message when each <rsc>_monitor_0 result comes in.
> 
> So here's a rough sketch of events:
> 17:10:23 crmd[7281]:   notice: State transition S_IDLE -> S_POLICY_ENGINE
> ...no probes yet...
> 17:10:24 pengine[7280]:  warning: Processing failed monitor of 
> prm_nfs_server
> on rksaph11: not running
> ...lots of starts/restarts...
> 17:10:24 pengine[7280]:   notice:  * Restart    prm_nfs_server  
> ...
> 17:10:24 crmd[7281]:   notice: Processing graph 6628
> (ref=pe_calc-dc-1565622624-7313) derived from
> /var/lib/pacemaker/pengine/pe-input-1810.bz2
> ...monitors are being called...
> 17:10:24 crmd[7281]:   notice: Result of probe operation for prm_nfs_vg on
> h11: 0 (ok)
> ...the above was the first probe result...
> 17:10:24 crmd[7281]:  warning: Action 33 (prm_nfs_vg_monitor_0) on h11 failed
> (target: 7 vs. rc: 0): Error
> ...not surprising to me: the resource was running; I don't know why the
> cluster wants to start it...
> 17:10:24 crmd[7281]:   notice: Transition 6629 (Complete=9, Pending=0,
> Fired=0, Skipped=0, Incomplete=0,
> Source=/var/lib/pacemaker/pengine/pe-input-1811.bz2): Complete
> 17:10:24 crmd[7281]:   notice: State transition S_TRANSITION_ENGINE -> S_IDLE
> 
> The really bad thing after this is that the "cleaned up" resource still has a
> failed status (dated in the past, last-rc-change='Mon Aug 12 04:52:23 2019'),
> even though it is "running".
> 
> I tend to believe that the cluster is in a bad state, or the software has a
> problem cleaning the status of the monitor.
> 
> The CIB status for the resource looks like this:
>         <lrm_resource id="prm_nfs_server" type="nfsserver" class="ocf"
> provider="heartbeat">
>           <lrm_rsc_op id="prm_nfs_server_last_0"
> operation_key="prm_nfs_server_start_0" operation="start"
> crm-debug-origin="do_update_resource" crm_feature_set="3.0.14"
> transition-key="67:6583:0:d941efc1-de73-4ee4-b593-f65be9e90726"
> transition-magic="0:0;67:6583:0:d941efc1-de73-4ee4-b593-f65be9e90726"
> exit-reason="" on_node="h11" call-id="799" rc-code="0" op-status="0"
> interval="0" last-run="1565582351" last-rc-change="1565582351" 
> exec-time="708"
> queue-time="0" op-digest="73311a0ef4ba8e9f1f97e05e989f6348"/>
>           <lrm_rsc_op id="prm_nfs_server_monitor_60000"
> operation_key="prm_nfs_server_monitor_60000" operation="monitor"
> crm-debug-origin="do_update_resource" crm_feature_set="3.0.14"
> transition-key="68:6583:0:d941efc1-de73-4ee4-b593-f65be9e90726"
> transition-magic="0:0;68:6583:0:d941efc1-de73-4ee4-b593-f65be9e90726"
> exit-reason="" on_node="h11" call-id="800" rc-code="0" op-status="0"
> interval="60000" last-rc-change="1565582351" exec-time="499" queue-time="0"
> op-digest="9d8aa17b2a741c8328d7896459733e56"/>
>           <lrm_rsc_op id="prm_nfs_server_last_failure_0"
> operation_key="prm_nfs_server_monitor_60000" operation="monitor"
> crm-debug-origin="do_update_resource" crm_feature_set="3.0.14"
> transition-key="4:6568:0:d941efc1-de73-4ee4-b593-f65be9e90726"
> transition-magic="0:7;4:6568:0:d941efc1-de73-4ee4-b593-f65be9e90726"
> exit-reason="" on_node="h11" call-id="738" rc-code="7" op-status="0"
> interval="60000" last-rc-change="1565578343" exec-time="0" queue-time="0"
> op-digest="9d8aa17b2a741c8328d7896459733e56"/>
>         </lrm_resource>
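
For what it's worth, the stale record is easy to spot mechanically: the
*_last_failure_0 entry keeps rc-code="7" (not running) with the old
last-rc-change, while the other entries show rc-code="0". A small sketch
(Python, with the attributes of the history above abbreviated to the
relevant ones) that lists each recorded operation:

```python
# Sketch: inspect lrm_rsc_op history entries from a saved CIB status
# section. The XML is a trimmed copy of the prm_nfs_server history
# shown in this thread; only the relevant attributes are kept.
import xml.etree.ElementTree as ET
import time

cib_snippet = """<lrm_resource id="prm_nfs_server" type="nfsserver" class="ocf" provider="heartbeat">
  <lrm_rsc_op id="prm_nfs_server_last_0" operation="start" rc-code="0" op-status="0" last-rc-change="1565582351"/>
  <lrm_rsc_op id="prm_nfs_server_monitor_60000" operation="monitor" rc-code="0" op-status="0" last-rc-change="1565582351"/>
  <lrm_rsc_op id="prm_nfs_server_last_failure_0" operation="monitor" rc-code="7" op-status="0" last-rc-change="1565578343"/>
</lrm_resource>"""

root = ET.fromstring(cib_snippet)
for op in root.findall("lrm_rsc_op"):
    rc = int(op.get("rc-code"))
    when = time.strftime("%Y-%m-%d %H:%M:%S",
                         time.gmtime(int(op.get("last-rc-change"))))
    # rc 0 = ok, rc 7 = not running; anything nonzero here is a failure record
    print(f'{op.get("id")}: rc={rc} ({"ok" if rc == 0 else "FAILED"}) at {when} UTC')
```

The last_failure entry's timestamp (1565578343, i.e. 04:52:23 CEST) is
exactly the stale date seen above, hours before the cleanup.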
> 
> 
> Regards,
> Ulrich
> 
>> 
>>> Actually the manual says that when cleaning up a single primitive, the
>>> whole group is cleaned up, unless using --force. Well, I don't like
>>> this default, as I expect any status change from a probe would
>>> propagate to the group anyway...
>> 
>> In 1.1, clean-up always wipes the history of the affected resources,
>> regardless of whether the history is for success or failure. That means
>> all the cleaned resources will be reprobed. In 2.0, clean-up by default
>> wipes the history only if there's a failed action (--refresh/-R is
>> required to get the 1.1 behavior). That lessens the impact of the
>> "default to whole group" behavior.
>> 
>> I think the original idea was that a group indicates that the resources
>> are closely related, so changing the status of one member might affect
>> what status the others report.
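
In command form, the distinction described above looks like this (a
sketch, using the 2.0 option names mentioned; the resource name is from
this thread):

```shell
# Pacemaker 2.0 default: wipe history only where an action failed
crm_resource --cleanup -r prm_nfs_server

# Pacemaker 2.0: wipe all history and reprobe (the 1.1 "cleanup" behavior)
crm_resource --refresh -r prm_nfs_server
```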
>> 
>>> Regards,
>>> Ulrich
>> -- 
>> Ken Gaillot <kgaillot at redhat.com>
>> 
>> _______________________________________________
>> Manage your subscription:
>> https://lists.clusterlabs.org/mailman/listinfo/users 
>> 
>> ClusterLabs home: https://www.clusterlabs.org/ 
> 
> 
> 




