[ClusterLabs] Antw: Re: Q: "crmd[7281]: warning: new_event_notification (7281-97955-15): Broken pipe (32)" as response to resource cleanup

Tue Aug 13 04:07:12 EDT 2019

>>> Ken Gaillot <kgaillot at redhat.com> schrieb am 13.08.2019 um 01:03 in
Nachricht
<ec355e98371d22d6c83e4927f65fdfa711266c21.camel at redhat.com>:
> On Mon, 2019‑08‑12 at 17:46 +0200, Ulrich Windl wrote:
>> Hi!
>> 
>> I just noticed that a "crm resource cleanup <rsc>" caused some
>> unexpected behavior and the syslog message:
>> crmd[7281]:  warning: new_event_notification (7281‑97955‑15): Broken
>> pipe (32)
>> 
>> It's SLES14 SP4 last updated Sept. 2018 (up since then, pacemaker‑
>> 1.1.19+20180928.0d2680780‑1.8.x86_64).
>> 
>> The cleanup was due to a failed monitor. As an unexpected consequence
>> of this cleanup, CRM seemed to restart the complete resource (and
>> dependencies), even though it was running.
> 
> I assume the monitor failure was old, and recovery had already
> completed? If not, recovery might have been initiated before the clean‑
> up was recorded.
> 
>> I noticed that a manual "crm_resource ‑C ‑r <rsc> ‑N <node>" command
>> has the same effect (multiple resources are "Cleaned up", resources
>> are restarted seemingly before the "probe" is done.).
> 
> Can you verify whether the probes were done? The DC should log a
> message when each <rsc>_monitor_0 result comes in.

So here's a rough sketch of events:
17:10:23 crmd[7281]:   notice: State transition S_IDLE -> S_POLICY_ENGINE
...no probes yet...
17:10:24 pengine[7280]:  warning: Processing failed monitor of prm_nfs_server
on rksaph11: not running
...lots of starts/restarts...
17:10:24 pengine[7280]:   notice:  * Restart    prm_nfs_server  
...
17:10:24 crmd[7281]:   notice: Processing graph 6628
(ref=pe_calc-dc-1565622624-7313) derived from
/var/lib/pacemaker/pengine/pe-input-1810.bz2
...monitors are being called...
17:10:24 crmd[7281]:   notice: Result of probe operation for prm_nfs_vg on
h11: 0 (ok)
...the above was the first probe result...
17:10:24 crmd[7281]:  warning: Action 33 (prm_nfs_vg_monitor_0) on h11 failed
(target: 7 vs. rc: 0): Error
...not surprising to me: The resource was running; I don't know why the
cluster want to start it...
17:10:24 crmd[7281]:   notice: Transition 6629 (Complete=9, Pending=0,
Fired=0, Skipped=0, Incomplete=0,
Source=/var/lib/pacemaker/pengine/pe-input-1811.bz2): Complete
17:10:24 crmd[7281]:   notice: State transition S_TRANSITION_ENGINE -> S_IDLE

The really bad thing after this is that the "cleaned up" resource still has a
failed status (dated in the past (last-rc-change='Mon Aug 12 04:52:23 2019')),
even though "running".

I tend to believe that the cluster is in a bad state, or the software has a
problem cleaning the status of the monitor.

The CIB status for the resource looks like this:
        <lrm_resource id="prm_nfs_server" type="nfsserver" class="ocf"
provider="heartbeat">
          <lrm_rsc_op id="prm_nfs_server_last_0"
operation_key="prm_nfs_server_start_0" operation="start"
crm-debug-origin="do_update_resource" crm_feature_set="3.0.14"
transition-key="67:6583:0:d941efc1-de73-4ee4-b593-f65be9e90726"
transition-magic="0:0;67:6583:0:d941efc1-de73-4ee4-b593-f65be9e90726"
exit-reason="" on_node="h11" call-id="799" rc-code="0" op-status="0"
interval="0" last-run="1565582351" last-rc-change="1565582351" exec-time="708"
queue-time="0" op-digest="73311a0ef4ba8e9f1f97e05e989f6348"/>
          <lrm_rsc_op id="prm_nfs_server_monitor_60000"
operation_key="prm_nfs_server_monitor_60000" operation="monitor"
crm-debug-origin="do_update_resource" crm_feature_set="3.0.14"
transition-key="68:6583:0:d941efc1-de73-4ee4-b593-f65be9e90726"
transition-magic="0:0;68:6583:0:d941efc1-de73-4ee4-b593-f65be9e90726"
exit-reason="" on_node="h11" call-id="800" rc-code="0" op-status="0"
interval="60000" last-rc-change="1565582351" exec-time="499" queue-time="0"
op-digest="9d8aa17b2a741c8328d7896459733e56"/>
          <lrm_rsc_op id="prm_nfs_server_last_failure_0"
operation_key="prm_nfs_server_monitor_60000" operation="monitor"
crm-debug-origin="do_update_resource" crm_feature_set="3.0.14"
transition-key="4:6568:0:d941efc1-de73-4ee4-b593-f65be9e90726"
transition-magic="0:7;4:6568:0:d941efc1-de73-4ee4-b593-f65be9e90726"
exit-reason="" on_node="h11" call-id="738" rc-code="7" op-status="0"
interval="60000" last-rc-change="1565578343" exec-time="0" queue-time="0"
op-digest="9d8aa17b2a741c8328d7896459733e56"/>
        </lrm_resource>

Regards,
Ulrich

> 
>> Actually the manual says when cleaning up a single primitive, the
>> whole group is cleaned up, unless using ‑‑force. Well ,I don't like
>> this default, as I expect any status change from probe would
>> propagate to the group anyway...
> 
> In 1.1, clean‑up always wipes the history of the affected resources,
> regardless of whether the history is for success or failure. That means
> all the cleaned resources will be reprobed. In 2.0, clean‑up by default
> wipes the history only if there's a failed action (‑‑refresh/‑R is
> required to get the 1.1 behavior). That lessens the impact of the
> "default to whole group" behavior.
> 
> I think the original idea was that a group indicates that the resources
> are closely related, so changing the status of one member might affect
> what status the others report.
> 
>> Regards,
>> Ulrich
> ‑‑ 
> Ken Gaillot <kgaillot at redhat.com>
> 
> _______________________________________________
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users 
> 
> ClusterLabs home: https://www.clusterlabs.org/