[ClusterLabs] Q: "Re-initiated expired calculated failure"

Wed Aug 14 04:24:51 EDT 2019

(subject changed for existing thread)

Hi!

After I had thought the problem with the sticky failed monitor was solved
eventually, I realized that I'm getting a message that I don't really
understand after each cluster recheck interval:

pengine[7280]:   notice: Re-initiated expired calculated failure
prm_nfs_server_monitor_60000 (rc=7,
magic=0:7;4:6568:0:d941efc1-de73-4ee4-b593-f65be9e90726) on h11

The message repeats absolutely identical. So what does it mean? The monitor
did not fail between cluster rechecks, and crm_mon is not displaying any failed
operations.

Regards,
Ulrich

>>> Ulrich Windl schrieb am 13.08.2019 um 11:06 in Nachricht <5D527D91.124 :
161 :
60728>:
> Hi,
> 
> an update:
> After setting a failure-timeout for the resource that stale monitor failure

> was removed automatically at next cluster recheck (it seems).
> Still I wonder why a resource cleanup didn't do that (bug?).
> 
> Regards,
> Ulrich
> 
> 
> >>> "Ulrich Windl" <Ulrich.Windl at rz.uni-regensburg.de> schrieb am 13.08.2019
um
> 10:07 in Nachricht <5D526FB0020000A100032D82 at gwsmtp.uni-regensburg.de>:
> >>>> Ken Gaillot <kgaillot at redhat.com> schrieb am 13.08.2019 um 01:03 in
> > Nachricht
> > <ec355e98371d22d6c83e4927f65fdfa711266c21.camel at redhat.com>:
> >> On Mon, 2019‑08‑12 at 17:46 +0200, Ulrich Windl wrote:
> >>> Hi!
> >>> 
> >>> I just noticed that a "crm resource cleanup <rsc>" caused some
> >>> unexpected behavior and the syslog message:
> >>> crmd[7281]:  warning: new_event_notification (7281‑97955‑15): Broken
> >>> pipe (32)
> >>> 
> >>> It's SLES14 SP4 last updated Sept. 2018 (up since then, pacemaker‑
> >>> 1.1.19+20180928.0d2680780‑1.8.x86_64).
> >>> 
> >>> The cleanup was due to a failed monitor. As an unexpected consequence
> >>> of this cleanup, CRM seemed to restart the complete resource (and
> >>> dependencies), even though it was running.
> >> 
> >> I assume the monitor failure was old, and recovery had already
> >> completed? If not, recovery might have been initiated before the clean‑
> >> up was recorded.
> >> 
> >>> I noticed that a manual "crm_resource ‑C ‑r <rsc> ‑N <node>" command
> >>> has the same effect (multiple resources are "Cleaned up", resources
> >>> are restarted seemingly before the "probe" is done.).
> >> 
> >> Can you verify whether the probes were done? The DC should log a
> >> message when each <rsc>_monitor_0 result comes in.
> > 
> > So here's a rough sketch of events:
> > 17:10:23 crmd[7281]:   notice: State transition S_IDLE -> S_POLICY_ENGINE
> > ...no probes yet...
> > 17:10:24 pengine[7280]:  warning: Processing failed monitor of 
> > prm_nfs_server
> > on rksaph11: not running
> > ...lots of starts/restarts...
> > 17:10:24 pengine[7280]:   notice:  * Restart    prm_nfs_server  
> > ...
> > 17:10:24 crmd[7281]:   notice: Processing graph 6628
> > (ref=pe_calc-dc-1565622624-7313) derived from
> > /var/lib/pacemaker/pengine/pe-input-1810.bz2
> > ...monitors are being called...
> > 17:10:24 crmd[7281]:   notice: Result of probe operation for prm_nfs_vg
on
> > h11: 0 (ok)
> > ...the above was the first probe result...
> > 17:10:24 crmd[7281]:  warning: Action 33 (prm_nfs_vg_monitor_0) on h11 
> > failed
> > (target: 7 vs. rc: 0): Error
> > ...not surprising to me: The resource was running; I don't know why the
> > cluster want to start it...
> > 17:10:24 crmd[7281]:   notice: Transition 6629 (Complete=9, Pending=0,
> > Fired=0, Skipped=0, Incomplete=0,
> > Source=/var/lib/pacemaker/pengine/pe-input-1811.bz2): Complete
> > 17:10:24 crmd[7281]:   notice: State transition S_TRANSITION_ENGINE -> 
> S_IDLE
> > 
> > The really bad thing after this is that the "cleaned up" resource still
has 
> > a
> > failed status (dated in the past (last-rc-change='Mon Aug 12 04:52:23 
> > 2019')),
> > even though "running".
> > 
> > I tend to believe that the cluster is in a bad state, or the software has
a
> > problem cleaning the status of the monitor.
> > 
> > The CIB status for the resource looks like this:
> >         <lrm_resource id="prm_nfs_server" type="nfsserver" class="ocf"
> > provider="heartbeat">
> >           <lrm_rsc_op id="prm_nfs_server_last_0"
> > operation_key="prm_nfs_server_start_0" operation="start"
> > crm-debug-origin="do_update_resource" crm_feature_set="3.0.14"
> > transition-key="67:6583:0:d941efc1-de73-4ee4-b593-f65be9e90726"
> > transition-magic="0:0;67:6583:0:d941efc1-de73-4ee4-b593-f65be9e90726"
> > exit-reason="" on_node="h11" call-id="799" rc-code="0" op-status="0"
> > interval="0" last-run="1565582351" last-rc-change="1565582351" 
> > exec-time="708"
> > queue-time="0" op-digest="73311a0ef4ba8e9f1f97e05e989f6348"/>
> >           <lrm_rsc_op id="prm_nfs_server_monitor_60000"
> > operation_key="prm_nfs_server_monitor_60000" operation="monitor"
> > crm-debug-origin="do_update_resource" crm_feature_set="3.0.14"
> > transition-key="68:6583:0:d941efc1-de73-4ee4-b593-f65be9e90726"
> > transition-magic="0:0;68:6583:0:d941efc1-de73-4ee4-b593-f65be9e90726"
> > exit-reason="" on_node="h11" call-id="800" rc-code="0" op-status="0"
> > interval="60000" last-rc-change="1565582351" exec-time="499"
queue-time="0"
> > op-digest="9d8aa17b2a741c8328d7896459733e56"/>
> >           <lrm_rsc_op id="prm_nfs_server_last_failure_0"
> > operation_key="prm_nfs_server_monitor_60000" operation="monitor"
> > crm-debug-origin="do_update_resource" crm_feature_set="3.0.14"
> > transition-key="4:6568:0:d941efc1-de73-4ee4-b593-f65be9e90726"
> > transition-magic="0:7;4:6568:0:d941efc1-de73-4ee4-b593-f65be9e90726"
> > exit-reason="" on_node="h11" call-id="738" rc-code="7" op-status="0"
> > interval="60000" last-rc-change="1565578343" exec-time="0" queue-time="0"
> > op-digest="9d8aa17b2a741c8328d7896459733e56"/>
> >         </lrm_resource>
> > 
> > 
> > Regards,
> > Ulrich
> > 
> >> 
> >>> Actually the manual says when cleaning up a single primitive, the
> >>> whole group is cleaned up, unless using ‑‑force. Well ,I don't like
> >>> this default, as I expect any status change from probe would
> >>> propagate to the group anyway...
> >> 
> >> In 1.1, clean‑up always wipes the history of the affected resources,
> >> regardless of whether the history is for success or failure. That means
> >> all the cleaned resources will be reprobed. In 2.0, clean‑up by default
> >> wipes the history only if there's a failed action (‑‑refresh/‑R is
> >> required to get the 1.1 behavior). That lessens the impact of the
> >> "default to whole group" behavior.
> >> 
> >> I think the original idea was that a group indicates that the resources
> >> are closely related, so changing the status of one member might affect
> >> what status the others report.
> >> 
> >>> Regards,
> >>> Ulrich
> >> ‑‑ 
> >> Ken Gaillot <kgaillot at redhat.com>
> >> 
> >> _______________________________________________
> >> Manage your subscription:
> >> https://lists.clusterlabs.org/mailman/listinfo/users 
> >> 
> >> ClusterLabs home: https://www.clusterlabs.org/ 
> > 
> > 
> > 
> > _______________________________________________
> > Manage your subscription:
> > https://lists.clusterlabs.org/mailman/listinfo/users 
> > 
> > ClusterLabs home: https://www.clusterlabs.org/ 
> 
> 
> 
>