[ClusterLabs] Antw: [EXT] Re: Q: warning: new_event_notification (4527-22416-14): Broken pipe (32)

Thu Jan 7 01:46:17 EST 2021

Do you have pcmk_monitor_timeout set? Ken mentioned that the start
operation includes a registration and a status call. There was a bug
in which the status call used the pcmk_monitor_timeout (if set), but
the start operation that triggered it used the default timeout. So if
the pcmk_monitor_timeout was long enough (say, 120 seconds) and the
status call did not complete by then, the start operation would time
out without receiving an action result. This was fixed in pull
2108[1].

Also, note that the error message said, "Node h18 did not send start
result (via controller) within 45000ms (action timeout plus
cluster-delay)". The start operation began at 09:29:29 and ended at
09:31:14 (1min45sec). That's because the cluster-delay is 60s by
default and there was a logging bug[2] at one point: the time value in
the log message did not add the cluster-delay.

[1] https://github.com/ClusterLabs/pacemaker/pull/2108
[2] https://github.com/ClusterLabs/pacemaker/commit/b542a8f

On Fri, Dec 18, 2020 at 8:15 AM Ken Gaillot <kgaillot at redhat.com> wrote:
>
> On Fri, 2020-12-18 at 13:32 +0100, Ulrich Windl wrote:
> > > > > Andrei Borzenkov <arvidjaar at gmail.com> schrieb am 18.12.2020 um
> > > > > 12:17 in
> >
> > Nachricht <b82fc4d8-689c-4357-8f22-adc957fa698d at gmail.com>:
> > > 18.12.2020 12:00, Ulrich Windl пишет:
> > > >
> > > > Maybe a related question: Do STONITH resources have special
> > > > rules, meaning
> > > they don't wait for successful fencing?
> > >
> > > pacemaker resources in CIB do not perform fencing. They only
> > > register
> > > fencing devices with fenced which does actual job. In particular
> > > ...
> > >
> > > > I saw this between fencing being initiated and fencing being
> > > > confirmed (h16
> > > was DC, now h18 became DC):
> > > >
> > > > Dec 18 09:29:29 h18 pacemaker-controld[4479]:  notice: Processing
> > > > graph 0
> > >
> > > (ref=pe_calc-dc-1608280169-21) derived from
> > > /var/lib/pacemaker/pengine/pe-warn-9.bz2
> > > > Dec 18 09:29:29 h18 pacemaker-controld[4479]:  notice: Requesting
> > > > fencing
> > >
> > > (reboot) of node h16
> > > > Dec 18 09:29:29 h18 pacemaker-controld[4479]:  notice: Initiating
> > > > start
> > >
> > > operation prm_stonith_sbd_start_0 locally on h18
> > >
> > > ... "start" operation on pacemaker stonith resource only registers
> > > this
> > > device with fenced. It does *not* initiate stonith operation.
> >
> > Hi!
> >
> > Thanks, it's quite confusing: "notice: Initiating start operation"
> > sounds like
> > something is to be started right now; if it's just scheduled,
> > "notice: Queueing
> > start operation" or "notice: Planning start operation" would be a
> > better phrase
> > IMHO.
>
> I think what Andrei is suggesting is that the start is not a fencing
> (reboot/off) action, just a registration of the device with the fencer.
> The start (registration) itself does happen at this point. Besides
> registration, it also involves a status call to the device.
>
> > >
> > > > ...
> > > > Dec 18 09:31:14 h18 pacemaker-controld[4479]:  error: Node h18
> > > > did not send
> > > start result (via controller) within 45000ms (action timeout plus
> > > cluster-delay)
> > >
> > > I am not sure what happens here. Somehow fenced took very long time
> > > to
> > > respond or something with communication between them.
> >
> > This looks new in the current pacemaker. As explained in an earlier
> > message we
> > use a rather long fencing/stonith timeout, so the confirmation may
> > arrive
> > rather late (but still before the node gets online again). I din't
> > see this in
> > comparable configurations using older pacemaker.
>
> Only the message is new ... previously it was something inscrutable
> like "Timer popped".
>
> Actions involve multiple steps across multiple daemons, so there are
> multiple places where something could time out. In this case, the
> controller didn't get a result from the fencer in time.
>
> Normally, this should never happen, because if the action itself times
> out, the fencer will report that.
>
> > > > Dec 18 09:31:14 h18 pacemaker-controld[4479]:  error:
> > > > [Action   22]:
> > >
> > > In-flight resource op prm_stonith_sbd_start_0      on h18
> > > (priority: 9900,
> > > waiting: (null))
> > > > Dec 18 09:31:14 h18 pacemaker-controld[4479]:  notice: Transition
> > > > 0
> >
> > aborted:
> > > Action lost
> > > > Dec 18 09:31:14 h18 pacemaker-controld[4479]:  warning: rsc_op
> > > > 22:
> > >
> > > prm_stonith_sbd_start_0 on h18 timed out
>
> The above is just the consequence of not getting the start result --
> the controller considers the action timed out and aborts the current
> transition.
>
> > > > ...
> > > > Dec 18 09:31:15 h18 pacemaker-controld[4479]:  notice: Peer h16
> > > > was
> > >
> > > terminated (reboot) by h18 on behalf of pacemaker-controld.4527: OK
>
> Here the action fencing action returns success
>
> > > > Dec 18 09:31:17 h18 pacemaker-execd[4476]:  notice:
> > > > prm_stonith_sbd start
> > >
> > > (call 164) exited with status 0 (execution time 110960ms, queue
> > > time
> >
> > 15001ms)
>
> The device registration already in progress does eventually complete
> successfully. The controller has already considered it lost and moved
> on, so this result will be ignored.
>
> Looking at the queue time, I'm going to guess that what happened was
> that the status action that's part of the start was queued behind the
> reboot action in progress, and the queue time was enough to push it
> over the total expected timeout.

Looks plausible. Though it's strange to me that the queue time is only
15 seconds, that the reboot operation completes immediately after the
start operation fails, and that the executor then says the start
completed successfully after 110s of execution time :/

> This is a known issue when fence actions are serialized. The fencer
> starts its timeout once the action is begun (i.e. after queueing), but
> the controller doesn't know about the queueing and starts its timeout
> once it has submitted the request to the fencer (i.e. before queueing).
>
> Red Hat BZ#1858462 is related, but there's no ClusterLabs BZ at this
> point. It's a tricky problem, so it will be a significant project.
>
> > > It could be related to pending fencing but I am not familiar with
> > > low
> > > level details.
> >
> > It looks odd: First "started", then timed out with error, then
> > successful
> > (without being rescheduled it seems).
> >
> > >
> > > > ...
> > > > Dec 18 09:31:30 h18 pacemaker-controld[4479]:  notice: Peer h16
> > > > was
> > >
> > > terminated (reboot) by h19 on behalf of pacemaker-controld.4479: OK
> > > > Dec 18 09:31:30 h18 pacemaker-controld[4479]:  notice: Transition
> > > > 0
> > >
> > > (Complete=31, Pending=0, Fired=0, Skipped=1, Incomplete=3,
> > > Source=/var/lib/pacemaker/pengine/pe-warn-9.bz2): Stopped
> >
> > So here's the delayed stonith confirmation.
> >
> > > > ...
> > > > Dec 18 09:31:30 h18 pacemaker-schedulerd[4478]:  warning:
> > > > Unexpected result
> > > (error) was recorded for start of prm_stonith_sbd on h18 at Dec 18
> > > 09:31:14
> > > 2020
> > > > Dec 18 09:31:30 h18 pacemaker-schedulerd[4478]:  notice:  *
> > > > Recover
> > >
> > > prm_stonith_sbd                      (             h18 )
> >
> > Then after successful start another "recovery". Isn't that very odd?
>
> The first start was considered timed out, even though it eventually
> completed successfully, so the device had to be recovered (stop+start)
> due to the start timeout.
>
> >
> > Regards,
> > Ulrich
> --
> Ken Gaillot <kgaillot at redhat.com>
>
> _______________________________________________
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users
>
> ClusterLabs home: https://www.clusterlabs.org/

-- 
Regards,

Reid Wahl, RHCA
Senior Software Maintenance Engineer, Red Hat
CEE - Platform Support Delivery - ClusterHA