[ClusterLabs] Antw: [EXT] Re: Q: warning: new_event_notification (4527-22416-14): Broken pipe (32)

Reid Wahl nwahl at redhat.com
Thu Jan 7 02:04:23 EST 2021


Steffen, did you mean to reply to a different thread? I ask because
there's another active one and you haven't been involved in this one
yet :)

On Wed, Jan 6, 2021 at 11:01 PM Steffen Vinther Sørensen
<svinther at gmail.com> wrote:
>
> No, I don't set that, and it's not showing up in the output of 'pcs
> stonith show --full'.
>
> On Thu, Jan 7, 2021 at 7:46 AM Reid Wahl <nwahl at redhat.com> wrote:
> >
> > Do you have pcmk_monitor_timeout set? Ken mentioned that the start
> > operation includes a registration and a status call. There was a bug
> > in which the status call used the pcmk_monitor_timeout (if set), but
> > the start operation that triggered it used the default timeout. So if
> > pcmk_monitor_timeout was long (say, 120 seconds) and the status call
> > did not complete within the start operation's shorter default timeout,
> > the start operation would time out without receiving an action result.
> > This was fixed in pull 2108[1].
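> >
> > (For reference, a sketch of how one would check for and set that
> > attribute with pcs -- assuming the prm_stonith_sbd device named in the
> > logs below; exact syntax may vary by pcs version:)
> >
> >     # show the device's configured attributes, including
> >     # pcmk_monitor_timeout if it is set
> >     pcs stonith show prm_stonith_sbd
> >
> >     # set the attribute (120 is just the example value from above)
> >     pcs stonith update prm_stonith_sbd pcmk_monitor_timeout=120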
> >
> > Also, note that the error message said, "Node h18 did not send start
> > result (via controller) within 45000ms (action timeout plus
> > cluster-delay)". The start operation began at 09:29:29 and ended at
> > 09:31:14 (1min45sec). That's because the cluster-delay is 60s by
> > default and there was a logging bug[2] at one point: the time value in
> > the log message did not add the cluster-delay.
> >
> > [1] https://github.com/ClusterLabs/pacemaker/pull/2108
> > [2] https://github.com/ClusterLabs/pacemaker/commit/b542a8f
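> >
> > (To verify the cluster-delay actually in effect -- a sketch; with pcs,
> > something like the following should show the value, falling back to
> > the 60s default when unset:)
> >
> >     pcs property list --all | grep cluster-delay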
> >
> > On Fri, Dec 18, 2020 at 8:15 AM Ken Gaillot <kgaillot at redhat.com> wrote:
> > >
> > > On Fri, 2020-12-18 at 13:32 +0100, Ulrich Windl wrote:
> > > > > > > Andrei Borzenkov <arvidjaar at gmail.com> wrote on 18.12.2020
> > > > > > > at 12:17 in
> > > >
> > > > message <b82fc4d8-689c-4357-8f22-adc957fa698d at gmail.com>:
> > > > > On 18.12.2020 12:00, Ulrich Windl wrote:
> > > > > >
> > > > > > Maybe a related question: Do STONITH resources have special
> > > > > > rules, meaning they don't wait for successful fencing?
> > > > >
> > > > > Pacemaker resources in the CIB do not perform fencing. They only
> > > > > register fencing devices with fenced, which does the actual job.
> > > > > In particular ...
> > > > >
> > > > > > I saw this between fencing being initiated and fencing being
> > > > > > confirmed (h16 was DC, now h18 became DC):
> > > > > >
> > > > > > Dec 18 09:29:29 h18 pacemaker-controld[4479]:  notice: Processing
> > > > > > graph 0 (ref=pe_calc-dc-1608280169-21) derived from
> > > > > > /var/lib/pacemaker/pengine/pe-warn-9.bz2
> > > > > > Dec 18 09:29:29 h18 pacemaker-controld[4479]:  notice: Requesting
> > > > > > fencing (reboot) of node h16
> > > > > > Dec 18 09:29:29 h18 pacemaker-controld[4479]:  notice: Initiating
> > > > > > start operation prm_stonith_sbd_start_0 locally on h18
> > > > >
> > > > > ... "start" operation on pacemaker stonith resource only registers
> > > > > this
> > > > > device with fenced. It does *not* initiate stonith operation.
> > > >
> > > > Hi!
> > > >
> > > > Thanks, it's quite confusing: "notice: Initiating start operation"
> > > > sounds like something is to be started right now; if it's just
> > > > scheduled, "notice: Queueing start operation" or "notice: Planning
> > > > start operation" would be a better phrase IMHO.
> > >
> > > I think what Andrei is suggesting is that the start is not a fencing
> > > (reboot/off) action, just a registration of the device with the fencer.
> > > The start (registration) itself does happen at this point. Besides
> > > registration, it also involves a status call to the device.
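> > >
> > > (As an aside, what is actually registered with the fencer can be
> > > checked directly -- a sketch using stonith_admin:)
> > >
> > >     # list the fence devices currently registered with fenced
> > >     stonith_admin --list-registered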
> > >
> > > > >
> > > > > > ...
> > > > > > Dec 18 09:31:14 h18 pacemaker-controld[4479]:  error: Node h18
> > > > > > did not send start result (via controller) within 45000ms
> > > > > > (action timeout plus cluster-delay)
> > > > >
> > > > > I am not sure what happened here. Either fenced took a very long
> > > > > time to respond, or something went wrong with the communication
> > > > > between them.
> > > >
> > > > This looks new in the current pacemaker. As explained in an earlier
> > > > message, we use a rather long fencing/stonith timeout, so the
> > > > confirmation may arrive rather late (but still before the node comes
> > > > back online). I didn't see this in comparable configurations using
> > > > older pacemaker.
> > >
> > > Only the message is new ... previously it was something inscrutable
> > > like "Timer popped".
> > >
> > > Actions involve multiple steps across multiple daemons, so there are
> > > multiple places where something could time out. In this case, the
> > > controller didn't get a result from the fencer in time.
> > >
> > > Normally, this should never happen, because if the action itself times
> > > out, the fencer will report that.
> > >
> > > > > > Dec 18 09:31:14 h18 pacemaker-controld[4479]:  error: [Action   22]:
> > > > > > In-flight resource op prm_stonith_sbd_start_0 on h18
> > > > > > (priority: 9900, waiting: (null))
> > > > > > Dec 18 09:31:14 h18 pacemaker-controld[4479]:  notice: Transition 0
> > > > > > aborted: Action lost
> > > > > > Dec 18 09:31:14 h18 pacemaker-controld[4479]:  warning: rsc_op 22:
> > > > > > prm_stonith_sbd_start_0 on h18 timed out
> > >
> > > The above is just the consequence of not getting the start result --
> > > the controller considers the action timed out and aborts the current
> > > transition.
> > >
> > > > > > ...
> > > > > > Dec 18 09:31:15 h18 pacemaker-controld[4479]:  notice: Peer h16 was
> > > > > > terminated (reboot) by h18 on behalf of pacemaker-controld.4527: OK
> > >
> > > Here the fencing action returns success.
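> > >
> > > (The fencer's record of this can be inspected after the fact -- a
> > > sketch; h16 is the fenced node from these logs:)
> > >
> > >     # show fencing actions recorded for node h16
> > >     stonith_admin --history h16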
> > >
> > > > > > Dec 18 09:31:17 h18 pacemaker-execd[4476]:  notice: prm_stonith_sbd
> > > > > > start (call 164) exited with status 0 (execution time 110960ms,
> > > > > > queue time 15001ms)
> > >
> > > The device registration already in progress does eventually complete
> > > successfully. The controller has already considered it lost and moved
> > > on, so this result will be ignored.
> > >
> > > Looking at the queue time, I'm going to guess that what happened was
> > > that the status action that's part of the start was queued behind the
> > > reboot action in progress, and the queue time was enough to push it
> > > over the total expected timeout.
> >
> > Looks plausible. Though it's strange to me that the queue time is only
> > 15 seconds, that the reboot operation completes immediately after the
> > start operation fails, and that the executor then says the start
> > completed successfully after 110s of execution time :/
> >
> > > This is a known issue when fence actions are serialized. The fencer
> > > starts its timeout once the action is begun (i.e. after queueing), but
> > > the controller doesn't know about the queueing and starts its timeout
> > > once it has submitted the request to the fencer (i.e. before queueing).
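> > >
> > > (Schematically, with this log's numbers -- a sketch, not exact
> > > internals:)
> > >
> > >     controller: timer (45s + 60s cluster-delay) starts at submission
> > >                 [ ~15s queued behind reboot ][ status action runs ...]
> > >     fencer:     its own timer starts only once the action is dequeued
> > >
> > > So the controller can give up while the fencer still considers the
> > > action to be within its timeout.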
> > >
> > > Red Hat BZ#1858462 is related, but there's no ClusterLabs BZ at this
> > > point. It's a tricky problem, so it will be a significant project.
> > >
> > > > > It could be related to pending fencing, but I am not familiar
> > > > > with the low-level details.
> > > >
> > > > It looks odd: first "started", then timed out with an error, then
> > > > successful (without being rescheduled, it seems).
> > > >
> > > > >
> > > > > > ...
> > > > > > Dec 18 09:31:30 h18 pacemaker-controld[4479]:  notice: Peer h16 was
> > > > > > terminated (reboot) by h19 on behalf of pacemaker-controld.4479: OK
> > > > > > Dec 18 09:31:30 h18 pacemaker-controld[4479]:  notice: Transition 0
> > > > > > (Complete=31, Pending=0, Fired=0, Skipped=1, Incomplete=3,
> > > > > > Source=/var/lib/pacemaker/pengine/pe-warn-9.bz2): Stopped
> > > >
> > > > So here's the delayed stonith confirmation.
> > > >
> > > > > > ...
> > > > > > Dec 18 09:31:30 h18 pacemaker-schedulerd[4478]:  warning: Unexpected
> > > > > > result (error) was recorded for start of prm_stonith_sbd on h18
> > > > > > at Dec 18 09:31:14 2020
> > > > > > Dec 18 09:31:30 h18 pacemaker-schedulerd[4478]:  notice:  * Recover
> > > > > > prm_stonith_sbd                      (             h18 )
> > > >
> > > > Then, after the successful start, another "recovery". Isn't that
> > > > very odd?
> > >
> > > The first start was considered timed out, even though it eventually
> > > completed successfully, so the device had to be recovered (stop+start)
> > > due to the start timeout.
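> > >
> > > (If a device gets stuck with such a failed start recorded, the
> > > failure can be cleared manually and the scheduler will recompute what
> > > to do with the device -- a sketch:)
> > >
> > >     # clear the failed start from the resource history
> > >     crm_resource --cleanup --resource prm_stonith_sbd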
> > >
> > > >
> > > > Regards,
> > > > Ulrich
> > > --
> > > Ken Gaillot <kgaillot at redhat.com>
> > >



-- 
Regards,

Reid Wahl, RHCA
Senior Software Maintenance Engineer, Red Hat
CEE - Platform Support Delivery - ClusterHA


