[ClusterLabs] Antw: [EXT] Re: Q: warning: new_event_notification (4527-22416-14): Broken pipe (32)
Ken Gaillot
kgaillot at redhat.com
Fri Dec 18 11:14:35 EST 2020
On Fri, 2020-12-18 at 13:32 +0100, Ulrich Windl wrote:
> > > > Andrei Borzenkov <arvidjaar at gmail.com> wrote on 18.12.2020 at
> > > > 12:17 in message <b82fc4d8-689c-4357-8f22-adc957fa698d at gmail.com>:
> > 18.12.2020 12:00, Ulrich Windl wrote:
> > >
> > > Maybe a related question: Do STONITH resources have special
> > > rules, meaning they don't wait for successful fencing?
> >
> > pacemaker resources in the CIB do not perform fencing. They only
> > register fencing devices with fenced, which does the actual job.
> > In particular ...
> >
> > > I saw this between fencing being initiated and fencing being
> > > confirmed (h16 was DC, now h18 became DC):
> > >
> > > Dec 18 09:29:29 h18 pacemaker-controld[4479]: notice: Processing
> > > graph 0 (ref=pe_calc-dc-1608280169-21) derived from
> > > /var/lib/pacemaker/pengine/pe-warn-9.bz2
> > > Dec 18 09:29:29 h18 pacemaker-controld[4479]: notice: Requesting
> > > fencing (reboot) of node h16
> > > Dec 18 09:29:29 h18 pacemaker-controld[4479]: notice: Initiating
> > > start operation prm_stonith_sbd_start_0 locally on h18
> >
> > ... "start" operation on pacemaker stonith resource only registers
> > this
> > device with fenced. It does *not* initiate stonith operation.
>
> Hi!
>
> Thanks, it's quite confusing: "notice: Initiating start operation"
> sounds like something is to be started right now; if it's just
> scheduled, "notice: Queueing start operation" or "notice: Planning
> start operation" would be a better phrase IMHO.
I think what Andrei is suggesting is that the start is not a fencing
(reboot/off) action, just a registration of the device with the fencer.
The start (registration) itself does happen at this point. Besides
registration, it also involves a status call to the device.
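To make the distinction concrete, here is a toy Python sketch of that
model (the class, names and return value are invented for illustration;
this is not Pacemaker code):

    # Toy model: a stonith resource "start" registers the device with
    # the fencer and runs a status call against it; no node is rebooted.
    class ToyFencer:
        def __init__(self):
            self.devices = {}

        def register(self, name, status_fn):
            # step 1 of "start": the device is only added to the registry
            self.devices[name] = status_fn

        def status(self, name):
            # step 2 of "start": verify the device answers
            return self.devices[name]()

    fencer = ToyFencer()
    fencer.register("prm_stonith_sbd", lambda: "OK")
    print(fencer.status("prm_stonith_sbd"))  # device usable; nothing was fenced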
> >
> > > ...
> > > Dec 18 09:31:14 h18 pacemaker-controld[4479]: error: Node h18
> > > did not send start result (via controller) within 45000ms (action
> > > timeout plus cluster-delay)
> >
> > I am not sure what happens here. Somehow fenced took a very long
> > time to respond, or something went wrong with the communication
> > between them.
>
> This looks new in the current pacemaker. As explained in an earlier
> message, we use a rather long fencing/stonith timeout, so the
> confirmation may arrive rather late (but still before the node gets
> online again). I didn't see this in comparable configurations using
> older pacemaker.
Only the message is new ... previously it was something inscrutable
like "Timer popped".
Actions involve multiple steps across multiple daemons, so there are
multiple places where something could time out. In this case, the
controller didn't get a result from the fencer in time.
Normally, this should never happen, because if the action itself times
out, the fencer will report that.
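For the 45000ms in the message above, the arithmetic is simply the
start operation's timeout plus the cluster-delay property. The exact
split isn't in the log, so the values below are just one plausible
assumption:

    # Assumed split of the controller-side deadline seen in the log;
    # only the 45000ms total is known from the message itself.
    action_timeout_ms = 30_000   # assumed timeout of the start operation
    cluster_delay_ms = 15_000    # assumed value of the cluster-delay property
    print(action_timeout_ms + cluster_delay_ms)  # 45000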
> > > Dec 18 09:31:14 h18 pacemaker-controld[4479]: error: [Action 22]:
> > > In-flight resource op prm_stonith_sbd_start_0 on h18
> > > (priority: 9900, waiting: (null))
> > > Dec 18 09:31:14 h18 pacemaker-controld[4479]: notice: Transition
> > > 0 aborted: Action lost
> > > Dec 18 09:31:14 h18 pacemaker-controld[4479]: warning: rsc_op 22:
> > > prm_stonith_sbd_start_0 on h18 timed out
The above is just the consequence of not getting the start result --
the controller considers the action timed out and aborts the current
transition.
> > > ...
> > > Dec 18 09:31:15 h18 pacemaker-controld[4479]: notice: Peer h16
> > > was terminated (reboot) by h18 on behalf of
> > > pacemaker-controld.4527: OK
Here the fencing action returns success.
> > > Dec 18 09:31:17 h18 pacemaker-execd[4476]: notice:
> > > prm_stonith_sbd start (call 164) exited with status 0 (execution
> > > time 110960ms, queue time 15001ms)
The device registration already in progress does eventually complete
successfully. The controller has already considered it lost and moved
on, so this result will be ignored.
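Roughly (and only as an illustration, not Pacemaker's actual data
structures), the controller behaves like this:

    # Once an action has been timed out it is dropped from the in-flight
    # table, so a result that arrives later no longer matches anything.
    in_flight = {22: "prm_stonith_sbd_start_0"}   # action 22 from the log

    def on_controller_timeout(action_id):
        in_flight.pop(action_id, None)            # action considered lost

    def on_result(action_id, rc):
        if action_id not in in_flight:
            return "ignored (no longer in flight)"
        return "recorded rc=%d" % rc

    on_controller_timeout(22)
    print(on_result(22, 0))   # -> ignored (no longer in flight)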
Looking at the queue time, I'm going to guess that what happened was
that the status action that's part of the start was queued behind the
reboot action in progress, and the queue time was enough to push it
over the total expected timeout.
This is a known issue when fence actions are serialized. The fencer
starts its timeout once the action is begun (i.e. after queueing), but
the controller doesn't know about the queueing and starts its timeout
once it has submitted the request to the fencer (i.e. before queueing).
Red Hat BZ#1858462 is related, but there's no ClusterLabs BZ at this
point. It's a tricky problem, so it will be a significant project.
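A small timeline sketch of that race (the numbers are made up to
resemble this case, not taken from Pacemaker defaults):

    # The fencer times the action from when it is dequeued; the controller
    # times it from when the request was submitted, i.e. including queueing.
    ACTION_TIMEOUT_MS = 30_000   # assumed per-action timeout on the fencer side
    CLUSTER_DELAY_MS = 15_000    # assumed cluster-delay added by the controller
    QUEUE_TIME_MS = 20_000       # time spent queued behind the reboot
    EXEC_TIME_MS = 28_000        # time the action takes once it actually runs

    fencer_ok = EXEC_TIME_MS <= ACTION_TIMEOUT_MS
    controller_ok = (QUEUE_TIME_MS + EXEC_TIME_MS
                     <= ACTION_TIMEOUT_MS + CLUSTER_DELAY_MS)

    print("fencer:    ", "OK" if fencer_ok else "timeout")       # OK
    print("controller:", "OK" if controller_ok else "timeout")   # timeout
    # The fencer reports success, but the controller has already declared
    # the action lost, because its clock also counted the queue time.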
> > It could be related to pending fencing, but I am not familiar with
> > the low-level details.
>
> It looks odd: first "started", then timed out with an error, then
> successful (without being rescheduled, it seems).
>
> >
> > > ...
> > > Dec 18 09:31:30 h18 pacemaker-controld[4479]: notice: Peer h16
> > > was terminated (reboot) by h19 on behalf of
> > > pacemaker-controld.4479: OK
> > > Dec 18 09:31:30 h18 pacemaker-controld[4479]: notice: Transition
> > > 0 (Complete=31, Pending=0, Fired=0, Skipped=1, Incomplete=3,
> > > Source=/var/lib/pacemaker/pengine/pe-warn-9.bz2): Stopped
>
> So here's the delayed stonith confirmation.
>
> > > ...
> > > Dec 18 09:31:30 h18 pacemaker-schedulerd[4478]: warning:
> > > Unexpected result (error) was recorded for start of
> > > prm_stonith_sbd on h18 at Dec 18 09:31:14 2020
> > > Dec 18 09:31:30 h18 pacemaker-schedulerd[4478]: notice: * Recover
> > > prm_stonith_sbd ( h18 )
>
> Then after successful start another "recovery". Isn't that very odd?
The first start was considered timed out, even though it eventually
completed successfully, so the device had to be recovered (stop+start)
due to the start timeout.
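In other words (an assumed sketch of the decision, not scheduler
source): a start recorded as failed forces a recover, i.e. a stop
followed by a start, regardless of the late success:

    # A failed start means the resource must be recovered (stop + start),
    # even if the original start later turns out to have succeeded.
    def recovery_actions(recorded_start_result):
        if recorded_start_result != "ok":   # here it was recorded as timed out
            return ["stop prm_stonith_sbd", "start prm_stonith_sbd"]
        return []

    print(recovery_actions("timed out"))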
>
> Regards,
> Ulrich
--
Ken Gaillot <kgaillot at redhat.com>