[ClusterLabs] Antw: Re: Antw: [EXT] Re: Q: warning: new_event_notification (4527-22416-14): Broken pipe (32)

Ulrich Windl Ulrich.Windl at rz.uni-regensburg.de
Thu Jan 7 03:43:08 EST 2021


>>> Reid Wahl <nwahl at redhat.com> wrote on 07.01.2021 at 07:46 in message
<CAPiuu99Wbp0sfK8FiKNSq0_0kGT=i7vwc3nHC7VzxKFv+_TH=A at mail.gmail.com>:
> Do you have pcmk_monitor_timeout set? Ken mentioned that the start

Hi!

No, not found in the CIB.

> operation includes a registration and a status call. There was a bug
> in which the status call used the pcmk_monitor_timeout (if set), but
> the start operation that triggered it used the default timeout. So if
> the pcmk_monitor_timeout was long enough (say, 120 seconds) and the
> status call did not complete by then, the start operation would time
> out without receiving an action result. This was fixed in pull
> 2108[1].
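
To make the mismatch concrete, here is a minimal Python sketch with made-up durations (our CIB does not set pcmk_monitor_timeout, so this is purely illustrative of the pre-fix behaviour described above):

    # Illustrative numbers only; models the pre-PR-2108 behaviour described above.
    start_timeout = 45            # s, the action timeout the start itself is held to
    pcmk_monitor_timeout = 120    # s, used (before the fix) by the status call inside the start
    status_duration = 90          # s, hypothetical runtime of that status call

    # The status call is still within its own, longer budget...
    assert status_duration <= pcmk_monitor_timeout
    # ...but the start wrapping it has already exceeded its shorter timeout,
    # so the controller times out without ever receiving an action result.
    assert status_duration > start_timeout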
> 
> Also, note that the error message said, "Node h18 did not send start
> result (via controller) within 45000ms (action timeout plus
> cluster-delay)". The start operation began at 09:29:29 and ended at
> 09:31:14 (1min45sec). That's because the cluster-delay is 60s by
> default and there was a logging bug[2] at one point: the time value in
> the log message did not add the cluster-delay.
> 
> [1] https://github.com/ClusterLabs/pacemaker/pull/2108 
> [2] https://github.com/ClusterLabs/pacemaker/commit/b542a8f 
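
For the record, the arithmetic in Python (values from the log message and Reid's explanation above):

    action_timeout = 45     # s, the "45000ms" printed in the error message
    cluster_delay = 60      # s, Pacemaker's default cluster-delay
    total_wait = action_timeout + cluster_delay
    print(total_wait)       # 105 s = 1 min 45 s, i.e. 09:29:29 -> 09:31:14
    # Before commit b542a8f the message printed only the 45 s action timeout,
    # not the 105 s sum the controller actually waits.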

Thanks for explaining!

Regards,
Ulrich

> 
> On Fri, Dec 18, 2020 at 8:15 AM Ken Gaillot <kgaillot at redhat.com> wrote:
>>
>> On Fri, 2020-12-18 at 13:32 +0100, Ulrich Windl wrote:
>> > > > > Andrei Borzenkov <arvidjaar at gmail.com> wrote on 18.12.2020 at
>> > > > > 12:17 in
>> >
>> > message <b82fc4d8-689c-4357-8f22-adc957fa698d at gmail.com>:
>> > > On 18.12.2020 12:00, Ulrich Windl wrote:
>> > > >
>> > > > Maybe a related question: Do STONITH resources have special
>> > > > rules, meaning
>> > > they don't wait for successful fencing?
>> > >
>> > > Pacemaker resources in the CIB do not perform fencing. They only
>> > > register
>> > > fencing devices with fenced, which does the actual job. In particular
>> > > ...
>> > >
>> > > > I saw this between fencing being initiated and fencing being
>> > > > confirmed (h16
>> > > was DC, now h18 became DC):
>> > > >
>> > > > Dec 18 09:29:29 h18 pacemaker-controld[4479]:  notice: Processing
>> > > > graph 0
>> > >
>> > > (ref=pe_calc-dc-1608280169-21) derived from
>> > > /var/lib/pacemaker/pengine/pe-warn-9.bz2
>> > > > Dec 18 09:29:29 h18 pacemaker-controld[4479]:  notice: Requesting
>> > > > fencing
>> > >
>> > > (reboot) of node h16
>> > > > Dec 18 09:29:29 h18 pacemaker-controld[4479]:  notice: Initiating
>> > > > start
>> > >
>> > > operation prm_stonith_sbd_start_0 locally on h18
>> > >
>> > > ... "start" operation on pacemaker stonith resource only registers
>> > > this
>> > > device with fenced. It does *not* initiate stonith operation.
>> >
>> > Hi!
>> >
>> > Thanks, it's quite confusing: "notice: Initiating start operation"
>> > sounds like
>> > something is to be started right now; if it's just scheduled,
>> > "notice: Queueing
>> > start operation" or "notice: Planning start operation" would be a
>> > better phrase
>> > IMHO.
>>
>> I think what Andrei is suggesting is that the start is not a fencing
>> (reboot/off) action, just a registration of the device with the fencer.
>> The start (registration) itself does happen at this point. Besides
>> registration, it also involves a status call to the device.
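
If it helps, a toy Python sketch of the sequence Ken describes; the function names are invented for illustration and are not Pacemaker's internal API:

    def register_with_fencer(device):
        # Registration step: the device is made known to the fencer.
        print(f"registering {device} with the fencer")

    def run_status_check(device):
        # Status call against the device; this is the part that can be slow.
        print(f"running a status call against {device}")

    def start_stonith_resource(device):
        register_with_fencer(device)   # no fencing happens here
        run_status_check(device)
        return "start result"          # only now does the controller get a result

    start_stonith_resource("prm_stonith_sbd")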
>>
>> > >
>> > > > ...
>> > > > Dec 18 09:31:14 h18 pacemaker-controld[4479]:  error: Node h18
>> > > > did not send
>> > > start result (via controller) within 45000ms (action timeout plus
>> > > cluster-delay)
>> > >
>> > > I am not sure what happens here. Somehow fenced took a very long time
>> > > to
>> > > respond, or something went wrong with the communication between them.
>> >
>> > This looks new in the current pacemaker. As explained in an earlier
>> > message we
>> > use a rather long fencing/stonith timeout, so the confirmation may
>> > arrive
>> > rather late (but still before the node comes back online). I didn't
>> > see this in
>> > comparable configurations using older pacemaker.
>>
>> Only the message is new ... previously it was something inscrutable
>> like "Timer popped".
>>
>> Actions involve multiple steps across multiple daemons, so there are
>> multiple places where something could time out. In this case, the
>> controller didn't get a result from the fencer in time.
>>
>> Normally, this should never happen, because if the action itself times
>> out, the fencer will report that.
>>
>> > > > Dec 18 09:31:14 h18 pacemaker-controld[4479]:  error:
>> > > > [Action   22]:
>> > >
>> > > In-flight resource op prm_stonith_sbd_start_0      on h18
>> > > (priority: 9900,
>> > > waiting: (null))
>> > > > Dec 18 09:31:14 h18 pacemaker-controld[4479]:  notice: Transition
>> > > > 0
>> >
>> > aborted:
>> > > Action lost
>> > > > Dec 18 09:31:14 h18 pacemaker-controld[4479]:  warning: rsc_op
>> > > > 22:
>> > >
>> > > prm_stonith_sbd_start_0 on h18 timed out
>>
>> The above is just the consequence of not getting the start result --
>> the controller considers the action timed out and aborts the current
>> transition.
>>
>> > > > ...
>> > > > Dec 18 09:31:15 h18 pacemaker-controld[4479]:  notice: Peer h16
>> > > > was
>> > >
>> > > terminated (reboot) by h18 on behalf of pacemaker-controld.4527: OK
>>
>> Here the fencing action returns success.
>>
>> > > > Dec 18 09:31:17 h18 pacemaker-execd[4476]:  notice:
>> > > > prm_stonith_sbd start
>> > >
>> > > (call 164) exited with status 0 (execution time 110960ms, queue
>> > > time
>> >
>> > 15001ms)
>>
>> The device registration already in progress does eventually complete
>> successfully. The controller has already considered it lost and moved
>> on, so this result will be ignored.
>>
>> Looking at the queue time, I'm going to guess that what happened was
>> that the status action that's part of the start was queued behind the
>> reboot action in progress, and the queue time was enough to push it
>> over the total expected timeout.
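
Putting the executor's numbers next to the controller's budget (Python; values from the log above, though the individual figures do not line up perfectly, as Reid notes just below):

    queue_time = 15.001          # s, from the pacemaker-execd line
    execution_time = 110.960     # s, from the same line
    controller_budget = 45 + 60  # s, action timeout plus cluster-delay

    print(queue_time + execution_time)                      # ~126 s
    print(queue_time + execution_time > controller_budget)  # True
    # The start blows past the controller's 105 s budget (09:29:29 -> 09:31:14)
    # and is declared lost; execd's success report at 09:31:17 is then ignored.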
> 
> Looks plausible. Though it's strange to me that the queue time is only
> 15 seconds, that the reboot operation completes immediately after the
> start operation fails, and that the executor then says the start
> completed successfully after 110s of execution time :/
> 
>> This is a known issue when fence actions are serialized. The fencer
>> starts its timeout once the action is begun (i.e. after queueing), but
>> the controller doesn't know about the queueing and starts its timeout
>> once it has submitted the request to the fencer (i.e. before queueing).
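
A rough Python sketch of the two clocks Ken describes, with hypothetical numbers (not this cluster's) just to show how queueing skews them:

    def controller_deadline(t_submit, action_timeout, cluster_delay):
        # The controller starts counting when it submits the request,
        # i.e. before any queueing inside the fencer.
        return t_submit + action_timeout + cluster_delay

    def fencer_deadline(t_begin, action_timeout):
        # The fencer starts counting only once the queued action begins.
        return t_begin + action_timeout

    # Hypothetical: the action is queued for 90 s behind an in-progress reboot.
    print(controller_deadline(0, action_timeout=45, cluster_delay=60))  # 105
    print(fencer_deadline(90, action_timeout=45))                       # 135
    # The controller declares the action lost at t=105 although the fencer,
    # whose clock only started at t=90, still has 30 s of budget left.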
>>
>> Red Hat BZ#1858462 is related, but there's no ClusterLabs BZ at this
>> point. It's a tricky problem, so it will be a significant project.
>>
>> > > It could be related to pending fencing, but I am not familiar with
>> > > the low-level details.
>> >
>> > It looks odd: first "started", then timed out with an error, then
>> > successful
>> > (without being rescheduled, it seems).
>> >
>> > >
>> > > > ...
>> > > > Dec 18 09:31:30 h18 pacemaker-controld[4479]:  notice: Peer h16
>> > > > was
>> > >
>> > > terminated (reboot) by h19 on behalf of pacemaker-controld.4479: OK
>> > > > Dec 18 09:31:30 h18 pacemaker-controld[4479]:  notice: Transition
>> > > > 0
>> > >
>> > > (Complete=31, Pending=0, Fired=0, Skipped=1, Incomplete=3,
>> > > Source=/var/lib/pacemaker/pengine/pe-warn-9.bz2): Stopped
>> >
>> > So here's the delayed stonith confirmation.
>> >
>> > > > ...
>> > > > Dec 18 09:31:30 h18 pacemaker-schedulerd[4478]:  warning:
>> > > > Unexpected result
>> > > (error) was recorded for start of prm_stonith_sbd on h18 at Dec 18
>> > > 09:31:14
>> > > 2020
>> > > > Dec 18 09:31:30 h18 pacemaker-schedulerd[4478]:  notice:  *
>> > > > Recover
>> > >
>> > > prm_stonith_sbd                      (             h18 )
>> >
>> > Then, after the successful start, another "recovery". Isn't that very odd?
>>
>> The first start was considered timed out, even though it eventually
>> completed successfully, so the device had to be recovered (stop+start)
>> due to the start timeout.
>>
>> >
>> > Regards,
>> > Ulrich
>> --
>> Ken Gaillot <kgaillot at redhat.com>
>>
> 
> 
> 
> -- 
> Regards,
> 
> Reid Wahl, RHCA
> Senior Software Maintenance Engineer, Red Hat
> CEE - Platform Support Delivery - ClusterHA
> 




