[ClusterLabs] Antw: [EXT] Re: Q: warning: new_event_notification (4527-22416-14): Broken pipe (32)

Ulrich Windl Ulrich.Windl at rz.uni-regensburg.de
Fri Dec 18 07:32:01 EST 2020


>>> Andrei Borzenkov <arvidjaar at gmail.com> wrote on 18.12.2020 at 12:17 in
message <b82fc4d8-689c-4357-8f22-adc957fa698d at gmail.com>:
> 18.12.2020 12:00, Ulrich Windl wrote:
>> 
>> Maybe a related question: Do STONITH resources have special rules, meaning
>> they don't wait for successful fencing?
> 
> pacemaker resources in CIB do not perform fencing. They only register
> fencing devices with fenced which does actual job. In particular ...
> 
>> I saw this between fencing being initiated and fencing being confirmed (h16
>> was DC, now h18 became DC):
>> 
>> Dec 18 09:29:29 h18 pacemaker-controld[4479]:  notice: Processing graph 0 (ref=pe_calc-dc-1608280169-21) derived from /var/lib/pacemaker/pengine/pe-warn-9.bz2
>> Dec 18 09:29:29 h18 pacemaker-controld[4479]:  notice: Requesting fencing (reboot) of node h16
>> Dec 18 09:29:29 h18 pacemaker-controld[4479]:  notice: Initiating start operation prm_stonith_sbd_start_0 locally on h18
> 
> ... "start" operation on pacemaker stonith resource only registers this
> device with fenced. It does *not* initiate stonith operation.

Hi!

Thanks, it's quite confusing: "notice: Initiating start operation" sounds like
something is to be started right now; if it's just scheduled, "notice: Queueing
start operation" or "notice: Planning start operation" would be a better phrase
IMHO.

> 
>> ...
>> Dec 18 09:31:14 h18 pacemaker-controld[4479]:  error: Node h18 did not send start result (via controller) within 45000ms (action timeout plus cluster-delay)
> 
> I am not sure what happens here. Somehow fenced took very long time to
> respond or something with communication between them.

This looks new in the current pacemaker. As explained in an earlier message, we
use a rather long fencing/stonith timeout, so the confirmation may arrive
rather late (but still before the node comes online again). I didn't see this
in comparable configurations using older pacemaker.

> 
>> Dec 18 09:31:14 h18 pacemaker-controld[4479]:  error: [Action   22]: In-flight resource op prm_stonith_sbd_start_0      on h18 (priority: 9900, waiting: (null))
>> Dec 18 09:31:14 h18 pacemaker-controld[4479]:  notice: Transition 0 aborted: Action lost
>> Dec 18 09:31:14 h18 pacemaker-controld[4479]:  warning: rsc_op 22: prm_stonith_sbd_start_0 on h18 timed out
>> ...
>> Dec 18 09:31:15 h18 pacemaker-controld[4479]:  notice: Peer h16 was terminated (reboot) by h18 on behalf of pacemaker-controld.4527: OK
>> Dec 18 09:31:17 h18 pacemaker-execd[4476]:  notice: prm_stonith_sbd start (call 164) exited with status 0 (execution time 110960ms, queue time 15001ms)
> 
> It could be related to pending fencing but I am not familiar with low
> level details.

It looks odd: first "started", then timed out with an error, then successful
(apparently without being rescheduled).
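
To put numbers on that, here is a minimal Python sketch (not part of
pacemaker; the regular expression and the helper name parse_execd_result are
assumptions fitted to the single execd line quoted above) that extracts the
timings and compares them with the 45000ms the controller waited:

```python
import re

# Hypothetical pattern, tailored to the one pacemaker-execd line quoted above.
EXECD_RE = re.compile(
    r"(?P<rsc>\S+) (?P<op>\w+) \(call (?P<call>\d+)\) "
    r"exited with status (?P<rc>\d+) "
    r"\(execution time (?P<exec_ms>\d+)ms, queue time (?P<queue_ms>\d+)ms\)"
)


def parse_execd_result(line):
    """Return resource, operation, status and timings, or None if no match."""
    m = EXECD_RE.search(line)
    if m is None:
        return None
    return {
        "rsc": m.group("rsc"),
        "op": m.group("op"),
        "rc": int(m.group("rc")),
        "exec_ms": int(m.group("exec_ms")),
        "queue_ms": int(m.group("queue_ms")),
    }


# The execd line from the log above, unwrapped onto one logical line.
LINE = (
    "Dec 18 09:31:17 h18 pacemaker-execd[4476]:  notice: prm_stonith_sbd start "
    "(call 164) exited with status 0 "
    "(execution time 110960ms, queue time 15001ms)"
)

# Taken from the controld error message: "within 45000ms".
CONTROLLER_WAIT_MS = 45000

result = parse_execd_result(LINE)
# True when the reported execution time alone exceeds the controller's wait.
late = result["exec_ms"] > CONTROLLER_WAIT_MS
print(result, "exceeded controller wait:", late)
```

The sketch only restates what the two log lines already say: the start
reported by execd ran longer than the controller was willing to wait. It does
not explain why the start took that long.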

> 
>> ...
>> Dec 18 09:31:30 h18 pacemaker-controld[4479]:  notice: Peer h16 was terminated (reboot) by h19 on behalf of pacemaker-controld.4479: OK
>> Dec 18 09:31:30 h18 pacemaker-controld[4479]:  notice: Transition 0 (Complete=31, Pending=0, Fired=0, Skipped=1, Incomplete=3, Source=/var/lib/pacemaker/pengine/pe-warn-9.bz2): Stopped

So here's the delayed stonith confirmation.

>> ...
>> Dec 18 09:31:30 h18 pacemaker-schedulerd[4478]:  warning: Unexpected result (error) was recorded for start of prm_stonith_sbd on h18 at Dec 18 09:31:14 2020
>> Dec 18 09:31:30 h18 pacemaker-schedulerd[4478]:  notice:  * Recover    prm_stonith_sbd                      (             h18 )

Then, after the successful start, another "recovery". Isn't that very odd?

Regards,
Ulrich

>> ...
>> 
>> Regards,
>> Ulrich
>> 
>> 
>> 
>> _______________________________________________
>> Manage your subscription:
>> https://lists.clusterlabs.org/mailman/listinfo/users 
>> 
>> ClusterLabs home: https://www.clusterlabs.org/ 
>> 
