[ClusterLabs] Antw: [EXT] Re: Q: warning: new_event_notification (4527-22416-14): Broken pipe (32)
Ulrich Windl
Ulrich.Windl at rz.uni-regensburg.de
Fri Dec 18 07:32:01 EST 2020
>>> Andrei Borzenkov <arvidjaar at gmail.com> wrote on 18.12.2020 at 12:17 in
message <b82fc4d8-689c-4357-8f22-adc957fa698d at gmail.com>:
> 18.12.2020 12:00, Ulrich Windl wrote:
>>
>> Maybe a related question: Do STONITH resources have special rules, meaning
> they don't wait for successful fencing?
>
> pacemaker resources in CIB do not perform fencing. They only register
> fencing devices with fenced which does actual job. In particular ...
>
>> I saw this between fencing being initiated and fencing being confirmed (h16
> was DC, now h18 became DC):
>>
>> Dec 18 09:29:29 h18 pacemaker-controld[4479]: notice: Processing graph 0
> (ref=pe_calc-dc-1608280169-21) derived from
> /var/lib/pacemaker/pengine/pe-warn-9.bz2
>> Dec 18 09:29:29 h18 pacemaker-controld[4479]: notice: Requesting fencing
> (reboot) of node h16
>> Dec 18 09:29:29 h18 pacemaker-controld[4479]: notice: Initiating start
> operation prm_stonith_sbd_start_0 locally on h18
>
> ... "start" operation on pacemaker stonith resource only registers this
> device with fenced. It does *not* initiate stonith operation.
Hi!
Thanks, it's quite confusing: "notice: Initiating start operation" sounds like
something is to be started right now; if it's just scheduled, "notice: Queueing
start operation" or "notice: Planning start operation" would be a better phrase
IMHO.
>
>> ...
>> Dec 18 09:31:14 h18 pacemaker-controld[4479]: error: Node h18 did not send
> start result (via controller) within 45000ms (action timeout plus
> cluster-delay)
>
> I am not sure what happens here. Somehow fenced took very long time to
> respond or something with communication between them.
This looks new in the current pacemaker. As explained in an earlier message, we
use a rather long fencing/stonith timeout, so the confirmation may arrive
rather late (but still before the node comes online again). I didn't see this in
comparable configurations using older pacemaker versions.
>
>> Dec 18 09:31:14 h18 pacemaker-controld[4479]: error: [Action 22]:
> In-flight resource op prm_stonith_sbd_start_0 on h18 (priority: 9900,
> waiting: (null))
>> Dec 18 09:31:14 h18 pacemaker-controld[4479]: notice: Transition 0
> aborted: Action lost
>> Dec 18 09:31:14 h18 pacemaker-controld[4479]: warning: rsc_op 22:
> prm_stonith_sbd_start_0 on h18 timed out
>> ...
>> Dec 18 09:31:15 h18 pacemaker-controld[4479]: notice: Peer h16 was
> terminated (reboot) by h18 on behalf of pacemaker-controld.4527: OK
>> Dec 18 09:31:17 h18 pacemaker-execd[4476]: notice: prm_stonith_sbd start
> (call 164) exited with status 0 (execution time 110960ms, queue time
> 15001ms)
>
> It could be related to pending fencing but I am not familiar with low
> level details.
It looks odd: first "started", then timed out with an error, then successful
(apparently without being rescheduled).
>
>> ...
>> Dec 18 09:31:30 h18 pacemaker-controld[4479]: notice: Peer h16 was
> terminated (reboot) by h19 on behalf of pacemaker-controld.4479: OK
>> Dec 18 09:31:30 h18 pacemaker-controld[4479]: notice: Transition 0
> (Complete=31, Pending=0, Fired=0, Skipped=1, Incomplete=3,
> Source=/var/lib/pacemaker/pengine/pe-warn-9.bz2): Stopped
So here's the delayed stonith confirmation.
>> ...
>> Dec 18 09:31:30 h18 pacemaker-schedulerd[4478]: warning: Unexpected result
> (error) was recorded for start of prm_stonith_sbd on h18 at Dec 18 09:31:14
> 2020
>> Dec 18 09:31:30 h18 pacemaker-schedulerd[4478]: notice: * Recover
> prm_stonith_sbd ( h18 )
Then, after a successful start, another "recovery". Isn't that very odd?
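To make the ordering easier to follow, here is a minimal sketch (Python) that
tabulates the elapsed time between the log lines quoted above. The timestamps
are copied from those lines; the event labels are informal shorthand, not
pacemaker terminology, and the helper name elapsed_seconds is made up for this
example.

```python
from datetime import datetime

# Timestamps copied from the quoted log lines (all on Dec 18, host h18).
# The labels are informal shorthand, not pacemaker terminology.
EVENTS = [
    ("09:29:29", "fencing of h16 requested; start of prm_stonith_sbd initiated"),
    ("09:31:14", "controller: no start result received, op timed out"),
    ("09:31:15", "h16 terminated (reboot) by h18: OK"),
    ("09:31:17", "execd: prm_stonith_sbd start exited with status 0"),
    ("09:31:30", "h16 terminated (reboot) by h19: OK; Recover scheduled"),
]


def elapsed_seconds(events):
    """Return (seconds since the first event, label) for each event."""
    fmt = "%H:%M:%S"
    t0 = datetime.strptime(events[0][0], fmt)
    return [
        (int((datetime.strptime(ts, fmt) - t0).total_seconds()), label)
        for ts, label in events
    ]


for secs, label in elapsed_seconds(EVENTS):
    print(f"+{secs:4d}s  {label}")
```

With these timestamps, the timeout error is logged 105 s after the start was
initiated, and the successful start result follows 3 s after the error.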
Regards,
Ulrich
>> ...
>>
>> Regards,
>> Ulrich
>>
>>
>>
>> _______________________________________________
>> Manage your subscription:
>> https://lists.clusterlabs.org/mailman/listinfo/users
>>
>> ClusterLabs home: https://www.clusterlabs.org/
>>