[ClusterLabs] Antw: Re: Antw: Re: Antw: [EXT] Another odd message: pacemaker-fenced[31326]: warning: Can't create a sane reply

Ulrich Windl Ulrich.Windl at rz.uni-regensburg.de
Fri Feb 12 02:55:47 EST 2021


>>> Ken Gaillot <kgaillot at redhat.com> wrote on 11.02.2021 at 19:13 in
message
<5ddea954b8e8a45cf73a7a169752146e27f69083.camel at redhat.com>:
> On Thu, 2021-02-11 at 13:59 +0100, Ulrich Windl wrote:
>> Hi!
>> 
>> After that problem I see this in crm_mon output:
>> Failed Fencing Actions:
>>   * reboot of h16 failed: delegate=h18, client=pacemaker-controld.9087,
>> origin=h18, last-failed='2021-02-09 14:50:18 +01:00'
>>
>> Is there a way to clean that up?
> 

Hi Ken!

> stonith_admin --cleanup -H h16 (or '*')

;-) Somehow I had expected crm_resource to be able to do that...

I can confirm that it worked!
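
For anyone scripting this, here is a minimal Python sketch that only builds the command line Ken gave above. The helper name `cleanup_command` is made up for illustration, and nothing is executed; pass the list to `subprocess.run` yourself if you really want to run it on a cluster node.

```python
import shlex


def cleanup_command(target="*"):
    """Return the argv list for clearing the fencing history of `target`.

    Mirrors the command quoted above: stonith_admin --cleanup -H <node>,
    where '*' stands for all nodes. (Hypothetical helper, not part of
    Pacemaker.)
    """
    return ["stonith_admin", "--cleanup", "-H", target]


if __name__ == "__main__":
    # shlex.join quotes '*' so a shell would not expand it as a glob.
    print(shlex.join(cleanup_command("h16")))
    print(shlex.join(cleanup_command()))
```

Building an argv list instead of a shell string avoids the quoting question around '*' entirely when the list is handed to `subprocess.run`.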

> 
> or equivalent in higher-level tool
> 
>> BTW: h16 had been booted today and still this message is there.
> 
> Yes, that's a feature. :) As long as any node remains up, they will
> sync history with each other. That ensures the view is the same
> regardless of what node you run the command on.

Yeah, but a "failed reboot" entry is somewhat obsolete once the node has
actually booted again.
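
To illustrate what I mean, a rough Python sketch that parses the crm_mon entry quoted above and compares its last-failed time with a boot time. This is not how Pacemaker handles its fencing history; the helper names and the boot time are assumptions for the example (I only know h16 was booted on 2021-02-11).

```python
import re
from datetime import datetime, timedelta, timezone

# The entry from the crm_mon output quoted above, joined onto one line.
LINE = ("* reboot of h16 failed: delegate=h18, "
        "client=pacemaker-controld.9087, origin=h18, "
        "last-failed='2021-02-09 14:50:18 +01:00'")


def parse_failed_action(line):
    """Split one 'Failed Fencing Actions' entry into a dict (hypothetical helper)."""
    m = re.match(r"\*\s+(\S+) of (\S+) failed:\s*(.*)", line.strip())
    if m is None:
        raise ValueError("not a failed fencing entry: %r" % line)
    action, target, rest = m.groups()
    entry = {"action": action, "target": target}
    # key=value pairs; values may be wrapped in single quotes
    entry.update(re.findall(r"([\w-]+)='?([^',]+)'?", rest))
    entry["last-failed"] = datetime.strptime(
        entry["last-failed"], "%Y-%m-%d %H:%M:%S %z")
    return entry


def is_stale(entry, booted_at):
    """True if the target booted after the recorded failure (hypothetical rule)."""
    return booted_at > entry["last-failed"]


if __name__ == "__main__":
    entry = parse_failed_action(LINE)
    # Assumed boot time; the exact time of day is invented for the example.
    booted = datetime(2021, 2, 11, 9, 0, tzinfo=timezone(timedelta(hours=1)))
    print(entry["target"], "stale:", is_stale(entry, booted))
```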

Regards,
Ulrich

> 
>> Regards,
>> Ulrich
>> 
>> >>> Ulrich Windl wrote on 09.02.2021 at 16:32 in message
>> <6022AB1C.645 : 161 : 60728>:
>> > >>> Klaus Wenninger <kwenning at redhat.com> wrote on 09.02.2021 at 16:12 in
>> > message <f828ec0d-7cc5-36b4-ba6b-9aed4b94992f at redhat.com>:
>> > > On 2/9/21 3:10 PM, Ulrich Windl wrote:
>> > > > >>> "Ulrich Windl" <Ulrich.Windl at rz.uni-regensburg.de> wrote on
>> > > > 09.02.2021 at 15:00 in message
>> > > > <60229563020000A10003ED82 at gwsmtp.uni-regensburg.de>:
>> > > > > Hi!
>> > > > > 
>> > > > > I had made a mistake that led to node h16 being fenced. After recovery
>> > > > > (h16 had re-joined the cluster) I stopped the node, reconfigured the
>> > > > > network, then started the node again.
>> > > > > Then I did the same thing (not the unwanted fencing) with h18. When I
>> > > > > started the node again, I saw these unexpected messages:
>> > > > > 
>> > > > > Feb 09 14:50:18 h18 pacemaker-fenced[31326]:  warning: received pending
>> > > > > action we are supposed to be the owner but it's not in our records ->
>> > > > > fail it
>> > > 
>> > > Looks like some part of your cluster had still kept the pending fence
>> > > action around when h18 was fencing h16. It may be that the node wasn't
>> > > around when this was successful, or it may have to do with an issue we
>> > > had recently
>> > 
>> > The node definitely was "around" when h16 had been fenced, so it
>> > must be the other reason (lingering around).
>> > 
>> > > that in certain cases pending fencing actions weren't properly deleted.
>> > > This part of the code got a major overhaul recently and the code parts
>> > > referred to by e.g. the assertion aren't there anymore.
>> > > That we are seeing this assertion makes me think you hit the case
>> > > with the lingering pending fencing actions (I think the lingering one is
>> > > a relayed one and looks a bit different from a plain one and thus might
>> > > trigger the assertion).
>> > > 
>> > > Klaus
>> > > > > Feb 09 14:50:18 h18 pacemaker-fenced[31326]:  error: Operation 'reboot'
>> > > > > targeting h16 on <no-one> for pacemaker-controld.9087 at h18.ad643f10:
>> > > > > No route to host
>> > > > > Feb 09 14:50:18 h18 pacemaker-fenced[31326]:  error: stonith_construct_reply:
>> > > > > Triggered assert at fenced_commands.c:2363 : request != NULL
>> > > > > Feb 09 14:50:18 h18 pacemaker-fenced[31326]:  warning: Can't create a sane
>> > > > > reply
>> > > > > Feb 09 14:50:18 h18 pacemaker-controld[31330]:  notice: Peer h16 was not
>> > > > > terminated (reboot) by <anyone> on behalf of pacemaker-controld.9087:
>> > > > > No route to host
>> > > > > 
>> > > > > On the "No route to host": I could ping h16 from h18 using
>> > > > > the host name
>> > > > > without any problem.
>> > > > > 
>> > > > > Two points:
>> > > > > Why would h18 think h16 should be fenced?
>> > > > > The gailed asserztion looks like a programming error.
>> > > > 
>> > > > "failed assertion", sorry!
>> > > > 
>> > > > > Explanations?
>> > > > > 
>> > > > > Regards,
>> > > > > Ulrich
>> > > > > 
>> > > > > 
>> > > > > 
>> > > > > _______________________________________________
>> > > > > Manage your subscription:
>> > > > > https://lists.clusterlabs.org/mailman/listinfo/users 
>> > > > > 
>> > > > > ClusterLabs home: https://www.clusterlabs.org/ 
> -- 
> Ken Gaillot <kgaillot at redhat.com>
> 




