[ClusterLabs] Antw: Re: Antw: [EXT] Another odd message: pacemaker-fenced[31326]: warning: Can't create a sane reply

Thu Feb 11 13:13:06 EST 2021

On Thu, 2021-02-11 at 13:59 +0100, Ulrich Windl wrote:
> Hi!
> 
> After that problem I see this in crm_mon output:
> Failed Fencing Actions:
>   * reboot of h16 failed: delegate=h18, client=pacemaker-controld.9087,
>     origin=h18, last-failed='2021-02-09 14:50:18 +01:00'
>
> Is there a way to clean that up?

stonith_admin --cleanup -H h16 (or '*')

or the equivalent in a higher-level tool
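
As a rough sketch of how that could be wrapped in a script, assuming the
node name h16 from the output above: the run() wrapper and the NODE and
DRY_RUN variables are illustrative helpers only, not part of Pacemaker.
With DRY_RUN=1 (the default here) the commands are printed rather than
executed.

```shell
#!/bin/sh
# Illustrative sketch: inspect, then clear, the fencing history of one node.
# NODE, DRY_RUN and run() are hypothetical helpers, not Pacemaker features.
NODE="${NODE:-h16}"        # node whose history is shown/cleared ('*' = all)
DRY_RUN="${DRY_RUN:-1}"    # 1 = only print the commands, 0 = execute them

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Show what is currently recorded for the node.
run stonith_admin --history "$NODE"
# Remove the recorded entries for the node.
run stonith_admin --cleanup --history "$NODE"
```

Setting DRY_RUN=0 on a cluster node would actually run the two commands;
NODE='*' would cover all nodes.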

> BTW: h16 had been booted today and still this message is there.

Yes, that's a feature. :) As long as at least one node remains up, the
nodes sync their fencing history with each other. That ensures the view
is the same regardless of which node you run the command on.

> Regards,
> Ulrich
> 
> > > > Ulrich Windl wrote on 09.02.2021 at 16:32 in message
> > > > <6022AB1C.645 : 161 : 60728>:
> > > > > Klaus Wenninger <kwenning at redhat.com> wrote on 09.02.2021 at
> > > > > 16:12 in message <f828ec0d-7cc5-36b4-ba6b-9aed4b94992f at redhat.com>:
> > > On 2/9/21 3:10 PM, Ulrich Windl wrote:
> > > > > > > "Ulrich Windl" <Ulrich.Windl at rz.uni-regensburg.de> wrote on
> > > > > > > 09.02.2021 at 15:00 in message
> > > > > > > <60229563020000A10003ED82 at gwsmtp.uni-regensburg.de>:
> > > > > Hi!
> > > > > 
> > > > > I had made a mistake that led to node h16 being fenced. After
> > > > > recovery (h16 had re-joined the cluster) I stopped the node,
> > > > > reconfigured the network, then started the node again.
> > > > > Then I did the same thing (without the unwanted fencing) with
> > > > > h18. When I started the node again, I saw these unexpected
> > > > > messages:
> > > > > 
> > > > > Feb 09 14:50:18 h18 pacemaker-fenced[31326]:  warning: received
> > > > > pending action we are supposed to be the owner but it's not in our
> > > > > records -> fail it
> > > 
> > > Looks like some part of your cluster had still kept the pending fence
> > > action around when h18 was fencing h16. It could be that the node
> > > wasn't around when this was successful, or it could have to do with
> > > an issue we had recently
> > 
> > The node definitely was "around" when h16 had been fenced, so it
> > must be the other reason (lingering around).
> > 
> > > that in certain cases pending fencing actions weren't properly
> > > deleted.
> > > This part of the code got a major overhaul recently, and the code
> > > parts referred to by e.g. the assertion aren't there anymore.
> > > That we are seeing this assertion makes me think you hit the case
> > > with the lingering pending fencing actions (I think the lingering
> > > one is a relayed one that looks a bit different from a plain one and
> > > thus might trigger the assertion).
> > > 
> > > Klaus
> > > > > Feb 09 14:50:18 h18 pacemaker-fenced[31326]:  error: Operation
> > > > > 'reboot' targeting h16 on <no-one> for
> > > > > pacemaker-controld.9087 at h18.ad643f10: No route to host
> > > > > Feb 09 14:50:18 h18 pacemaker-fenced[31326]:  error:
> > > > > stonith_construct_reply: Triggered assert at
> > > > > fenced_commands.c:2363 : request != NULL
> > > > > Feb 09 14:50:18 h18 pacemaker-fenced[31326]:  warning: Can't
> > > > > create a sane reply
> > > > > Feb 09 14:50:18 h18 pacemaker-controld[31330]:  notice: Peer h16
> > > > > was not terminated (reboot) by <anyone> on behalf of
> > > > > pacemaker-controld.9087: No route to host
> > > > > 
> > > > > On the "No route to host": I could ping h16 from h18 using
> > > > > the host name
> > > > > without any problem.
> > > > > 
> > > > > Two points:
> > > > > Why would h18 think h16 should be fenced?
> > > > > The gailed asserztion looks like a programming error.
> > > > 
> > > > "failed assertion", sorry!
> > > > 
> > > > > Explanations?
> > > > > 
> > > > > Regards,
> > > > > Ulrich
> > > > > 
> > > > > 
> > > > > 
> > > > > _______________________________________________
> > > > > Manage your subscription:
> > > > > https://lists.clusterlabs.org/mailman/listinfo/users 
> > > > > 
> > > > > ClusterLabs home: https://www.clusterlabs.org/ 
-- 
Ken Gaillot <kgaillot at redhat.com>