[ClusterLabs] Odd result from ping RA

Fri Feb 25 06:34:17 EST 2022

On Fri, Feb 25, 2022 at 3:31 AM Reid Wahl <nwahl at redhat.com> wrote:
>
> On Thu, Feb 24, 2022 at 2:28 AM Ulrich Windl
> <Ulrich.Windl at rz.uni-regensburg.de> wrote:
> >
> > Hi!
> >
> > I just discovered this oddity for a SLES15 SP3 cluster:
> > Feb 24 11:16:17 h16 pacemaker-attrd[7274]:  notice: Setting val_net_gw1[h18]: 1000 -> 139000
> >
> > That surprised me, because usually the value is 1000 or 0.
> >
> > Digging a bit further I found:
> > Migration Summary:
> >   * Node: h18:
> >     * prm_ping_gw1: migration-threshold=1000000 fail-count=1 last-failure='Thu Feb 24 11:17:18 2022'
> >
> > Failed Resource Actions:
> >   * prm_ping_gw1_monitor_60000 on h18 'error' (1): call=200, status='Error', exitreason='', last-rc-change='2022-02-24 11:17:18 +01:00', queued=0ms, exec=0ms
> >
> > Digging further:
> > Feb 24 11:16:17 h18 kernel: BUG: Bad rss-counter state mm:00000000c620b5fe idx:1 val:17
> > Feb 24 11:16:17 h18 pacemaker-attrd[6946]:  notice: Setting val_net_gw1[h18]: 1000 -> 139000
> > Feb 24 11:17:17 h18 kernel: traps: pacemaker-execd[38950] general protection fault ip:7f610e71cbcf sp:7ffff7c25100 error:0 in libc-2.31.so[7f610e63b000+1e6000]
> >
> > (That rss-counter bug, which causes a series of core dumps, seems to be a new "feature" of SLES15 SP3 kernels that is being investigated by support.)
> >
> > Somewhat later:
> > Feb 24 11:17:18 h18 pacemaker-attrd[6946]:  notice: Setting val_net_gw1[h18]: 139000 -> (unset)
> > (restarted RA)
> > Feb 24 11:17:21 h18 pacemaker-attrd[6946]:  notice: Setting val_net_gw1[h18]: (unset) -> 1000
> >
> > Another node:
> > Feb 24 11:16:17 h19 pacemaker-attrd[7435]:  notice: Setting val_net_gw1[h18]: 1000 -> 139000
> > Feb 24 11:17:18 h19 pacemaker-attrd[7435]:  notice: Setting val_net_gw1[h18]: 139000 -> (unset)
> > Feb 24 11:17:21 h19 pacemaker-attrd[7435]:  notice: Setting val_net_gw1[h18]: (unset) -> 1000
> >
> > So it seems the ping RA sets some garbage value when failing. Is that correct?
>
> This is ocf:pacemaker:ping, right? And is use_fping enabled?
>
> Looks like it uses ($active * $multiplier) -- see ping_update(). I'm
> assuming your multiplier is 1000.
>
> $active is set by either fping_check() or ping_check(), depending on
> your configuration; the link below shows what they do. I'd assume
> $active is getting set to 139 and is then multiplied by 1000 to set
> $score later.
>   - https://github.com/ClusterLabs/pacemaker/blob/Pacemaker-2.0.5/extra/resources/ping#L220-L277

It could also be a side effect of the fault though, since I don't see
anything in fping_check() or ping_check() that's an obvious candidate
for setting active=139 unless you have a massive host list.
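
To make the arithmetic concrete, here is a minimal sketch of the
($active * $multiplier) relationship described above, written in Python
rather than the RA's shell. The function names are hypothetical and not
taken from the agent, and the single-host configuration (host_count=1)
is an assumption based on the value normally being 1000 or 0.

```python
def ping_score(active: int, multiplier: int) -> int:
    """Score as described in the thread: reachable-host count times multiplier."""
    return active * multiplier


def implied_active(score: int, multiplier: int) -> int:
    """Invert the formula to recover the host count that a score implies."""
    if multiplier <= 0:
        raise ValueError("multiplier must be positive")
    quotient, remainder = divmod(score, multiplier)
    if remainder:
        raise ValueError("score is not a multiple of multiplier")
    return quotient


def is_plausible(score: int, multiplier: int, host_count: int) -> bool:
    """A score is plausible only if it implies 0..host_count reachable hosts."""
    return 0 <= implied_active(score, multiplier) <= host_count


# Assumed configuration: one ping target, multiplier=1000.
print(implied_active(139000, 1000))      # host count implied by the odd value
print(is_plausible(1000, 1000, 1))       # the usual "gateway reachable" score
print(is_plausible(139000, 1000, 1))     # the value seen in the log
```

With those assumptions, 139000 would require 139 reachable hosts, which a
one-entry host list cannot produce. As an aside that nothing in this
thread confirms: 139 is also the exit status a shell reports for a child
killed by SIGSEGV (128 + 11 on Linux), which would fit the side-effect
theory, but I don't see how it would end up in $active.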
> >
> > resource-agents-4.8.0+git30.d0077df0-150300.8.20.1.x86_64
> > pacemaker-2.0.5+20201202.ba59be712-150300.4.16.1.x86_64
> >
> > Regards,
> > Ulrich


-- 
Regards,

Reid Wahl (He/Him), RHCA
Senior Software Maintenance Engineer, Red Hat
CEE - Platform Support Delivery - ClusterHA