[ClusterLabs] Antw: [EXT] Re: Odd result from ping RA

Fri Feb 25 14:15:41 EST 2022

On Fri, Feb 25, 2022 at 4:31 AM Ulrich Windl
<Ulrich.Windl at rz.uni-regensburg.de> wrote:
>
> >>> Reid Wahl <nwahl at redhat.com> wrote on 25.02.2022 at 12:31 in message
> <CAPiuu99iaCxK4jn9_aFM+Wb98Dz7fPVVTZe-7sfLoryUXRm4Nw at mail.gmail.com>:
> > On Thu, Feb 24, 2022 at 2:28 AM Ulrich Windl
> > <Ulrich.Windl at rz.uni-regensburg.de> wrote:
> >>
> >> Hi!
> >>
> >> I just discovered this oddity for a SLES15 SP3 cluster:
> >> Feb 24 11:16:17 h16 pacemaker-attrd[7274]:  notice: Setting val_net_gw1[h18]: 1000 -> 139000
> >>
> >> That surprised me, because usually the value is 1000 or 0.
> >>
> >> Digging a bit further I found:
> >> Migration Summary:
> >>   * Node: h18:
> >>     * prm_ping_gw1: migration-threshold=1000000 fail-count=1 last-failure='Thu Feb 24 11:17:18 2022'
> >>
> >> Failed Resource Actions:
> >>   * prm_ping_gw1_monitor_60000 on h18 'error' (1): call=200, status='Error', exitreason='', last-rc-change='2022-02-24 11:17:18 +01:00', queued=0ms, exec=0ms
> >>
> >> Digging further:
> >> Feb 24 11:16:17 h18 kernel: BUG: Bad rss-counter state mm:00000000c620b5fe idx:1 val:17
> >> Feb 24 11:16:17 h18 pacemaker-attrd[6946]:  notice: Setting val_net_gw1[h18]: 1000 -> 139000
> >> Feb 24 11:17:17 h18 kernel: traps: pacemaker-execd[38950] general protection fault ip:7f610e71cbcf sp:7ffff7c25100 error:0 in libc-2.31.so[7f610e63b000+1e6000]
> >>
> >> (That rss-counter bug, which causes a series of core dumps, seems to be a
> >> new "feature" of SLES15 SP3 kernels; support is investigating it.)
> >>
> >> Somewhat later:
> >> Feb 24 11:17:18 h18 pacemaker-attrd[6946]:  notice: Setting val_net_gw1[h18]: 139000 -> (unset)
> >> (restarted RA)
> >> Feb 24 11:17:21 h18 pacemaker-attrd[6946]:  notice: Setting val_net_gw1[h18]: (unset) -> 1000
> >>
> >> Another node:
> >> Feb 24 11:16:17 h19 pacemaker-attrd[7435]:  notice: Setting val_net_gw1[h18]: 1000 -> 139000
> >> Feb 24 11:17:18 h19 pacemaker-attrd[7435]:  notice: Setting val_net_gw1[h18]: 139000 -> (unset)
> >> Feb 24 11:17:21 h19 pacemaker-attrd[7435]:  notice: Setting val_net_gw1[h18]: (unset) -> 1000
> >>
> >> So it seems the ping RA sets some garbage value when failing. Is that
> >> correct?
> >
> > This is ocf:pacemaker:ping, right? And is use_fping enabled?
>
> Correct. use_fping is not set (default value). I found no fping on the host.
>
>
> >
> > Looks like it uses ($active * $multiplier); see ping_update(). I'm
> > assuming your multiplier is 1000.
>
> Correct: multiplier=1000, and host_list has just one address.
>
> >
> > $active is set by either fping_check() or ping_check(), depending on
> > your configuration. You can see what they're doing here. I'd assume
> > $active is getting set to 139 and then is multiplied by 1000 to set
> > $score later.
>
> But wouldn't that mean 139 hosts were pinged successfully?
> (${HA_BIN}/pingd is being used)

Yeah, that seems to be the intent. Hence my saying "It could also be a
side effect of the fault though, since I don't see anything in
fping_check() or ping_check() that's an obvious candidate for setting
active=139 unless you have a massive host list."
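
To make the arithmetic concrete, the following is a minimal Python sketch of
the score computation described above. The real agent is a shell script; the
function names ping_score and implied_active are hypothetical and are not
taken from the agent. It assumes multiplier=1000, as stated earlier in the
thread, and shows that a score of 139000 implies active=139. The last line
records, purely as an observation and not as a diagnosis, that 139 also equals
128 + 11, the exit status a POSIX shell reports for a child process killed by
SIGSEGV.

```python
import signal


def ping_score(active: int, multiplier: int) -> int:
    """Score as described in the thread: reachable-host count times multiplier."""
    return active * multiplier


def implied_active(score: int, multiplier: int) -> int:
    """Invert the score to recover the host count it would imply."""
    if multiplier == 0 or score % multiplier != 0:
        raise ValueError("score is not a whole multiple of multiplier")
    return score // multiplier


multiplier = 1000  # value reported in the thread

print(ping_score(1, multiplier))           # one host reachable: 1000
print(ping_score(0, multiplier))           # no host reachable: 0
print(implied_active(139000, multiplier))  # 139, far more than one host
print(128 + signal.SIGSEGV)                # 139 (SIGSEGV is signal 11)
```

With a single address in host_list, only 0 and 1000 are expected scores, so
139000 cannot come from a genuine count of reachable hosts.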

>
> >   - https://github.com/ClusterLabs/pacemaker/blob/Pacemaker-2.0.5/extra/resources/ping#L220-L277
>
>
> Regards,
> Ulrich
>
> >>
> >> resource-agents-4.8.0+git30.d0077df0-150300.8.20.1.x86_64
> >> pacemaker-2.0.5+20201202.ba59be712-150300.4.16.1.x86_64
> >>
> >> Regards,
> >> Ulrich
> >>
> >>
> >> _______________________________________________
> >> Manage your subscription:
> >> https://lists.clusterlabs.org/mailman/listinfo/users
> >>
> >> ClusterLabs home: https://www.clusterlabs.org/
> >>
> >
> >
> > --
> > Regards,
> >
> > Reid Wahl (He/Him), RHCA
> > Senior Software Maintenance Engineer, Red Hat
> > CEE - Platform Support Delivery - ClusterHA

-- 
Regards,

Reid Wahl (He/Him), RHCA
Senior Software Maintenance Engineer, Red Hat
CEE - Platform Support Delivery - ClusterHA