[ClusterLabs] Antw: [EXT] Re: Odd result from ping RA
Ulrich Windl
Ulrich.Windl at rz.uni-regensburg.de
Fri Feb 25 07:30:37 EST 2022
>>> Reid Wahl <nwahl at redhat.com> wrote on 25.02.2022 at 12:31 in message
<CAPiuu99iaCxK4jn9_aFM+Wb98Dz7fPVVTZe-7sfLoryUXRm4Nw at mail.gmail.com>:
> On Thu, Feb 24, 2022 at 2:28 AM Ulrich Windl
> <Ulrich.Windl at rz.uni-regensburg.de> wrote:
>>
>> Hi!
>>
>> I just discovered this oddity for a SLES15 SP3 cluster:
>> Feb 24 11:16:17 h16 pacemaker-attrd[7274]: notice: Setting val_net_gw1[h18]: 1000 -> 139000
>>
>> That surprised me, because usually the value is 1000 or 0.
>>
>> Digging a bit further, I found:
>> Migration Summary:
>> * Node: h18:
>> * prm_ping_gw1: migration-threshold=1000000 fail-count=1 last-failure='Thu Feb 24 11:17:18 2022'
>>
>> Failed Resource Actions:
>> * prm_ping_gw1_monitor_60000 on h18 'error' (1): call=200, status='Error', exitreason='', last-rc-change='2022-02-24 11:17:18 +01:00', queued=0ms, exec=0ms
>>
>> Digging further:
>> Feb 24 11:16:17 h18 kernel: BUG: Bad rss-counter state mm:00000000c620b5fe idx:1 val:17
>> Feb 24 11:16:17 h18 pacemaker-attrd[6946]: notice: Setting val_net_gw1[h18]: 1000 -> 139000
>> Feb 24 11:17:17 h18 kernel: traps: pacemaker-execd[38950] general protection fault ip:7f610e71cbcf sp:7ffff7c25100 error:0 in libc-2.31.so[7f610e63b000+1e6000]
>>
>> (The rss-counter bug, which causes a series of core dumps, seems to be a new "feature" of SLES15 SP3 kernels; support is investigating it.)
>>
>> Somewhat later:
>> Feb 24 11:17:18 h18 pacemaker-attrd[6946]: notice: Setting val_net_gw1[h18]: 139000 -> (unset)
>> (restarted RA)
>> Feb 24 11:17:21 h18 pacemaker-attrd[6946]: notice: Setting val_net_gw1[h18]: (unset) -> 1000
>>
>> Another node:
>> Feb 24 11:16:17 h19 pacemaker-attrd[7435]: notice: Setting val_net_gw1[h18]: 1000 -> 139000
>> Feb 24 11:17:18 h19 pacemaker-attrd[7435]: notice: Setting val_net_gw1[h18]: 139000 -> (unset)
>> Feb 24 11:17:21 h19 pacemaker-attrd[7435]: notice: Setting val_net_gw1[h18]: (unset) -> 1000
>>
>> So it seems the ping RA sets some garbage value when failing. Is that correct?
>
> This is ocf:pacemaker:ping, right? And is use_fping enabled?
Correct. use_fping is not set (default value). I found no fping on the host.
>
> Looks like it uses ($active * $multiplier) -- see ping_update(). I'm
> assuming your multiplier is 1000.
Correct: multiplier=1000, and host_list has just one address.
>
> $active is set by either fping_check() or ping_check(), depending on
> your configuration. You can see what they're doing here. I'd assume
> $active is getting set to 139 and then is multiplied by 1000 to set
> $score later.
But wouldn't that mean 139 hosts were pinged successfully?
(${HA_BIN}/pingd is being used)
> - https://github.com/ClusterLabs/pacemaker/blob/Pacemaker-2.0.5/extra/resources/ping#L220-L277
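To make the arithmetic concrete, here is a minimal sketch. It assumes the score really is just $active * $multiplier, as described above. The function names are hypothetical and are not taken from the ping RA; the sketch only checks whether an observed attribute value is consistent with the configured number of hosts.

```python
def ping_score(active: int, multiplier: int) -> int:
    """Score as described in the thread: reachable-host count times multiplier."""
    return active * multiplier


def implied_active(score: int, multiplier: int) -> int:
    """Invert the formula: which $active would produce this score?"""
    if multiplier == 0 or score % multiplier != 0:
        raise ValueError("score is not a whole multiple of the multiplier")
    return score // multiplier


# Values from this thread: multiplier=1000, one address in host_list.
multiplier = 1000
host_count = 1

# With one host, $active can only be 0 or 1, so only two scores are expected.
expected_scores = {ping_score(a, multiplier) for a in range(host_count + 1)}

# The value logged by pacemaker-attrd at 11:16:17.
observed = 139000
active = implied_active(observed, multiplier)

# An $active larger than the number of configured hosts cannot be a host count.
plausible = 0 <= active <= host_count

print(sorted(expected_scores))  # [0, 1000]
print(active, plausible)        # 139 False
```

So 139000 corresponds to $active = 139, which cannot be a count of reachable hosts when host_list has a single entry.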
Regards,
Ulrich
>>
>> resource-agents-4.8.0+git30.d0077df0-150300.8.20.1.x86_64
>> pacemaker-2.0.5+20201202.ba59be712-150300.4.16.1.x86_64
>>
>> Regards,
>> Ulrich
>>
>>
>> _______________________________________________
>> Manage your subscription:
>> https://lists.clusterlabs.org/mailman/listinfo/users
>>
>> ClusterLabs home: https://www.clusterlabs.org/
>>
>
>
> --
> Regards,
>
> Reid Wahl (He/Him), RHCA
> Senior Software Maintenance Engineer, Red Hat
> CEE - Platform Support Delivery - ClusterHA