[ClusterLabs] Antw: [EXT] Re: Odd result from ping RA

Fri Feb 25 07:30:37 EST 2022

>>> Reid Wahl <nwahl at redhat.com> wrote on 25.02.2022 at 12:31 in message
<CAPiuu99iaCxK4jn9_aFM+Wb98Dz7fPVVTZe-7sfLoryUXRm4Nw at mail.gmail.com>:
> On Thu, Feb 24, 2022 at 2:28 AM Ulrich Windl
> <Ulrich.Windl at rz.uni-regensburg.de> wrote:
>>
>> Hi!
>>
>> I just discovered this oddity for a SLES15 SP3 cluster:
>> Feb 24 11:16:17 h16 pacemaker-attrd[7274]:  notice: Setting val_net_gw1[h18]: 1000 -> 139000
>>
>> That surprised me, because usually the value is 1000 or 0.
>>
>> Digging a bit further I found:
>> Migration Summary:
>>   * Node: h18:
>>     * prm_ping_gw1: migration-threshold=1000000 fail-count=1 last-failure='Thu Feb 24 11:17:18 2022'
>>
>> Failed Resource Actions:
>>   * prm_ping_gw1_monitor_60000 on h18 'error' (1): call=200, status='Error', exitreason='', last-rc-change='2022-02-24 11:17:18 +01:00', queued=0ms, exec=0ms
>>
>> Digging further:
>> Feb 24 11:16:17 h18 kernel: BUG: Bad rss-counter state mm:00000000c620b5fe idx:1 val:17
>> Feb 24 11:16:17 h18 pacemaker-attrd[6946]:  notice: Setting val_net_gw1[h18]: 1000 -> 139000
>> Feb 24 11:17:17 h18 kernel: traps: pacemaker-execd[38950] general protection fault ip:7f610e71cbcf sp:7ffff7c25100 error:0 in libc-2.31.so[7f610e63b000+1e6000]
>>
>> (That rss-counter bug, which causes a series of core dumps, seems to be a
>> new "feature" of SLES15 SP3 kernels; support is investigating it.)
>>
>> Somewhat later:
>> Feb 24 11:17:18 h18 pacemaker-attrd[6946]:  notice: Setting val_net_gw1[h18]: 139000 -> (unset)
>> (restarted RA)
>> Feb 24 11:17:21 h18 pacemaker-attrd[6946]:  notice: Setting val_net_gw1[h18]: (unset) -> 1000
>>
>> Another node:
>> Feb 24 11:16:17 h19 pacemaker-attrd[7435]:  notice: Setting val_net_gw1[h18]: 1000 -> 139000
>> Feb 24 11:17:18 h19 pacemaker-attrd[7435]:  notice: Setting val_net_gw1[h18]: 139000 -> (unset)
>> Feb 24 11:17:21 h19 pacemaker-attrd[7435]:  notice: Setting val_net_gw1[h18]: (unset) -> 1000
>>
>> So it seems the ping RA sets some garbage value when failing. Is that correct?
> 
> This is ocf:pacemaker:ping, right? And is use_fping enabled?

Correct. use_fping is not set (default value). I found no fping on the host.

> 
> Looks like it uses ($active * $multiplier) ‑‑ see ping_update(). I'm
> assuming your multiplier is 1000.

Correct: multiplier=1000, and host_list has just one address.

> 
> $active is set by either fping_check() or ping_check(), depending on
> your configuration. You can see what they're doing here. I'd assume
> $active is getting set to 139 and then is multiplied by 1000 to set
> $score later.

But wouldn't that mean 139 hosts were pinged successfully?
(${HA_BIN}/pingd is being used)
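
To make the arithmetic behind that question concrete, here is a minimal Python sketch of the relationship described above (score = active * multiplier). It is not the RA's code, which is a shell script; the function names ping_score, implied_active and is_plausible are hypothetical and exist only for this illustration. The values (multiplier=1000, one address in host_list, observed score 139000) are taken from this thread.

```python
def ping_score(active: int, multiplier: int) -> int:
    """Score as described in the thread: reachable hosts times multiplier."""
    return active * multiplier


def implied_active(score: int, multiplier: int) -> int:
    """Invert the score to recover the 'active' count it implies."""
    if multiplier <= 0:
        raise ValueError("multiplier must be positive")
    active, remainder = divmod(score, multiplier)
    if remainder:
        raise ValueError(f"{score} is not a multiple of {multiplier}")
    return active


def is_plausible(score: int, multiplier: int, host_count: int) -> bool:
    """A score is plausible only if it implies 0..host_count reachable hosts."""
    return 0 <= implied_active(score, multiplier) <= host_count


if __name__ == "__main__":
    MULTIPLIER = 1000   # from the thread
    HOST_COUNT = 1      # host_list has just one address
    for score in (0, 1000, 139000):
        print(score,
              implied_active(score, MULTIPLIER),
              is_plausible(score, MULTIPLIER, HOST_COUNT))
```

With a single host, only 0 and 1000 pass the check. A score of 139000 implies an active count of 139, which cannot come from counting successful pings against one address, so the value must have entered $active some other way.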

>   - https://github.com/ClusterLabs/pacemaker/blob/Pacemaker-2.0.5/extra/resources/ping#L220-L277

Regards,
Ulrich

>>
>> resource-agents-4.8.0+git30.d0077df0-150300.8.20.1.x86_64
>> pacemaker-2.0.5+20201202.ba59be712-150300.4.16.1.x86_64
>>
>> Regards,
>> Ulrich
>>
>>
>> _______________________________________________
>> Manage your subscription:
>> https://lists.clusterlabs.org/mailman/listinfo/users 
>>
>> ClusterLabs home: https://www.clusterlabs.org/ 
>>
> 
> 
> ‑‑ 
> Regards,
> 
> Reid Wahl (He/Him), RHCA
> Senior Software Maintenance Engineer, Red Hat
> CEE ‑ Platform Support Delivery ‑ ClusterHA
> 