[ClusterLabs] Odd result from ping RA
Ulrich Windl
Ulrich.Windl at rz.uni-regensburg.de
Thu Feb 24 05:28:14 EST 2022
Hi!
I just discovered this oddity on a SLES15 SP3 cluster:
Feb 24 11:16:17 h16 pacemaker-attrd[7274]: notice: Setting val_net_gw1[h18]: 1000 -> 139000
That surprised me, because usually the value is 1000 or 0.
Digging a bit further, I found:
Migration Summary:
* Node: h18:
* prm_ping_gw1: migration-threshold=1000000 fail-count=1 last-failure='Thu Feb 24 11:17:18 2022'
Failed Resource Actions:
* prm_ping_gw1_monitor_60000 on h18 'error' (1): call=200, status='Error', exitreason='', last-rc-change='2022-02-24 11:17:18 +01:00', queued=0ms, exec=0ms
Digging further:
Feb 24 11:16:17 h18 kernel: BUG: Bad rss-counter state mm:00000000c620b5fe idx:1 val:17
Feb 24 11:16:17 h18 pacemaker-attrd[6946]: notice: Setting val_net_gw1[h18]: 1000 -> 139000
Feb 24 11:17:17 h18 kernel: traps: pacemaker-execd[38950] general protection fault ip:7f610e71cbcf sp:7ffff7c25100 error:0 in libc-2.31.so[7f610e63b000+1e6000]
(The rss-counter bug, which causes a series of core dumps, seems to be a new "feature" of SLES15 SP3 kernels; it is being investigated by support.)
Somewhat later:
Feb 24 11:17:18 h18 pacemaker-attrd[6946]: notice: Setting val_net_gw1[h18]: 139000 -> (unset)
(restarted RA)
Feb 24 11:17:21 h18 pacemaker-attrd[6946]: notice: Setting val_net_gw1[h18]: (unset) -> 1000
Another node:
Feb 24 11:16:17 h19 pacemaker-attrd[7435]: notice: Setting val_net_gw1[h18]: 1000 -> 139000
Feb 24 11:17:18 h19 pacemaker-attrd[7435]: notice: Setting val_net_gw1[h18]: 139000 -> (unset)
Feb 24 11:17:21 h19 pacemaker-attrd[7435]: notice: Setting val_net_gw1[h18]: (unset) -> 1000
So it seems the ping RA sets some garbage value when failing. Is that correct?
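One possible reading of the number, offered as a guess rather than a confirmed diagnosis: the ping RA publishes the count of reachable hosts times its `multiplier` parameter, so 139000 would be a count of 139 with a multiplier of 1000 (assumed here from the usual 0/1000 values). 139 is also what a shell reports as the exit status of a process killed by signal 11 (128 + 11), which would fit the crashes logged at the same time. Whether the agent really lets an exit status leak into the count is an assumption. The small Python sketch below only does the arithmetic; `decode_score` is a hypothetical helper, not part of Pacemaker or resource-agents.

```python
import signal

# Assumption: multiplier=1000, inferred from the attribute normally being 0 or 1000.
MULTIPLIER = 1000


def decode_score(score, multiplier=MULTIPLIER):
    """Split an attribute value into (count, signal name or None).

    A count above 128 is interpreted as a shell-style exit status of
    128 + signal number. This interpretation is a hypothesis.
    """
    count, remainder = divmod(score, multiplier)
    if remainder != 0:
        raise ValueError(f"{score} is not a multiple of {multiplier}")
    sig_name = None
    if count > 128:
        try:
            sig_name = signal.Signals(count - 128).name
        except ValueError:
            # No known signal with that number on this platform.
            sig_name = None
    return count, sig_name


if __name__ == "__main__":
    for value in (0, 1000, 139000):
        print(value, decode_score(value))
```

With the values from the logs, 1000 decodes to a count of 1 with no signal, while 139000 decodes to a count of 139, which corresponds to SIGSEGV under this interpretation.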
resource-agents-4.8.0+git30.d0077df0-150300.8.20.1.x86_64
pacemaker-2.0.5+20201202.ba59be712-150300.4.16.1.x86_64
Regards,
Ulrich