[ClusterLabs] attrd/cib out of sync, master scores not updated in CIB after crmd "Respawn" after internal error [NOT cluster partition/rejoin]
Lars Ellenberg
lars.ellenberg at linbit.com
Tue Sep 15 07:25:07 EDT 2020
On Fri, Sep 11, 2020 at 11:42:46AM +0200, Lars Ellenberg wrote:
> On Thu, Sep 10, 2020 at 11:18:58AM -0500, Ken Gaillot wrote:
> > > But for some unrelated reason (stress on the cib, IPC timeout),
> > > crmd on the DC was doing an error exit and was respawned:
> > >
> > > cib: info: cib_process_ping: Reporting our current digest
> > > crmd: error: do_pe_invoke_callback: Could not retrieve the
> > > Cluster Information Base: Timer expired
> > > ...
> > > pacemakerd: error: pcmk_child_exit: The crmd process (17178)
> > > exited: Generic Pacemaker error (201)
> > > pacemakerd: notice: pcmk_process_exit: Respawning failed child
> > > process: crmd
> > >
> > > The new DC now causes:
> > > cib: info: cib_perform_op: Diff: --- 0.971.201 2
> > > cib: info: cib_perform_op: Diff: +++ 0.971.202 (null)
> > > cib: info: cib_perform_op: --
> > > /cib/status/node_state[@id='2']/transient_attributes[@id='2']
> > >
> > > But the attrd apparently does not notice that transient attributes it
> > > had cached are now gone.
> >
> > This is a known issue. There was some work done on it in stages that
> > never went anywhere:
> >
> > https://github.com/ClusterLabs/pacemaker/pull/1695
> >
> > https://github.com/ClusterLabs/pacemaker/pull/1699
> >
> > https://github.com/ClusterLabs/pacemaker/pull/2020
> >
> > The basic idea is that the controller should ask pacemaker-attrd to
> > clear a node's transient attributes rather than doing so directly, so
> > attrd and the CIB stay in sync. Backward compatibility would be tricky.
> >
> > The fix would only be in Pacemaker 2, since this would require a
> > feature set bump, which can't be backported.
>
> Thank you for that quick response and all the context above.
>
> You mention below
>
> > the controller
> > should request node attribute erasure only if the node leaves the
> > corosync membership, not just the controller CPG.
>
> Would that be a change that could go into the 1.1.x series?

Suggestion to mitigate the issue:

periodically, for example from a monitor action of a simple resource
agent script, do:

    if attrd_updater -n attrd-canary --update 1; then
        crm_attribute --lifetime reboot --name attrd-canary --query || attrd_updater --refresh
    fi

Do you see any possible issues with that approach?

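
To make the idea easier to try out, the same check can be wrapped in a function whose commands are overridable, so it can be dry-run with stubs outside a cluster. This is only a sketch: the function name attrd_canary_check and the variables ATTRD_UPDATER, CRM_ATTRIBUTE and CANARY are made up for illustration and are not part of Pacemaker. It does not deal with any delay between the attrd update and the CIB write.

```shell
#!/bin/sh
# Hypothetical helper; all names defined here are illustrative assumptions.
# The commands can be overridden (for example with stubs) via the environment.
: "${ATTRD_UPDATER:=attrd_updater}"
: "${CRM_ATTRIBUTE:=crm_attribute}"
: "${CANARY:=attrd-canary}"

attrd_canary_check() {
    # Step 1: set the canary through attrd. If that fails, attrd is not
    # usable right now, so do nothing (same as the plain "if" above).
    "$ATTRD_UPDATER" -n "$CANARY" --update 1 || return 0

    # Step 2: look for the canary among the transient (reboot-lifetime)
    # attributes in the CIB. Found means attrd and the CIB agree.
    if "$CRM_ATTRIBUTE" --lifetime reboot --name "$CANARY" --query >/dev/null 2>&1; then
        return 0
    fi

    # Step 3: attrd accepted the value but the CIB does not have it,
    # so ask attrd to write out its cached values again.
    "$ATTRD_UPDATER" --refresh
}
```

A monitor action would call attrd_canary_check and ignore its result, so that the canary never makes the resource itself look failed.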
Lars