[ClusterLabs] attrd/cib out of sync, master scores not updated in CIB after crmd "Respawn" after internal error [NOT cluster partition/rejoin]
Ken Gaillot
kgaillot at redhat.com
Thu Sep 17 15:00:42 EDT 2020
On Tue, 2020-09-15 at 13:25 +0200, Lars Ellenberg wrote:
> On Fri, Sep 11, 2020 at 11:42:46AM +0200, Lars Ellenberg wrote:
> > On Thu, Sep 10, 2020 at 11:18:58AM -0500, Ken Gaillot wrote:
> > > > But for some unrelated reason (stress on the cib, IPC timeout),
> > > > crmd on the DC exited with an error and was respawned:
> > > >
> > > > cib: info: cib_process_ping: Reporting our current digest
> > > > crmd: error: do_pe_invoke_callback: Could not retrieve the Cluster Information Base: Timer expired
> > > > ...
> > > > pacemakerd: error: pcmk_child_exit: The crmd process (17178) exited: Generic Pacemaker error (201)
> > > > pacemakerd: notice: pcmk_process_exit: Respawning failed child process: crmd
> > > >
> > > > The new DC now causes:
> > > > cib: info: cib_perform_op: Diff: --- 0.971.201 2
> > > > cib: info: cib_perform_op: Diff: +++ 0.971.202 (null)
> > > > cib: info: cib_perform_op: -- /cib/status/node_state[@id='2']/transient_attributes[@id='2']
> > > >
> > > > But attrd apparently does not notice that the transient
> > > > attributes it had cached are now gone.
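
When this state hits, the divergence is directly visible: attrd will
still answer a query for the cached attribute while the CIB status
section no longer contains it. Roughly like this, assuming an
attrd_updater recent enough to support --query (the attribute name
"master-drbd0" and node "node2" here are placeholders, not from the
logs above):

    # attrd's in-memory view of the attribute
    attrd_updater --name master-drbd0 --query --node node2

    # the CIB's view of the same transient attribute
    crm_attribute --name master-drbd0 --lifetime reboot --query --node node2
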
> > >
> > > This is a known issue. There was some work done on it in stages
> > > that never went anywhere:
> > >
> > > https://github.com/ClusterLabs/pacemaker/pull/1695
> > >
> > > https://github.com/ClusterLabs/pacemaker/pull/1699
> > >
> > > https://github.com/ClusterLabs/pacemaker/pull/2020
> > >
> > > The basic idea is that the controller should ask pacemaker-attrd
> > > to clear a node's transient attributes rather than doing so
> > > directly, so attrd and the CIB stay in sync. Backward
> > > compatibility would be tricky.
> > >
> > > The fix would only be in Pacemaker 2, since this would require a
> > > feature set bump, which can't be backported.
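
Until the controller and attrd coordinate that erasure, the manual
recovery is to force attrd to resend everything it has cached to the
CIB status section -- the same thing your canary below falls back to:

    # (advanced) tell the attrd daemon to rewrite all of its cached
    # values into the CIB, bringing the two back in sync
    attrd_updater --refresh
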
> >
> > Thank you for that quick response and all the context above.
> >
> > You mention below
> >
> > > the controller should request node attribute erasure only if the
> > > node leaves the corosync membership, not just the controller CPG.
> >
> > Would that be a change that could go into the 1.1.x series?
>
> Suggestion to mitigate the issue:
>
> periodically, for example from a monitor action of a simple resource
> agent script, do:
>
> if attrd_updater -n attrd-canary --update 1; then
>     crm_attribute --lifetime reboot --name attrd-canary --query ||
>         attrd_updater --refresh
> fi
>
> Do you see any possible issues with that approach?
>
> Lars
That should work.
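
For anyone who wants to package that up, here is a rough sketch of the
check as an OCF-style monitor action (untested and purely illustrative
-- the agent framing around the canary check is mine, not a shipped
agent):

    #!/bin/sh
    # Untested sketch: wrap the canary check in an OCF-style agent.

    canary_monitor() {
        # Write a throwaway transient attribute through attrd ...
        if attrd_updater -n attrd-canary --update 1; then
            # ... and verify it is visible in the CIB status section.
            if ! crm_attribute --lifetime reboot --name attrd-canary \
                    --query >/dev/null 2>&1; then
                # attrd accepted the update but the CIB never saw it:
                # force attrd to resend all of its cached values.
                attrd_updater --refresh
            fi
        fi
        return 0    # OCF_SUCCESS: the canary never fails the resource
    }

    case "$1" in
        monitor)    canary_monitor; exit $? ;;
        start|stop) exit 0 ;;
        *)          exit 3 ;;   # OCF_ERR_UNIMPLEMENTED
    esac

One thing to keep in mind: attrd writes to the CIB asynchronously, so
the very first monitor call after an update may not find the value yet
and trigger a spurious --refresh; that refresh is harmless.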
--
Ken Gaillot <kgaillot at redhat.com>