[ClusterLabs] attrd/cib out of sync, master scores not updated in CIB after crmd "Respawn" after internal error [NOT cluster partition/rejoin]

Ken Gaillot kgaillot at redhat.com
Thu Sep 17 15:00:42 EDT 2020


On Tue, 2020-09-15 at 13:25 +0200, Lars Ellenberg wrote:
> On Fri, Sep 11, 2020 at 11:42:46AM +0200, Lars Ellenberg wrote:
> > On Thu, Sep 10, 2020 at 11:18:58AM -0500, Ken Gaillot wrote:
> > > > But for some unrelated reason (stress on the cib, IPC timeout),
> > > > crmd on the DC was doing an error exit and was respawned:
> > > > 
> > > >   cib:     info: cib_process_ping:  Reporting our current digest
> > > >   crmd:    error: do_pe_invoke_callback:     Could not retrieve the Cluster Information Base: Timer expired
> > > >   ...
> > > >   pacemakerd:    error: pcmk_child_exit:   The crmd process (17178) exited: Generic Pacemaker error (201)
> > > >   pacemakerd:   notice: pcmk_process_exit: Respawning failed child process: crmd
> > > > 
> > > > The new DC now causes:
> > > >   cib:     info: cib_perform_op:    Diff: --- 0.971.201 2
> > > >   cib:     info: cib_perform_op:    Diff: +++ 0.971.202 (null)
> > > >   cib:     info: cib_perform_op:    -- /cib/status/node_state[@id='2']/transient_attributes[@id='2']
> > > > 
> > > > But attrd apparently does not notice that the transient attributes it had cached are now gone.
> > > 
> > > This is a known issue. There was some work done on it in stages that never went anywhere:
> > > 
> > > https://github.com/ClusterLabs/pacemaker/pull/1695
> > > 
> > > https://github.com/ClusterLabs/pacemaker/pull/1699
> > > 
> > > https://github.com/ClusterLabs/pacemaker/pull/2020
> > > 
> > > The basic idea is that the controller should ask pacemaker-attrd to clear a node's transient attributes rather than doing so directly, so attrd and the CIB stay in sync. Backward compatibility would be tricky.
> > > 
> > > The fix would only be in Pacemaker 2, since this would require a
> > > feature set bump, which can't be backported.
> > 
> > Thank you for that quick response and all the context above.
> > 
> > You mention below
> > 
> > > the controller
> > > should request node attribute erasure only if the node leaves the
> > > corosync membership, not just the controller CPG.
> > 
> > Would that be a change that could go into the 1.1.x series?
> 
> Suggestion to mitigate the issue:
> 
> Periodically, for example from a monitor action of a simple resource agent script, do:
> 
>    if attrd_updater -n attrd-canary --update 1; then
>      crm_attribute --lifetime reboot --name attrd-canary --query || attrd_updater --refresh
>    fi
> 
> Do you see any possible issues with that approach?
> 
>     Lars

That should work.
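For anyone wiring this up, the canary check from Lars's suggestion can be wrapped as a small shell function suitable for a resource agent's monitor action. A minimal sketch, using the attrd-canary attribute name from the thread; the function name attrd_canary_monitor and the early-return handling are illustrative, not part of any agent API:

```shell
# Sketch of the canary check as a shell function (attrd_canary_monitor
# is an illustrative name; attrd-canary is the attribute name suggested
# in the thread).
attrd_canary_monitor() {
    # Write a transient canary attribute through attrd. If attrd itself
    # is unreachable, there is nothing meaningful to compare yet.
    attrd_updater -n attrd-canary --update 1 || return 0

    # crm_attribute reads the attribute back from the CIB's status
    # section. If attrd accepted the write but the value never reached
    # the CIB, the two are out of sync, so ask attrd to rewrite all of
    # its cached attributes into the CIB.
    crm_attribute --lifetime reboot --name attrd-canary --query \
        >/dev/null 2>&1 || attrd_updater --refresh

    return 0
}
```

The refresh is a no-op when attrd and the CIB agree, so running this from a recurring monitor is cheap in the common case.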
-- 
Ken Gaillot <kgaillot at redhat.com>
