[ClusterLabs] attrd/cib out of sync, master scores not updated in CIB after crmd "Respawn" after internal error [NOT cluster partition/rejoin]

Lars Ellenberg lars.ellenberg at linbit.com
Tue Sep 15 07:25:07 EDT 2020


On Fri, Sep 11, 2020 at 11:42:46AM +0200, Lars Ellenberg wrote:
> On Thu, Sep 10, 2020 at 11:18:58AM -0500, Ken Gaillot wrote:
> > > But for some unrelated reason (stress on the cib, IPC timeout),
> > > crmd on the DC was doing an error exit and was respawned:
> > > 
> > >   cib:     info: cib_process_ping:  Reporting our current digest
> > >   crmd:    error: do_pe_invoke_callback:     Could not retrieve the Cluster Information Base: Timer expired
> > >   ...
> > >   pacemakerd:    error: pcmk_child_exit:   The crmd process (17178) exited: Generic Pacemaker error (201)
> > >   pacemakerd:   notice: pcmk_process_exit: Respawning failed child process: crmd
> > > 
> > > The new DC now causes:
> > >   cib:     info: cib_perform_op:    Diff: --- 0.971.201 2
> > >   cib:     info: cib_perform_op:    Diff: +++ 0.971.202 (null)
> > >   cib:     info: cib_perform_op:    -- /cib/status/node_state[@id='2']/transient_attributes[@id='2']
> > >
> > > But the attrd apparently does not notice that transient attributes it
> > > had cached are now gone.
> > 
> > This is a known issue. There was some work done on it in stages that
> > never went anywhere:
> > 
> > https://github.com/ClusterLabs/pacemaker/pull/1695
> > 
> > https://github.com/ClusterLabs/pacemaker/pull/1699
> > 
> > https://github.com/ClusterLabs/pacemaker/pull/2020
> > 
> > The basic idea is that the controller should ask pacemaker-attrd to
> > clear a node's transient attributes rather than doing so directly, so
> > attrd and the CIB stay in sync. Backward compatibility would be tricky.
> > 
> > The fix would only be in Pacemaker 2, since this would require a
> > feature set bump, which can't be backported.
> 
> Thank you for that quick response and all the context above.
> 
> You mention below
> 
> > the controller
> > should request node attribute erasure only if the node leaves the
> > corosync membership, not just the controller CPG.
> 
> Would that be a change that could go into the 1.1.x series?

Suggestion to mitigate the issue:

Periodically, for example from the monitor action of a simple resource
agent script, do:

   # write a throw-away attribute via attrd; if that succeeds but the value
   # is not visible in the CIB, tell attrd to resend all of its values
   if attrd_updater -n attrd-canary --update 1; then
     crm_attribute --lifetime reboot --name attrd-canary --query || attrd_updater --refresh
   fi

Do you see any possible issues with that approach?
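
For completeness, a rough, untested sketch of what such a dummy agent
could look like. The agent name "attrd-canary", the state file location,
and the metadata below are just placeholders, not an existing ClusterLabs
agent; the monitor action simply runs the check above:

#!/bin/sh
# attrd-canary: sketch of a dummy OCF agent (not an existing ClusterLabs
# agent).  Its monitor action runs the canary check from the mail above
# and asks attrd to resend its values if the CIB has lost them.

STATEFILE="${HA_RSCTMP:-/run/resource-agents}/attrd-canary-${OCF_RESOURCE_INSTANCE:-default}.state"

canary_check() {
    if attrd_updater -n attrd-canary --update 1; then
        crm_attribute --lifetime reboot --name attrd-canary --query >/dev/null 2>&1 \
            || attrd_updater --refresh
    fi
}

meta_data() {
cat <<'EOF'
<?xml version="1.0"?>
<resource-agent name="attrd-canary" version="0.1">
  <version>1.0</version>
  <longdesc lang="en">Dummy agent whose monitor checks that attrd and the CIB agree.</longdesc>
  <shortdesc lang="en">attrd/CIB canary</shortdesc>
  <parameters/>
  <actions>
    <action name="start" timeout="20s"/>
    <action name="stop" timeout="20s"/>
    <action name="monitor" timeout="20s" interval="60s"/>
    <action name="meta-data" timeout="5s"/>
    <action name="validate-all" timeout="20s"/>
  </actions>
</resource-agent>
EOF
}

case "$1" in
    meta-data)    meta_data; exit 0 ;;
    start)        mkdir -p "$(dirname "$STATEFILE")" && touch "$STATEFILE" && exit 0
                  exit 1 ;;                        # OCF_ERR_GENERIC
    stop)         rm -f "$STATEFILE"; exit 0 ;;
    monitor)      [ -f "$STATEFILE" ] || exit 7    # OCF_NOT_RUNNING
                  canary_check
                  exit 0 ;;                        # OCF_SUCCESS
    validate-all) exit 0 ;;
    *)            exit 3 ;;                        # OCF_ERR_UNIMPLEMENTED
esac

The idea would be to run it as a clone with a recurring monitor, so that
every node performs the check at the monitor interval.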

    Lars


