[ClusterLabs] attrd/cib out of sync, master scores not updated in CIB after crmd "Respawn" after internal error [NOT cluster partition/rejoin]

Thu Sep 10 12:18:58 EDT 2020

On Thu, 2020-09-10 at 16:03 +0200, Lars Ellenberg wrote:
> Now with "reproducer" ... see below
> 
> On Thu, Sep 10, 2020 at 11:55:20AM +0200, Lars Ellenberg wrote:
> > Hi there.
> > 
> > I've seen a scenario where a network "hiccup" isolated the current DC
> > in a 3-node cluster for a short time; the other partition obviously
> > elected a new DC, and all node attributes of the former DC were
> > "cleared" together with the rest of its state.
> 
> I have to correct myself here.
> Network and membership remained stable, even the CIB CPG did not
> notice anything.
> 
> But for some unrelated reason (stress on the cib, IPC timeout),
> crmd on the DC exited with an error and was respawned:
> 
>   cib:     info: cib_process_ping:  Reporting our current digest
>   crmd:    error: do_pe_invoke_callback:     Could not retrieve the Cluster Information Base: Timer expired
>   ...
>   pacemakerd:    error: pcmk_child_exit:   The crmd process (17178) exited: Generic Pacemaker error (201)
>   pacemakerd:   notice: pcmk_process_exit: Respawning failed child process: crmd
> 
> The new DC now causes:
>   cib:     info: cib_perform_op:    Diff: --- 0.971.201 2
>   cib:     info: cib_perform_op:    Diff: +++ 0.971.202 (null)
>   cib:     info: cib_perform_op:    -- /cib/status/node_state[@id='2']/transient_attributes[@id='2']
>
> But the attrd apparently does not notice that transient attributes it
> had cached are now gone.

This is a known issue. Some work on it was started, in stages, but was
never completed:

https://github.com/ClusterLabs/pacemaker/pull/1695

https://github.com/ClusterLabs/pacemaker/pull/1699

https://github.com/ClusterLabs/pacemaker/pull/2020

The basic idea is that the controller should ask pacemaker-attrd to
clear a node's transient attributes rather than doing so directly, so
attrd and the CIB stay in sync. Backward compatibility would be tricky.

The fix would only be in Pacemaker 2, since this would require a
feature set bump, which can't be backported.

> Reprobes are going on, and all give the expected results.
> But unchanged (from the perspective of the attrd on the former DC,
> the one with the crmd Respawn) master scores will not be re-populated
> to the CIB, preventing a later switchover of the Master role
> (that is when it became apparent that something was wrong).
> 
> A "reproducer" in the sense of "reproduces approximate behavior",
> even if not the exact scenario (crmd emergency respawn and DC
> re-election):
> 
>  * have a healthy cluster with some master scores set
>  * delete transient node attributes:
>    cibadmin -D --xpath "/cib/status/node_state[@id='2']/transient_attributes[@id='2']"
>    (or whatever your node id is; the resource should not be promoted
>    on that node at that time, or this will result in resource
>    "recovery" actions, which will change the master score, and we
>    have a different effect)
> 
> Any cached node attributes (master scores) on that node
> will "never" make it to the CIB (until they eventually change their
> value).
> 
> How can this be fixed?
>    * for the "cibadmin -D" case? (do we even want to?)

It should be possible for attrd to get notifications for CIB changes
and update its internal caches accordingly. (I thought it already did
that for permanent node attribute changes, but I can't find where that
happens.)
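
To make that concrete, here is a toy model in Python. It is only a
sketch of the idea, not Pacemaker code: ToyCib, ToyAttrd, the watch_cib
switch and the attribute name "master-drbd0" are all made up for
illustration. It assumes attrd writes to the CIB only when a value
differs from its cache, which is the behavior described above.

```python
# Toy model only: these classes are hypothetical and are not Pacemaker code.

class ToyCib:
    """Stands in for the transient attributes in the CIB status section."""

    def __init__(self):
        self.transient = {}   # (node, name) -> value
        self.listeners = []   # callbacks invoked after a deletion

    def write(self, node, name, value):
        self.transient[(node, name)] = value

    def delete_transient(self, node):
        # Models the controller (or "cibadmin -D") removing a node's
        # transient_attributes element without going through attrd.
        for key in [k for k in self.transient if k[0] == node]:
            del self.transient[key]
        for listener in self.listeners:
            listener(node)


class ToyAttrd:
    """Caches attribute values and writes to the CIB only on change."""

    def __init__(self, cib, watch_cib):
        self.cib = cib
        self.cache = {}       # (node, name) -> value
        if watch_cib:
            cib.listeners.append(self.on_cib_delete)

    def update(self, node, name, value):
        if self.cache.get((node, name)) == value:
            return            # unchanged from attrd's view: nothing written
        self.cache[(node, name)] = value
        self.cib.write(node, name, value)

    def on_cib_delete(self, node):
        # Forget cached values for the node, so the next update is
        # treated as a change and written to the CIB again.
        for key in [k for k in self.cache if k[0] == node]:
            del self.cache[key]


def run(watch_cib):
    cib = ToyCib()
    attrd = ToyAttrd(cib, watch_cib)
    attrd.update("2", "master-drbd0", "10000")
    cib.delete_transient("2")
    attrd.update("2", "master-drbd0", "10000")  # periodic monitor, same score
    return cib.transient.get(("2", "master-drbd0"))
```

With run(False) the score never comes back after the deletion, which is
the symptom reported here; with run(True) the next monitor result
repopulates it. Dropping the cache entry is just one option. Having
attrd rewrite its cached values immediately would be another.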

>    * for the "DC re-election" and one crmd "temporarily not available"
>      case as in the scenario described here?
>      (I think we should)

As described above, I think attrd should always remain in control of
CIB modifications for node attributes. I'm also thinking the controller
should request node attribute erasure only if the node leaves the
corosync membership, not just the controller CPG.
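
As a rough illustration of those two points, again a toy Python model
and not Pacemaker code (ToyAttrd, clear_node and on_peer_change are
hypothetical names): erasure goes through attrd, so cache and CIB are
cleared together, and it is requested only when the node has left the
corosync membership.

```python
# Toy model only: hypothetical names, not Pacemaker code.
from dataclasses import dataclass, field


@dataclass
class ToyAttrd:
    """Owns both its cache and the CIB copy of transient attributes."""
    cache: dict = field(default_factory=dict)  # (node, name) -> value
    cib: dict = field(default_factory=dict)    # (node, name) -> value

    def set(self, node, name, value):
        self.cache[(node, name)] = value
        self.cib[(node, name)] = value

    def clear_node(self, node):
        # Cache and CIB are cleared in one place, so they cannot diverge.
        for key in [k for k in self.cache if k[0] == node]:
            del self.cache[key]
            self.cib.pop(key, None)


def on_peer_change(attrd, node, in_membership, in_controller_cpg):
    """Return True if attribute erasure was requested for the node.

    in_controller_cpg is deliberately ignored: a controller that merely
    left its CPG (for example, a respawn) does not trigger erasure.
    """
    if not in_membership:
        attrd.clear_node(node)
        return True
    return False
```

In this model a crmd respawn on a node that is still a corosync member
(in_membership=True, in_controller_cpg=False) leaves the master score
in place, while a real membership loss clears it from both attrd and
the CIB.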

> > All nodes rejoin, "all happy again", BUT ...
> > the attrd of the former DC apparently had some cached node attribute
> > values, which are now no longer present in the cib.
> > Specifically, some master scores.
> > So the master scores for the former DC (that was lost, then rejoined)
> > are now "only" in its attrd, but (as long as they don't change) will
> > never be flushed to the CIB.
> > 
> > The policy engine therefore no longer considers this node as a
> > possible promotion candidate.
> >
> > Again: the master score did not change, not from the perspective of
> > the attrd on the node which was isolated for a short time, anyways.
> >
> > But since that node "left", the two-node partition deleted the node
> > state of the "lost" node (including master scores).
> > Then that node rejoined.
> > 
> > Now, I have a cib without that master score, an attrd with that
> > master score value still "cached", and some periodic monitor that
> > will just reset this same (already cached in attrd) master score.
> > But that apparently will never reach the CIB.
> > 
> > So.
> > Question is: anyone seen anything like that before?
> > Could that be fixed already?
> > Version in that scenario was: 1.1.20+ (almost .21).
> > 
> > Obviously "stonith" would have fixed it,
> > then that node would not have just rejoined, but rebooted, then
> > rejoined, and its attrd would not have any cached values anymore ;-)
> >
> > I suppose attrd attributes should sync with the last CIB on re-join?
> > I'd hope it does something like that already?
> > If it does nothing yet, then maybe that's the obvious fix.
> > If it does something, then maybe this boils down to some funky
> > timing issue?
> > 
> > How would I go about trying to create a reproducer?
> > 
> 
> Thanks,
>  
>      Lars
-- 
Ken Gaillot <kgaillot at redhat.com>