[ClusterLabs] Peer (slave) node deleting master's transient_attributes

Mon Feb 1 11:43:47 EST 2021

Sequence seems to be:

   - node02 is DC and master/primary, node01 is in maintenance mode and
   slave/secondary
   - comms go down
   - node01 elects itself master, and deletes node02's status from its cib
   - comms come up
   - cluster starts reforming
   - node01 sends cib updates to node02
   - DC negotiations start, both nodes unset DC
   - node02 receives cib updates and processes them, deleting its own status
   - DC negotiations complete with node02 winning
   - node02, having lost its status, believes it cannot host resources and
   stops them all
   - for whatever reason, perhaps somehow due to the completely missing
   transient_attributes, node02 never schedules a probe for itself
   - we have to "refresh" manually
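
The sequence above can be sketched as a toy model. This is not Pacemaker
code: the Node class, its status dict, and the apply_delete and can_promote
methods are hypothetical names invented for illustration. The model assumes
only what the list describes: node01 clears node02's status during the
split, and that delete is then applied on node02 once comms return.

```python
# Toy model of the sequence above; NOT Pacemaker code.
# Node, status, apply_delete and can_promote are hypothetical names.

class Node:
    def __init__(self, name, attrs_by_node):
        self.name = name
        # Local copy of the CIB status section: node name -> transient attributes.
        self.status = {n: dict(a) for n, a in attrs_by_node.items()}

    def apply_delete(self, target):
        """Apply a 'delete transient attributes of target' request to the local copy."""
        self.status[target] = {}

    def can_promote(self, resource="drbd_ourApp"):
        """In this model a node is promotable only if its own promotion score exists."""
        return f"master-{resource}" in self.status[self.name]


initial = {
    "node01": {},
    "node02": {"master-drbd_ourApp": "10000", "pingd": "100"},
}
node01 = Node("node01", initial)
node02 = Node("node02", initial)

# Comms go down: node01 sees node02 as gone and clears node02's status locally.
node01.apply_delete("node02")
print(node02.can_promote())  # node02's own copy is still intact here

# Comms come up: node01's delete is applied on node02 (assumed, per the log).
node02.apply_delete("node02")
print(node02.can_promote())  # node02 no longer has its own promotion score
```

In this model nothing restores the score after the second step, which
corresponds to the missing probe in the last two items of the list.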

On Mon, Feb 1, 2021 at 11:31 AM Ken Gaillot <kgaillot at redhat.com> wrote:

> On Mon, 2021-02-01 at 11:09 -0500, Stuart Massey wrote:
> > Hi Ken,
> > Thanks. In this case, transient_attributes for node02 in the cib on
> > node02 which never lost quorum seem to be deleted by a request from
> > node01 when node01 rejoins the cluster - IF I understand the
> > pacemaker.log correctly. This causes node02 to stop resources, which
> > will not be restarted until we manually refresh on node02.
>
> Good point, it depends on which node is DC. When a cluster splits, each
> side sees the other side as the one that left. When the split heals,
> whichever side has the newly elected DC is the one that clears the
> other.
>
> However the DC should schedule probes for the other side, and probes
> generally set the promotion score, so manual intervention shouldn't be
> needed. I'd make sure that probes were scheduled, then investigate how
> the agent sets the score.
>
> > On Mon, Feb 1, 2021 at 10:59 AM Ken Gaillot <kgaillot at redhat.com>
> > wrote:
> > > On Fri, 2021-01-29 at 12:37 -0500, Stuart Massey wrote:
> > > > Can someone help me with this?
> > > > Background:
> > > > > "node01" is failing, and has been placed in "maintenance" mode.
> > > > > It occasionally loses connectivity.
> > > > > "node02" is able to run our resources
> > > >
> > > > Consider the following messages from pacemaker.log on "node02",
> > > > just after "node01" has rejoined the cluster (per "node02"):
> > > > > Jan 28 14:48:03 [21933] node02.example.com        cib:     info: cib_perform_op:       -- /cib/status/node_state[@id='2']/transient_attributes[@id='2']
> > > > > Jan 28 14:48:03 [21933] node02.example.com        cib:     info: cib_perform_op:       +  /cib:  @num_updates=309
> > > > > Jan 28 14:48:03 [21933] node02.example.com        cib:     info: cib_process_request:  Completed cib_delete operation for section //node_state[@uname='node02.example.com']/transient_attributes: OK (rc=0, origin=node01.example.com/crmd/3784, version=0.94.309)
> > > > > Jan 28 14:48:04 [21938] node02.example.com       crmd:     info: abort_transition_graph:       Transition aborted by deletion of transient_attributes[@id='2']: Transient attribute change | cib=0.94.309 source=abort_unless_down:357 path=/cib/status/node_state[@id='2']/transient_attributes[@id='2'] complete=true
> > > > > Jan 28 14:48:05 [21937] node02.example.com    pengine:     info: master_color: ms_drbd_ourApp: Promoted 0 instances of a possible 1 to master
> > > > >
> > > > The implication, it seems to me, is that "node01" has asked "node02"
> > > > to delete the transient-attributes for "node02". The
> > > > transient-attributes should normally be:
> > > >       <transient_attributes id="2">
> > > >         <instance_attributes id="status-2">
> > > > >           <nvpair id="status-2-master-drbd_ourApp" name="master-drbd_ourApp" value="10000"/>
> > > >           <nvpair id="status-2-pingd" name="pingd" value="100"/>
> > > >         </instance_attributes>
> > > >       </transient_attributes>
> > > >
> > > > These attributes are necessary for "node02" to be Master/Primary,
> > > > correct?
> > > >
> > > > Why might this be happening and how do we prevent it?
> > >
> > > Transient attributes are always cleared when a node leaves the
> > > cluster (that's what makes them transient ...). It's probably
> > > coincidence it went through as the node rejoined.
> > >
> > > When the node rejoins, it will trigger another run of the
> > > scheduler, which will schedule a probe of all resources on the
> > > node. Those probes should reset the promotion score.
> > > _______________________________________________
> > > Manage your subscription:
> > > https://lists.clusterlabs.org/mailman/listinfo/users
> > >
> > > ClusterLabs home: https://www.clusterlabs.org/
> --
> Ken Gaillot <kgaillot at redhat.com>