[ClusterLabs] Peer (slave) node deleting master's transient_attributes

Ken Gaillot kgaillot at redhat.com
Mon Feb 1 11:31:41 EST 2021


On Mon, 2021-02-01 at 11:09 -0500, Stuart Massey wrote:
> Hi Ken,
> Thanks. In this case, the transient_attributes for node02 in the CIB on
> node02, which never lost quorum, seem to be deleted by a request from
> node01 when node01 rejoins the cluster, IF I understand the
> pacemaker.log correctly. This causes node02 to stop resources, which
> will not be restarted until we manually refresh on node02.

Good point; it depends on which node is DC. When a cluster splits, each
side sees the other side as the one that left. When the split heals,
whichever side has the newly elected DC is the one that clears the
other side's transient attributes.
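
To confirm which node requested the deletion, the origin= field of the
cib_process_request line in the pacemaker.log excerpt quoted below can be
pulled out mechanically. The following is a minimal sketch: the regular
expression is an assumption derived from that single log line rather than
a documented log format, and parse_cib_request is a hypothetical helper
name.

```python
import re

# Log line taken from the pacemaker.log excerpt quoted in this thread,
# rejoined onto one line with whitespace normalized.
LINE = (
    "Jan 28 14:48:03 [21933] node02.example.com cib: info: "
    "cib_process_request: Completed cib_delete operation for section "
    "//node_state[@uname='node02.example.com']/transient_attributes: OK "
    "(rc=0, origin=node01.example.com/crmd/3784, version=0.94.309)"
)

# Assumption: the pattern is inferred from the one line above and may need
# adjusting for other Pacemaker versions.
PATTERN = re.compile(
    r"Completed (?P<op>\S+) operation for section (?P<section>\S+): "
    r"(?P<result>\S+) \(rc=(?P<rc>-?\d+), origin=(?P<origin>[^/]+)/"
)


def parse_cib_request(line):
    """Return the operation, section, result, rc and originating node,
    or None if the line is not a completed CIB request."""
    match = PATTERN.search(line)
    if match is None:
        return None
    info = match.groupdict()
    info["rc"] = int(info["rc"])
    return info


info = parse_cib_request(LINE)
print(info["op"], "requested by", info["origin"], "for", info["section"])
```

Run over a whole log, the same helper would show whether every deletion of
a node's transient_attributes originates from the node that just rejoined.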

However, the DC should schedule probes for the other side, and probes
generally set the promotion score, so manual intervention shouldn't be
needed. I'd make sure that probes were scheduled, then investigate how
the agent sets the score.
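
One way to see whether the promotion score came back after the probes is
to look for the master-* attribute in the node's transient_attributes. The
sketch below is hedged: it assumes the status section has been saved as XML
(for example from the output of cibadmin --query), it reuses the attribute
names quoted later in this message, and promotion_scores is a hypothetical
helper name.

```python
import xml.etree.ElementTree as ET

# Status fragment modeled on the transient_attributes quoted in this thread.
# The node_state id and uname are taken from the XPaths in the quoted log.
STATUS_XML = """
<status>
  <node_state id="2" uname="node02.example.com">
    <transient_attributes id="2">
      <instance_attributes id="status-2">
        <nvpair id="status-2-master-drbd_ourApp" name="master-drbd_ourApp" value="10000"/>
        <nvpair id="status-2-pingd" name="pingd" value="100"/>
      </instance_attributes>
    </transient_attributes>
  </node_state>
</status>
"""


def promotion_scores(status_xml, uname):
    """Return {attribute name: value} for the master-* transient
    attributes of the node with the given uname."""
    root = ET.fromstring(status_xml.strip())
    scores = {}
    for node in root.iter("node_state"):
        if node.get("uname") != uname:
            continue
        path = "./transient_attributes/instance_attributes/nvpair"
        for nvpair in node.iterfind(path):
            name = nvpair.get("name", "")
            if name.startswith("master-"):
                scores[name] = nvpair.get("value")
    return scores


print(promotion_scores(STATUS_XML, "node02.example.com"))
```

An empty result for node02 after node01 rejoins would mean the score was
cleared and not set again, which points back at the probes or the agent.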

> On Mon, Feb 1, 2021 at 10:59 AM Ken Gaillot <kgaillot at redhat.com>
> wrote:
> > On Fri, 2021-01-29 at 12:37 -0500, Stuart Massey wrote:
> > > Can someone help me with this?
> > > Background:
> > > > "node01" is failing, and has been placed in "maintenance" mode. It
> > > > occasionally loses connectivity.
> > > > "node02" is able to run our resources.
> > > 
> > > Consider the following messages from pacemaker.log on "node02", just
> > > after "node01" has rejoined the cluster (per "node02"):
> > > > Jan 28 14:48:03 [21933] node02.example.com        cib:     info: cib_perform_op:       -- /cib/status/node_state[@id='2']/transient_attributes[@id='2']
> > > > Jan 28 14:48:03 [21933] node02.example.com        cib:     info: cib_perform_op:       +  /cib:  @num_updates=309
> > > > Jan 28 14:48:03 [21933] node02.example.com        cib:     info: cib_process_request:  Completed cib_delete operation for section //node_state[@uname='node02.example.com']/transient_attributes: OK (rc=0, origin=node01.example.com/crmd/3784, version=0.94.309)
> > > > Jan 28 14:48:04 [21938] node02.example.com       crmd:     info: abort_transition_graph:       Transition aborted by deletion of transient_attributes[@id='2']: Transient attribute change | cib=0.94.309 source=abort_unless_down:357 path=/cib/status/node_state[@id='2']/transient_attributes[@id='2'] complete=true
> > > > Jan 28 14:48:05 [21937] node02.example.com    pengine:     info: master_color: ms_drbd_ourApp: Promoted 0 instances of a possible 1 to master
> > > > 
> > > The implication, it seems to me, is that "node01" has asked "node02"
> > > to delete the transient attributes for "node02". The transient
> > > attributes should normally be:
> > >       <transient_attributes id="2">
> > >         <instance_attributes id="status-2">
> > >           <nvpair id="status-2-master-drbd_ourApp" name="master-drbd_ourApp" value="10000"/>
> > >           <nvpair id="status-2-pingd" name="pingd" value="100"/>
> > >         </instance_attributes>
> > >       </transient_attributes>
> > > 
> > > These attributes are necessary for "node02" to be Master/Primary,
> > > correct? 
> > > 
> > > Why might this be happening and how do we prevent it?
> > 
> > Transient attributes are always cleared when a node leaves the cluster
> > (that's what makes them transient ...). It's probably coincidence it
> > went through as the node rejoined.
> > 
> > When the node rejoins, it will trigger another run of the scheduler,
> > which will schedule a probe of all resources on the node. Those probes
> > should reset the promotion score.
-- 
Ken Gaillot <kgaillot at redhat.com>


