[ClusterLabs] Peer (slave) node deleting master's transient_attributes

Mon Feb 1 11:01:23 EST 2021

On Mon, 2021-02-01 at 09:58 -0600, Ken Gaillot wrote:
> On Fri, 2021-01-29 at 12:37 -0500, Stuart Massey wrote:
> > Can someone help me with this?
> > Background:
> > > "node01" is failing, and has been placed in "maintenance" mode. It
> > > occasionally loses connectivity.
> > > "node02" is able to run our resources.
> > 
> > Consider the following messages from pacemaker.log on "node02", just
> > after "node01" has rejoined the cluster (per "node02"):
> > > Jan 28 14:48:03 [21933] node02.example.com        cib:     info: cib_perform_op:       -- /cib/status/node_state[@id='2']/transient_attributes[@id='2']
> > > Jan 28 14:48:03 [21933] node02.example.com        cib:     info: cib_perform_op:       +  /cib:  @num_updates=309
> > > Jan 28 14:48:03 [21933] node02.example.com        cib:     info: cib_process_request:  Completed cib_delete operation for section //node_state[@uname='node02.example.com']/transient_attributes: OK (rc=0, origin=node01.example.com/crmd/3784, version=0.94.309)
> > > Jan 28 14:48:04 [21938] node02.example.com       crmd:     info: abort_transition_graph:       Transition aborted by deletion of transient_attributes[@id='2']: Transient attribute change | cib=0.94.309 source=abort_unless_down:357 path=/cib/status/node_state[@id='2']/transient_attributes[@id='2'] complete=true
> > > Jan 28 14:48:05 [21937] node02.example.com    pengine:     info: master_color: ms_drbd_ourApp: Promoted 0 instances of a possible 1 to master
> > > 
> > 
> > The implication, it seems to me, is that "node01" has asked "node02"
> > to delete the transient-attributes for "node02". The
> > transient-attributes should normally be:
> >       <transient_attributes id="2">
> >         <instance_attributes id="status-2">
> >           <nvpair id="status-2-master-drbd_ourApp" name="master-drbd_ourApp" value="10000"/>
> >           <nvpair id="status-2-pingd" name="pingd" value="100"/>
> >         </instance_attributes>
> >       </transient_attributes>
> > 
> > These attributes are necessary for "node02" to be Master/Primary,
> > correct? 
> > 
> > Why might this be happening and how do we prevent it?
> 
> Transient attributes are always cleared when a node leaves the cluster
> (that's what makes them transient ...). It's probably coincidence it
> went through as the node rejoined.
> 
> When the node rejoins, it will trigger another run of the scheduler,
> which will schedule a probe of all resources on the node. Those probes
> should reset the promotion score.

To elaborate a bit, it's actually up to the resource agent and/or user
how to set the promotion score, but most agents do it in the monitor
action (including probes). It's possible to set the score manually with
crm_master, and to set it as a permanent attribute rather than a
transient one, but letting the agent set it, as transient, is generally
better.
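
As a rough way to check whether a node currently has a promotion score
recorded, the sketch below parses a transient_attributes fragment like
the one quoted above and looks for the master-<resource> nvpair. The
function name promotion_score and the STATUS_XML constant are
hypothetical, chosen only for illustration; the attribute naming is
taken from the example in this thread, and the value is returned as a
string because scores are not always plain integers.

```python
import xml.etree.ElementTree as ET

# Fragment copied from the example quoted earlier in this thread.
STATUS_XML = """
<transient_attributes id="2">
  <instance_attributes id="status-2">
    <nvpair id="status-2-master-drbd_ourApp" name="master-drbd_ourApp" value="10000"/>
    <nvpair id="status-2-pingd" name="pingd" value="100"/>
  </instance_attributes>
</transient_attributes>
"""


def promotion_score(xml_text, resource):
    """Return the value of the master-<resource> nvpair as a string.

    Returns None when the attribute is absent, for example after the
    node's transient attributes have been cleared.
    (Hypothetical helper, not part of Pacemaker.)
    """
    root = ET.fromstring(xml_text)
    wanted = "master-" + resource
    for nvpair in root.iter("nvpair"):
        if nvpair.get("name") == wanted:
            return nvpair.get("value")
    return None


if __name__ == "__main__":
    # Score present, as in the normal state shown above.
    print(promotion_score(STATUS_XML, "drbd_ourApp"))
    # Score missing, as after the deletion seen in the log.
    print(promotion_score('<transient_attributes id="2"/>', "drbd_ourApp"))
```

The same function could be pointed at a larger chunk of CIB XML (for
instance, saved output of cibadmin --query), but it would then need to
select the right node_state first, since this sketch returns the first
matching nvpair it finds.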
-- 
Ken Gaillot <kgaillot at redhat.com>