[ClusterLabs] Peer (slave) node deleting master's transient_attributes
Ken Gaillot
kgaillot at redhat.com
Mon Feb 1 10:58:48 EST 2021
On Fri, 2021-01-29 at 12:37 -0500, Stuart Massey wrote:
> Can someone help me with this?
> Background:
> > "node01" is failing, and has been placed in "maintenance" mode. It
> > occasionally loses connectivity.
> > "node02" is able to run our resources
>
> Consider the following messages from pacemaker.log on "node02", just
> after "node01" has rejoined the cluster (per "node02"):
> > Jan 28 14:48:03 [21933] node02.example.com cib: info:
> > cib_perform_op: --
> > /cib/status/node_state[@id='2']/transient_attributes[@id='2']
> > Jan 28 14:48:03 [21933] node02.example.com cib: info:
> > cib_perform_op: + /cib: @num_updates=309
> > Jan 28 14:48:03 [21933] node02.example.com cib: info:
> > cib_process_request: Completed cib_delete operation for section
> > //node_state[@uname='node02.example.com']/transient_attributes: OK
> > (rc=0, origin=node01.example.com/crmd/3784, version=0.94.309)
> > Jan 28 14:48:04 [21938] node02.example.com crmd: info:
> > abort_transition_graph: Transition aborted by deletion of
> > transient_attributes[@id='2']: Transient attribute change |
> > cib=0.94.309 source=abort_unless_down:357
> > path=/cib/status/node_state[@id='2']/transient_attributes[@id='2']
> > complete=true
> > Jan 28 14:48:05 [21937] node02.example.com pengine: info:
> > master_color: ms_drbd_ourApp: Promoted 0 instances of a possible 1
> > to master
> >
> The implication, it seems to me, is that "node01" has asked "node02"
> to delete the transient-attributes for "node02". The transient-
> attributes should normally be:
> <transient_attributes id="2">
>   <instance_attributes id="status-2">
>     <nvpair id="status-2-master-drbd_ourApp" name="master-drbd_ourApp" value="10000"/>
>     <nvpair id="status-2-pingd" name="pingd" value="100"/>
>   </instance_attributes>
> </transient_attributes>
>
> These attributes are necessary for "node02" to be Master/Primary,
> correct?
>
> Why might this be happening and how do we prevent it?
Transient attributes are always cleared when a node leaves the cluster
(that's what makes them transient ...). It's probably a coincidence that
the deletion went through just as the node rejoined.
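To check which node requested a given deletion and when, the origin= field
of the cib_process_request message can be extracted from the log. The
following is only a rough sketch: it assumes the wrapped log line from the
quoted excerpt has been rejoined onto a single line, and the regular
expression and the parse_cib_request name are illustrative, not part of
Pacemaker.

```python
import re

# Log line copied from the quoted pacemaker.log excerpt, rejoined onto one line.
LOG_LINE = (
    "Jan 28 14:48:03 [21933] node02.example.com cib: info: "
    "cib_process_request: Completed cib_delete operation for section "
    "//node_state[@uname='node02.example.com']/transient_attributes: OK "
    "(rc=0, origin=node01.example.com/crmd/3784, version=0.94.309)"
)

# Hypothetical pattern, written only for this one message shape.
PATTERN = re.compile(
    r"Completed (?P<op>\w+) operation for section (?P<section>\S+): "
    r"(?P<status>\w+) \(rc=(?P<rc>-?\d+), "
    r"origin=(?P<origin>[^/]+)/(?P<client>[^/]+)/\d+"
)


def parse_cib_request(line):
    """Return the operation, section, status, rc, origin node and client
    found in a cib_process_request line, or None if the line does not match."""
    match = PATTERN.search(line)
    return match.groupdict() if match else None


info = parse_cib_request(LOG_LINE)
print(info["op"], "requested by", info["origin"], "via", info["client"])
print("section:", info["section"])
```

Running this over every cib_delete line around the time of the membership
change should show whether the deletion lines up with the node leaving or
with it rejoining.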
When the node rejoins, it will trigger another run of the scheduler,
which will schedule a probe of all resources on the node. Those probes
should reset the promotion score.
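To see whether the promotion score has come back after the probes, the
transient_attributes section of the node's status can be inspected. The
following is a minimal sketch that works on the XML quoted above, pasted in
as a string; promotion_score is a hypothetical helper, and the
"master-<resource>" attribute name is simply taken from that XML.

```python
import xml.etree.ElementTree as ET

# Transient attributes as quoted in the original message.
STATUS_XML = """
<transient_attributes id="2">
  <instance_attributes id="status-2">
    <nvpair id="status-2-master-drbd_ourApp" name="master-drbd_ourApp" value="10000"/>
    <nvpair id="status-2-pingd" name="pingd" value="100"/>
  </instance_attributes>
</transient_attributes>
"""

# Assumed shape of the section right after the attributes were cleared.
CLEARED_XML = '<transient_attributes id="2"/>'


def promotion_score(xml_text, resource):
    """Return the value of the master-<resource> nvpair as a string,
    or None if no such attribute is present."""
    root = ET.fromstring(xml_text)
    wanted = "master-" + resource
    for nvpair in root.iter("nvpair"):
        if nvpair.get("name") == wanted:
            return nvpair.get("value")
    return None


print("before clearing:", promotion_score(STATUS_XML, "drbd_ourApp"))
print("after clearing: ", promotion_score(CLEARED_XML, "drbd_ourApp"))
```

While the attribute is missing, the scheduler has no promotion score for the
node, which is consistent with the "Promoted 0 instances" message in the
quoted log.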
--
Ken Gaillot <kgaillot at redhat.com>