[ClusterLabs] Peer (slave) node deleting master's transient_attributes
Andrei Borzenkov
arvidjaar at gmail.com
Sun Jan 31 01:55:31 EST 2021
On 29.01.2021 20:37, Stuart Massey wrote:
> Can someone help me with this?
> Background:
>
> "node01" is failing, and has been placed in "maintenance" mode. It
> occasionally loses connectivity.
>
> "node02" is able to run our resources
>
> Consider the following messages from pacemaker.log on "node02", just after
> "node01" has rejoined the cluster (per "node02"):
>
> Jan 28 14:48:03 [21933] node02.example.com cib: info: cib_perform_op: -- /cib/status/node_state[@id='2']/transient_attributes[@id='2']
> Jan 28 14:48:03 [21933] node02.example.com cib: info: cib_perform_op: + /cib: @num_updates=309
> Jan 28 14:48:03 [21933] node02.example.com cib: info: cib_process_request: Completed cib_delete operation for section //node_state[@uname='node02.example.com']/transient_attributes: OK (rc=0, origin=node01.example.com/crmd/3784, version=0.94.309)
> Jan 28 14:48:04 [21938] node02.example.com crmd: info: abort_transition_graph: Transition aborted by deletion of transient_attributes[@id='2']: Transient attribute change | cib=0.94.309 source=abort_unless_down:357 path=/cib/status/node_state[@id='2']/transient_attributes[@id='2'] complete=true
> Jan 28 14:48:05 [21937] node02.example.com pengine: info: master_color: ms_drbd_ourApp: Promoted 0 instances of a possible 1 to master
>
> The implication, it seems to me, is that "node01" has asked "node02" to
> delete the transient-attributes for "node02". The transient-attributes
> should normally be:
> <transient_attributes id="2">
>   <instance_attributes id="status-2">
>     <nvpair id="status-2-master-drbd_ourApp" name="master-drbd_ourApp" value="10000"/>
>     <nvpair id="status-2-pingd" name="pingd" value="100"/>
>   </instance_attributes>
> </transient_attributes>
>
> These attributes are necessary for "node02" to be Master/Primary, correct?
>
> Why might this be happening and how do we prevent it?
>
You do not provide enough information to answer. At the very least you
need to show the full logs from both nodes around the time this happens
(starting with both nodes losing connectivity).
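
Assuming crm_report is available on your nodes, something like the
following should collect the logs and CIB history from all nodes into a
single archive (the time window here is only an example; adjust it to
cover your incident):

    # gather cluster logs and CIB state from all reachable nodes
    crm_report -f "2021-01-28 14:00:00" -t "2021-01-28 15:00:00" /tmp/node-report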
But as a wild guess: you do not use stonith, so node01 becomes DC after
rejoining and clears the other node's state.
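
One quick way to check whether fencing is configured (a sketch using
the standard pacemaker tools):

    # show whether fencing is enabled cluster-wide
    crm_attribute --type crm_config --name stonith-enabled --query
    # check the live configuration for warnings such as missing stonith
    crm_verify -L -V

Without fencing, a node that has been out of touch can rejoin, win the
DC election, and wipe the transient attributes of its peer.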