[ClusterLabs] Peer (slave) node deleting master's transient_attributes
Andrei Borzenkov
arvidjaar at gmail.com
Sun Jan 31 01:55:31 EST 2021
On 29.01.2021 20:37, Stuart Massey wrote:
> Can someone help me with this?
> Background:
>
> "node01" is failing, and has been placed in "maintenance" mode. It
> occasionally loses connectivity.
>
> "node02" is able to run our resources
>
> Consider the following messages from pacemaker.log on "node02", just after
> "node01" has rejoined the cluster (per "node02"):
>
> Jan 28 14:48:03 [21933] node02.example.com cib: info: cib_perform_op: -- /cib/status/node_state[@id='2']/transient_attributes[@id='2']
> Jan 28 14:48:03 [21933] node02.example.com cib: info: cib_perform_op: + /cib: @num_updates=309
> Jan 28 14:48:03 [21933] node02.example.com cib: info: cib_process_request: Completed cib_delete operation for section //node_state[@uname='node02.example.com']/transient_attributes: OK (rc=0, origin=node01.example.com/crmd/3784, version=0.94.309)
> Jan 28 14:48:04 [21938] node02.example.com crmd: info: abort_transition_graph: Transition aborted by deletion of transient_attributes[@id='2']: Transient attribute change | cib=0.94.309 source=abort_unless_down:357 path=/cib/status/node_state[@id='2']/transient_attributes[@id='2'] complete=true
> Jan 28 14:48:05 [21937] node02.example.com pengine: info: master_color: ms_drbd_ourApp: Promoted 0 instances of a possible 1 to master
>
> The implication, it seems to me, is that "node01" has asked "node02" to
> delete the transient-attributes for "node02". The transient-attributes
> should normally be:
> <transient_attributes id="2">
>   <instance_attributes id="status-2">
>     <nvpair id="status-2-master-drbd_ourApp" name="master-drbd_ourApp" value="10000"/>
>     <nvpair id="status-2-pingd" name="pingd" value="100"/>
>   </instance_attributes>
> </transient_attributes>
>
> These attributes are necessary for "node02" to be Master/Primary, correct?
>
> Why might this be happening and how do we prevent it?
>
You do not provide enough information to answer. At the very least you
need to show the full logs from both nodes around the time this happens
(starting with both nodes losing connectivity).
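
Assuming crm_report is available on your nodes, something like the
following should collect the logs and CIB history from all nodes into a
single archive (the time window here is only an example; adjust it to
cover your incident):

    # gather cluster logs and CIB state from all reachable nodes
    crm_report -f "2021-01-28 14:00:00" -t "2021-01-28 15:00:00" /tmp/node-report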
But as a wild guess: you do not use stonith, so node01 becomes DC after
rejoining and clears the other node's state.
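
One quick way to check whether fencing is configured (a sketch using
the standard pacemaker tools):

    # show whether fencing is enabled cluster-wide
    crm_attribute --type crm_config --name stonith-enabled --query
    # check the live configuration for warnings such as missing stonith
    crm_verify -L -V

Without fencing, a node that has been out of touch can rejoin, win the
DC election, and wipe the transient attributes of its peer.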