[ClusterLabs] Antw: [EXT] Re: Peer (slave) node deleting master's transient_attributes

Ulrich Windl Ulrich.Windl at rz.uni-regensburg.de
Tue Feb 2 02:11:32 EST 2021


>>> Ken Gaillot <kgaillot at redhat.com> wrote on 01.02.2021 at 17:27 in message
<9b99d08faf4ddbe496ede10165f586afd81aa850.camel at redhat.com>:
> On Mon, 2021-02-01 at 11:16 -0500, Stuart Massey wrote:
>> Andrei,
>> You are right, thank you. I have an earlier thread on which I posted
>> a pacemaker.log for this issue, and didn't think to point to it here.
>> The link is 
>> http://project.ibss.net/samples/deidPacemakerLog.2021-01-25.txtxt .
>> So, node01 is in maintenance mode, and constraints prevent any
>> resources from running on it (other than drbd in Secondary). I would
>> not want node01 to fence (STONITH) node02 after a communications
>> failure, especially not if all resources are running fine on node02.
>> Also, I did not think to wonder whether node01 could become DC even
>> though it is in maintenance mode.
>> The logs seem to me to match this contention. The cib ops happen
>> right in the middle of the DC negotiations.
>> Is there a way to tell node01 that it cannot be DC? Like a
>> constraint?
> 
> No, though that's been suggested as a new feature.

I wonder: if the node is running no resources, is in maintenance mode, and
shouldn't be DC, wouldn't it be easiest to shut it down cleanly? What would
be the difference? Or, asked the other way around: what is the purpose of
such a scenario?
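For example (a minimal sketch, assuming the pcs shell is in use; node names
are the ones used in this thread, and crmsh has equivalent commands):

    # Stop the cluster services cleanly on the failing node so it no longer
    # takes part in DC elections or writes to the CIB:
    pcs cluster stop node01.example.com
    # Optionally keep them from starting again at boot until the hardware
    # is fixed:
    pcs cluster disable node01.example.com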

> 
> As a workaround, you could restart the cluster on the less preferred
> node -- the controller with the most CPU time (i.e. up the longest)
> will be preferred for DC (if pacemaker versions are equal).
> 
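As a sketch of that workaround (again assuming pcs; node names as in this
thread, crmsh has equivalents):

    # Restart the cluster services on the node that should NOT become DC,
    # so its controller ends up with the least accumulated CPU time:
    pcs cluster stop node01.example.com
    pcs cluster start node01.example.com
    # Afterwards, check which node won the DC election:
    crm_mon -1 | grep -i "current dc"
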
>> Thanks again.
>> 
>> 
>> 
>> On Sun, Jan 31, 2021 at 1:55 AM Andrei Borzenkov <arvidjaar at gmail.com 
>> > wrote:
>> > 29.01.2021 20:37, Stuart Massey wrote:
>> > > Can someone help me with this?
>> > > Background:
>> > > 
>> > > "node01" is failing, and has been placed in "maintenance" mode.
>> > It
>> > > occasionally loses connectivity.
>> > > 
>> > > "node02" is able to run our resources
>> > > 
>> > > Consider the following messages from pacemaker.log on "node02",
>> > just after
>> > > "node01" has rejoined the cluster (per "node02"):
>> > > 
>> > > Jan 28 14:48:03 [21933] node02.example.com        cib:     info: cib_perform_op:       -- /cib/status/node_state[@id='2']/transient_attributes[@id='2']
>> > > Jan 28 14:48:03 [21933] node02.example.com        cib:     info: cib_perform_op:       +  /cib:  @num_updates=309
>> > > Jan 28 14:48:03 [21933] node02.example.com        cib:     info: cib_process_request:  Completed cib_delete operation for section //node_state[@uname='node02.example.com']/transient_attributes: OK (rc=0, origin=node01.example.com/crmd/3784, version=0.94.309)
>> > > Jan 28 14:48:04 [21938] node02.example.com       crmd:     info: abort_transition_graph:       Transition aborted by deletion of transient_attributes[@id='2']: Transient attribute change | cib=0.94.309 source=abort_unless_down:357 path=/cib/status/node_state[@id='2']/transient_attributes[@id='2'] complete=true
>> > > Jan 28 14:48:05 [21937] node02.example.com    pengine:     info: master_color: ms_drbd_ourApp: Promoted 0 instances of a possible 1 to master
>> > > 
>> > > The implication, it seems to me, is that "node01" has asked
>> > "node02" to
>> > > delete the transient-attributes for "node02". The transient-
>> > attributes
>> > > should normally be:
>> > >       <transient_attributes id="2">
>> > >         <instance_attributes id="status-2">
>> > >           <nvpair id="status-2-master-drbd_ourApp" name="master-drbd_ourApp" value="10000"/>
>> > >           <nvpair id="status-2-pingd" name="pingd" value="100"/>
>> > >         </instance_attributes>
>> > >       </transient_attributes>
>> > > 
>> > > These attributes are necessary for "node02" to be Master/Primary,
>> > correct?
>> > > 
>> > > Why might this be happening and how do we prevent it?
>> > > 
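[Side note when reading this back: to check whether the transient attributes
quoted above are actually present, a minimal sketch using the pacemaker CLI
and the node/attribute names from that snippet:

    # Query the promotion score, which lives as a transient
    # (reboot-lifetime) node attribute in the status section:
    crm_attribute --node node02.example.com --name master-drbd_ourApp \
                  --lifetime reboot --query
    # Query the pingd attribute directly from the attribute manager:
    attrd_updater --name pingd --node node02.example.com --query
]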
>> > 
>> > You do not provide enough information to answer. At the very least
>> > you need to show full logs from both nodes around the time it happens
>> > (starting with both nodes losing connectivity).
>> > 
>> > But as a wild guess: you do not use stonith, node01 becomes DC and
>> > clears the other node's state.
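(Following up on that wild guess, a quick check as a sketch, using the
standard cluster property name:

    # Is fencing enabled at all?  A value of "false" here would fit the
    # guess above that stonith is not in use:
    crm_attribute --type crm_config --name stonith-enabled --query
)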
> -- 
> Ken Gaillot <kgaillot at redhat.com>
> 
> _______________________________________________
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users 
> 
> ClusterLabs home: https://www.clusterlabs.org/ 

More information about the Users mailing list