[ClusterLabs] Antw: [EXT] Re: Peer (slave) node deleting master's transient_attributes

Stuart Massey djangoschef at gmail.com
Tue Feb 2 08:02:26 EST 2021


A reasonable question from a practical on-the-ground perspective. Several
considerations contributed:

   1. We need to induce the hardware errors to demonstrate the problem to
   the hardware vendor, who is acting at their own pace. Leaving the node in
   its semi-active state seems to achieve this goal.
   2. We would like to retain as up-to-date a secondary copy of our
   mission-critical data as possible. Leaving drbd up in Secondary mode
   under HA control was the most direct way to do this.
   3. We have had occasional fail-overs over time (prior to the current
   hardware issues) that we have had difficulty fully explaining. We chalked
   them up to transient communications errors with sequences and effects we
   didn't fully comprehend, but have remained uneasy about our particular HA
   configuration and about crm/pacemaker/drbd generally. This seems like a
   good opportunity to try to "catch" and/or fully understand this issue.
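For reference, the arrangement described in items 1 and 2 (node in maintenance mode, resources banned from it, drbd left replicating as Secondary) could be sketched roughly as follows with pcs; the node and resource names here (node01, AppGroup, DrbdData-clone) are placeholders for illustration, not taken from our actual configuration:

```shell
# Hypothetical sketch, not our exact commands.
# Put the failing node in maintenance mode so Pacemaker stops
# managing resources on it:
pcs node maintenance node01

# Ban ordinary resources from the node entirely:
pcs constraint location AppGroup avoids node01

# Keep the DRBD clone allowed on node01 but never promoted there,
# so it keeps replicating as Secondary:
pcs constraint location DrbdData-clone rule role=Master \
    score=-INFINITY '#uname' eq node01
```

(crmsh users would express the same thing with `crm node maintenance` and equivalent location constraints.)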

Regards

On Tue, Feb 2, 2021 at 2:11 AM Ulrich Windl <
Ulrich.Windl at rz.uni-regensburg.de> wrote:

> >>> Ken Gaillot <kgaillot at redhat.com> schrieb am 01.02.2021 um 17:27 in
> Nachricht
> <9b99d08faf4ddbe496ede10165f586afd81aa850.camel at redhat.com>:
> > On Mon, 2021-02-01 at 11:16 -0500, Stuart Massey wrote:
> >> Andrei,
> >> You are right, thank you. I have an earlier thread on which I posted
> >> a pacemaker.log for this issue, and didn't think to point to it here.
> >> The link is
> >> http://project.ibss.net/samples/deidPacemakerLog.2021-01-25.txtxt .
> >> So, node01 is in maintenance mode, and constraints prevent any
> >> resources from running on it (other than drbd in Secondary). I would
> >> not want node01 to ston[node02]ith after a communications failure,
> >> especially not if all resources are running fine on node02.
> >> Also I did not think to wonder if node01 could become DC even though
> >> in maintenance mode.
> >> The logs seem to me to match this contention. The cib ops happen
> >> right in the middle of the DC negotiations.
> >> Is there a way to tell node01 that it cannot be DC? Like a
> >> constraint?
> >
> > No, though that's been suggested as a new feature.
>
> I wonder: if the node is running no resources, the node is in
> maintenance-mode, and the node shouldn't be DC, wouldn't it be easiest to
> cleanly shut down the node? What would be the difference? Or, asked the
> other way: what is the purpose of such a scenario?
>
> >
> > As a workaround, you could restart the cluster on the less preferred
> > node -- the controller with the most CPU time (i.e. up the longest)
> > will be preferred for DC (if pacemaker versions are equal).
> >
> >> Thanks again.
> >>
> >>
> >>
> >> On Sun, Jan 31, 2021 at 1:55 AM Andrei Borzenkov <arvidjaar at gmail.com
> >> > wrote:
> >> > 29.01.2021 20:37, Stuart Massey wrote:
> >> > > Can someone help me with this?
> >> > > Background:
> >> > >
> >> > > "node01" is failing, and has been placed in "maintenance" mode.
> >> > It
> >> > > occasionally loses connectivity.
> >> > >
> >> > > "node02" is able to run our resources
> >> > >
> >> > > Consider the following messages from pacemaker.log on "node02",
> >> > just after
> >> > > "node01" has rejoined the cluster (per "node02"):
> >> > >
> >> > > Jan 28 14:48:03 [21933] node02.example.com        cib:     info:
> >> > > cib_perform_op:       --
> >> > > /cib/status/node_state[@id='2']/transient_attributes[@id='2']
> >> > > Jan 28 14:48:03 [21933] node02.example.com        cib:     info:
> >> > > cib_perform_op:       +  /cib:  @num_updates=309
> >> > > Jan 28 14:48:03 [21933] node02.example.com        cib:     info:
> >> > > cib_process_request:  Completed cib_delete operation for section
> >> > > //node_state[@uname='node02.example.com']/transient_attributes:
> >> > OK (rc=0,
> >> > > origin=node01.example.com/crmd/3784, version=0.94.309)
> >> > > Jan 28 14:48:04 [21938] node02.example.com       crmd:     info:
> >> > > abort_transition_graph:       Transition aborted by deletion of
> >> > > transient_attributes[@id='2']: Transient attribute change |
> >> > cib=0.94.309
> >> > > source=abort_unless_down:357
> >> > >
> >> > path=/cib/status/node_state[@id='2']/transient_attributes[@id='2']
> >> > > complete=true
> >> > > Jan 28 14:48:05 [21937] node02.example.com    pengine:     info:
> >> > > master_color: ms_drbd_ourApp: Promoted 0 instances of a possible
> >> > 1 to master
> >> > >
> >> > > The implication, it seems to me, is that "node01" has asked
> >> > "node02" to
> >> > > delete the transient-attributes for "node02". The transient-
> >> > attributes
> >> > > should normally be:
> >> > >       <transient_attributes id="2">
> >> > >         <instance_attributes id="status-2">
> >> > >           <nvpair id="status-2-master-drbd_ourApp"
> >> > > name="master-drbd_ourApp" value="10000"/>
> >> > >           <nvpair id="status-2-pingd" name="pingd" value="100"/>
> >> > >         </instance_attributes>
> >> > >       </transient_attributes>
> >> > >
> >> > > These attributes are necessary for "node02" to be Master/Primary,
> >> > correct?
> >> > >
> >> > > Why might this be happening and how do we prevent it?
> >> > >
> >> >
> >> > You do not provide enough information to answer. At the very least
> >> > you need to show full logs from both nodes around the time it
> >> > happens (starting with both nodes losing connectivity).
> >> >
> >> > But as a wild guess - you do not use stonith, node01 becomes DC and
> >> > clears the other node's state.
> >> > _______________________________________________
> >> > Manage your subscription:
> >> > https://lists.clusterlabs.org/mailman/listinfo/users
> >> >
> >> > ClusterLabs home: https://www.clusterlabs.org/
> >>
> > --
> > Ken Gaillot <kgaillot at redhat.com>
> >
>
>
>
>