[ClusterLabs] Peer (slave) node deleting master's transient_attributes

Stuart Massey djangoschef at gmail.com
Wed Feb 17 14:36:59 EST 2021


Ken, very much appreciate your help on this - I am wondering what you might
have gleaned from the logs.
Thanks!

On Mon, Feb 8, 2021 at 2:43 PM Stuart Massey <djangoschef at gmail.com> wrote:

> Wonderful, thank you for looking at this!
> I have posted uncompressed "saving inputs" files at the links below - 3241
> is the immediately preceding one that exists, and 3242 is the one created
> upon encountering the problem state. In both cases, it looks to me like
> node02 is DC. There are none of these on node01 for the intervening time
> period. I've also posted a patch diff of the two with xml formatted for one
> attribute per line, and am reiterating the link to the related
> pacemaker.log extract.
>
> https://project.ibss.net/samples/pe-input-3242.txt (upon encountering the
> problem demotion)
> https://project.ibss.net/samples/pe-input-3241.txt (most recent previous
> pe-input-*)
> https://project.ibss.net/samples/pe-input-diff.txt
> https://project.ibss.net/samples/deidPacemakerLog.2021-01-25.txt
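For anyone wanting to reproduce a per-attribute diff like the one posted above, here is a minimal sketch (the file names and the exact normalization are assumptions for illustration, not the poster's actual tooling): it renders each XML attribute on its own line so that a plain unified diff shows exactly which attributes changed between two pe-input files.

```python
# Sketch: normalize two Pacemaker "saving inputs" (pe-input) XML files so each
# attribute sits on its own line, then diff them. File names are placeholders.
import difflib
import xml.etree.ElementTree as ET

def one_attr_per_line(xml_text):
    """Render an XML document with one attribute per line, for clean diffs."""
    lines = []
    def walk(elem, depth):
        indent = "  " * depth
        lines.append(f"{indent}<{elem.tag}")
        for name in sorted(elem.attrib):           # stable attribute order
            lines.append(f"{indent}  {name}={elem.attrib[name]!r}")
        lines.append(f"{indent}>")
        for child in elem:
            walk(child, depth + 1)
        lines.append(f"{indent}</{elem.tag}>")
    walk(ET.fromstring(xml_text), 0)
    return lines

def diff_inputs(old_text, new_text):
    """Unified diff of two normalized pe-input documents."""
    return list(difflib.unified_diff(one_attr_per_line(old_text),
                                     one_attr_per_line(new_text),
                                     "pe-input-3241", "pe-input-3242",
                                     lineterm=""))
```

Usage would be along the lines of `diff_inputs(open("pe-input-3241.xml").read(), open("pe-input-3242.xml").read())` after decompressing the .bz2 files.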
>
> Thank you,
> Stuart
>
>
>
> On Mon, Feb 8, 2021 at 12:36 PM Ken Gaillot <kgaillot at redhat.com> wrote:
>
>> On Mon, 2021-02-08 at 12:01 -0500, Stuart Massey wrote:
>> > I'm wondering if anyone can advise us on next steps here and/or
>> > correct our understanding. This seems like a race condition that
>> > causes resources to be stopped unnecessarily. Is there a way to
>> > prevent a node from processing cib updates from a peer while DC
>> > negotiations are underway? Our "node02" is running resources fine,
>>
>> It shouldn't be necessary -- when node02 becomes DC, it shouldn't see
>> itself as unable to run resources, it should probe the current state of
>> everything, and then come to the right conclusion.
>>
>> If you look in the detail log, there should be "saving inputs" messages
>> on the DC at any given time, with a file name. If you can attach the
>> file from when node02 first becomes DC, I can check whether probes are
>> being scheduled.
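A small sketch for pulling those file paths out of a pacemaker.log, in case it helps anyone following along (the regex is an assumption based on the common "saving inputs in ..." message wording, which can vary between Pacemaker versions; adjust it against your own log):

```python
# Sketch: extract pe-input file paths from pacemaker.log lines that mention
# "saving inputs". The exact message wording is an assumption; check your log.
import re

SAVING_RE = re.compile(r"saving inputs in (\S+)")

def saved_inputs(log_lines):
    """Return pe-input file paths mentioned in the given log lines, in order."""
    return [m.group(1) for line in log_lines if (m := SAVING_RE.search(line))]
```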
>>
>> > and since it winds up winning the DC election, would continue to run
>> > them uninterrupted if it just ignored or pended the cib updates it
>> > receives in the middle of the negotiation.
>> > Very much appreciate all the help and discussion available on this
>> > board.
>> > Regards,
>> > Stuart
>> >
>> > On Mon, Feb 1, 2021 at 11:43 AM Stuart Massey <djangoschef at gmail.com>
>> > wrote:
>> > > Sequence seems to be:
>> > > node02 is DC and master/primary, node01 is in maintenance mode and slave/secondary
>> > > comms go down
>> > > node01 elects itself DC, and deletes node02's status from its cib
>> > > comms come up
>> > > cluster starts reforming
>> > > node01 sends cib updates to node02
>> > > DC negotiations start, both nodes unset DC
>> > > node02 receives the cib updates and processes them, deleting its own status
>> > > DC negotiations complete with node02 winning
>> > > node02, having lost its status, believes it cannot host resources and stops them all
>> > > for whatever reason, perhaps somehow due to the completely missing transient_attributes, node02 never schedules a probe for itself
>> > > we have to "refresh" manually
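A toy model of that sequence (purely illustrative, not Pacemaker's actual data structures) shows why replaying node01's stale delete on node02 leaves node02 with no promotion score, so the next scheduler run sees nothing eligible to promote:

```python
# Illustrative model only (not Pacemaker internals): a toy status section
# showing why a stale transient_attributes delete leaves node02 unpromotable.
status = {
    "node02": {  # transient attributes as in the real cib
        "master-drbd_ourApp": 10000,
        "pingd": 100,
    }
}

def apply_delete_transient_attributes(status, node):
    """Model of the cib_delete for //node_state[...]/transient_attributes."""
    status.pop(node, None)

def promotable(status, node):
    """A node is only promotable if it has a positive promotion score."""
    return status.get(node, {}).get("master-drbd_ourApp", 0) > 0

assert promotable(status, "node02")         # before the stale delete
apply_delete_transient_attributes(status, "node02")
assert not promotable(status, "node02")     # score gone: "Promoted 0 instances"
```

Until something (a probe, or a manual refresh) rewrites the score, the model stays in that unpromotable state, which matches the "Promoted 0 instances of a possible 1 to master" message below.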
>> > >
>> > > On Mon, Feb 1, 2021 at 11:31 AM Ken Gaillot <kgaillot at redhat.com>
>> > > wrote:
>> > > > On Mon, 2021-02-01 at 11:09 -0500, Stuart Massey wrote:
>> > > > > Hi Ken,
>> > > > > Thanks. In this case, transient_attributes for node02 in the cib on
>> > > > > node02, which never lost quorum, seem to be deleted by a request from
>> > > > > node01 when node01 rejoins the cluster - IF I understand the
>> > > > > pacemaker.log correctly. This causes node02 to stop resources, which
>> > > > > will not be restarted until we manually refresh on node02.
>> > > >
>> > > > Good point, it depends on which node is DC. When a cluster splits,
>> > > > each side sees the other side as the one that left. When the split
>> > > > heals, whichever side has the newly elected DC is the one that clears
>> > > > the other.
>> > > >
>> > > > However the DC should schedule probes for the other side, and probes
>> > > > generally set the promotion score, so manual intervention shouldn't be
>> > > > needed. I'd make sure that probes were scheduled, then investigate how
>> > > > the agent sets the score.
>> > > >
>> > > > > On Mon, Feb 1, 2021 at 10:59 AM Ken Gaillot <
>> > > > kgaillot at redhat.com>
>> > > > > wrote:
>> > > > > > On Fri, 2021-01-29 at 12:37 -0500, Stuart Massey wrote:
>> > > > > > > Can someone help me with this?
>> > > > > > > Background:
>> > > > > > > > "node01" is failing, and has been placed in "maintenance" mode.
>> > > > > > > > It occasionally loses connectivity.
>> > > > > > > > "node02" is able to run our resources
>> > > > > > >
>> > > > > > > Consider the following messages from pacemaker.log on "node02",
>> > > > > > > just after "node01" has rejoined the cluster (per "node02"):
>> > > > > > > > Jan 28 14:48:03 [21933] node02.example.com        cib:     info: cib_perform_op:       -- /cib/status/node_state[@id='2']/transient_attributes[@id='2']
>> > > > > > > > Jan 28 14:48:03 [21933] node02.example.com        cib:     info: cib_perform_op:       +  /cib:  @num_updates=309
>> > > > > > > > Jan 28 14:48:03 [21933] node02.example.com        cib:     info: cib_process_request:  Completed cib_delete operation for section //node_state[@uname='node02.example.com']/transient_attributes: OK (rc=0, origin=node01.example.com/crmd/3784, version=0.94.309)
>> > > > > > > > Jan 28 14:48:04 [21938] node02.example.com       crmd:     info: abort_transition_graph:       Transition aborted by deletion of transient_attributes[@id='2']: Transient attribute change | cib=0.94.309 source=abort_unless_down:357 path=/cib/status/node_state[@id='2']/transient_attributes[@id='2'] complete=true
>> > > > > > > > Jan 28 14:48:05 [21937] node02.example.com    pengine:     info: master_color: ms_drbd_ourApp: Promoted 0 instances of a possible 1 to master
>> > > > > > > >
>> > > > > > > The implication, it seems to me, is that "node01" has asked
>> > > > > > > "node02" to delete the transient_attributes for "node02". The
>> > > > > > > transient_attributes should normally be:
>> > > > > > >       <transient_attributes id="2">
>> > > > > > >         <instance_attributes id="status-2">
>> > > > > > >           <nvpair id="status-2-master-drbd_ourApp" name="master-drbd_ourApp" value="10000"/>
>> > > > > > >           <nvpair id="status-2-pingd" name="pingd" value="100"/>
>> > > > > > >         </instance_attributes>
>> > > > > > >       </transient_attributes>
>> > > > > > >
>> > > > > > > These attributes are necessary for "node02" to be
>> > > > > > > Master/Primary, correct?
>> > > > > > >
>> > > > > > > Why might this be happening and how do we prevent it?
>> > > > > >
>> > > > > > Transient attributes are always cleared when a node leaves the
>> > > > > > cluster (that's what makes them transient ...). It's probably
>> > > > > > coincidence it went through as the node rejoined.
>> > > > > >
>> > > > > > When the node rejoins, it will trigger another run of the
>> > > > > > scheduler, which will schedule a probe of all resources on the
>> > > > > > node. Those probes should reset the promotion score.
>> > > > > > _______________________________________________
>> > > > > > Manage your subscription:
>> > > > > > https://lists.clusterlabs.org/mailman/listinfo/users
>> > > > > >
>> > > > > > ClusterLabs home: https://www.clusterlabs.org/
>> --
>> Ken Gaillot <kgaillot at redhat.com>
>>
>>
>