[ClusterLabs] Re: [EXT] DRBD ms resource keeps getting demoted

Tue Jan 26 14:01:53 EST 2021

*** HELP ***
Our healthy Primary/Master demoted itself again. This time it did not
re-promote anything until we "refresh"-ed the ms drbd resource.
Note that the failing Slave/Secondary node is in maintenance mode, as it
has been for several days now.
I have posted the pacemaker.log here:
http://project.ibss.net/samples/deidPacemakeLog.2021-01-25.txt
Any insight anyone could offer would be very much appreciated!

On Mon, Jan 25, 2021 at 8:04 AM Stuart Massey <djangoschef at gmail.com> wrote:

> Ok, that is exactly what one might expect -- and: Note that only the
> failing node is in maintenance mode. The current master/primary is not in
> maintenance mode, and on that node we continue to see messages in
> pacemaker.log that seem to indicate that it is doing monitor operations.
> Logically, if one has a multi-node cluster and puts only one of the nodes
> in maintenance mode while there are no managed resources running on it,
> wouldn't the other nodes continue to manage the resources among themselves?
>
> On Mon, Jan 25, 2021 at 2:07 AM Ulrich Windl <
> Ulrich.Windl at rz.uni-regensburg.de> wrote:
>
>> >>> Stuart Massey <djangoschef at gmail.com> wrote on 22.01.2021 at 14:08
>> in message
>> <CABQ68NTGDmxVo_uVLXg0HYtLgsMRGUCvCssa3eRGQfOv+CJ9zQ at mail.gmail.com>:
>> > Hi Ulrich,
>> > Thank you for your response.
>> > It makes sense that this would be happening on the failing
>> > secondary/slave node, in which case we might expect drbd to be restarted
>> > (the service entirely, since it is already demoted) on the slave. I don't
>> > understand how it would affect the master, unless the failing secondary is
>> > causing some issue with drbd on the primary that causes the monitor on the
>> > master to time out for some reason. This does not (so far) seem to be the
>> > case, as the failing node has now been in maintenance mode for a couple of
>> > days with drbd still running as secondary, so if drbd failures on the
>> > secondary were causing the monitor on the Master/Primary to time out, we
>> > should still be seeing that; we are not. The master has yet to demote the
>> > drbd resource since we put the failing node in maintenance.
>>
>> When you are in maintenance mode, monitor operations won't run AFAIK.
>>
>> > We will watch for a bit longer.
>> > Thanks again
>> >
>> > On Thu, Jan 21, 2021 at 2:23 AM Ulrich Windl <
>> > Ulrich.Windl at rz.uni-regensburg.de> wrote:
>> >
>> >> >>> Stuart Massey <stuart.e.massey at gmail.com> wrote on 20.01.2021 at
>> >> 03:41 in message
>> >> <CAJfrB75UPUmZJPjXCoACRDGoG-BqDcJHff5c_OmVBFya53D-dw at mail.gmail.com>:
>> >> > Strahil,
>> >> > That is very kind of you, thanks.
>> >> > I see that in your (feature set 3.4.1) cib, drbd is in a <clone> with
>> >> > some meta_attributes and operations having to do with promotion, while
>> >> > in our (feature set 3.0.14) cib, drbd is in a <master> which does not
>> >> > have those (maybe since promotion is implicit).
>> >> > Our cluster has been working quite well for some time, too. I wonder
>> >> > what would happen if you could hang the os in one of your nodes? If a
>> >> > VM, maybe
>> >>
>> >> Unless some other fencing mechanism (like a watchdog timeout) kicks in,
>> >> the monitor operation is the only thing that can detect a problem (from
>> >> the cluster's view): the monitor operation would time out. Then the
>> >> cluster would try to restart the resource (stop, then start). If the
>> >> stop also times out, the node will be fenced.
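
The escalation described above can be written down as a small decision
function. This is only a sketch of the sequence as stated in this thread
(monitor times out, then stop and start, then fencing if the stop also times
out); it assumes fencing is configured, the function name is hypothetical,
and it is not Pacemaker code.

```python
def recovery_actions(monitor_timed_out, stop_timed_out):
    """Return the ordered recovery actions for one resource.

    Hypothetical helper mirroring the description in this thread,
    assuming fencing is configured on the cluster.
    """
    if not monitor_timed_out:
        # The monitor succeeded, so the cluster sees no problem.
        return []
    if stop_timed_out:
        # The restart begins with a stop; if that stop times out,
        # the node is fenced instead of the resource being started.
        return ["stop", "fence"]
    # Normal restart: stop, then start.
    return ["stop", "start"]
```

For example, recovery_actions(True, False) gives the plain restart, while
recovery_actions(True, True) ends in fencing.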
>> >>
>> >> > the constrained secondary could be starved by setting disk IOPs to
>> >> > something really low. Of course, you are using different versions of
>> >> > just about everything, as we're on centos7.
>> >> > Regards,
>> >> > Stuart
>> >> >
>> >> > On Tue, Jan 19, 2021 at 6:20 PM Strahil Nikolov <
>> hunter86_bg at yahoo.com>
>> >> > wrote:
>> >> >
>> >> >> I have just built a test cluster (centOS 8.3) for testing DRBD and it
>> >> >> works quite fine.
>> >> >> Actually I followed my notes from
>> >> >> https://forums.centos.org/viewtopic.php?t=65539 with the exception of
>> >> >> point 8 due to the "promotable" stuff.
>> >> >>
>> >> >> I'm attaching the output of 'pcs cluster cib file' and I hope it helps
>> >> >> you fix your issue.
>> >> >>
>> >> >> Best Regards,
>> >> >> Strahil Nikolov
>> >> >>
>> >> >>
>> >> >> On 19.01.2021 (Tue) at 09:32 -0500, Stuart Massey wrote:
>> >> >>
>> >> >> Ulrich,
>> >> >> Thank you for that observation. We share that concern.
>> >> >> We have four 1G NICs active, bonded in pairs. One bonded pair serves
>> >> >> the "public" (to the intranet) IPs, and the other bonded pair is
>> >> >> private to the cluster, used for drbd replication. HA will, I hope, be
>> >> >> using the "public" IP, since that is the route to the IP addresses
>> >> >> resolved for the host names; that will certainly be the only route to
>> >> >> the quorum device. I can say that this cluster has run reasonably well
>> >> >> for quite some time with this configuration prior to the recently
>> >> >> developed hardware issues on one of the nodes.
>> >> >> Regards,
>> >> >> Stuart
>> >> >>
>> >> >> On Tue, Jan 19, 2021 at 2:49 AM Ulrich Windl <
>> >> >> Ulrich.Windl at rz.uni-regensburg.de> wrote:
>> >> >>
>> >> >> >>> Stuart Massey <djangoschef at gmail.com> wrote on 19.01.2021 at
>> >> >> 04:46 in message
>> >> >> <CABQ68NQuTyYXcYgwcUpg5TxxaJjwhSp+c6GCOKfOwGyRQSAAjQ at mail.gmail.com>:
>> >> >> > So, we have a 2-node cluster with a quorum device. One of the nodes
>> >> >> > (node1) is having some trouble, so we have added constraints to
>> >> >> > prevent any resources migrating to it, but have not put it in
>> >> >> > standby, so that drbd in secondary on that node stays in sync. The
>> >> >> > problems it is having lead to OS lockups that eventually resolve
>> >> >> > themselves - but that causes it to be temporarily dropped from the
>> >> >> > cluster by the current master (node2). Sometimes when node1 rejoins,
>> >> >> > then node2 will demote the drbd ms resource. That causes all
>> >> >> > resources that depend on it to be stopped, leading to a service
>> >> >> > outage. They are then restarted on node2, since they can't run on
>> >> >> > node1 (due to constraints).
>> >> >> > We are having a hard time understanding why this happens. It seems
>> >> >> > like there may be some sort of DC contention happening. Does anyone
>> >> >> > have any idea how we might prevent this from happening?
>> >> >>
>> >> >> I think that if you are routing high-volume DRBD traffic through "the
>> >> >> same pipe" as the cluster communication, cluster communication may
>> >> >> fail if the pipe is saturated.
>> >> >> I'm not happy with that, but it seems to be that way.
>> >> >>
>> >> >> Maybe running a combination of iftop and iotop could help you
>> understand
>> >> >> what's going on...
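
If iftop is not at hand, a rough per-interface rate can also be derived from
two samples of /proc/net/dev on Linux. The sketch below is an assumption-laden
illustration, not something anyone in this thread ran: the helper names are
made up, and it only reports received and transmitted bytes per second for
each interface, which is enough to see whether the replication bond or the
"public" bond is close to saturation.

```python
import time


def parse_net_dev(text):
    """Parse /proc/net/dev content into {interface: (rx_bytes, tx_bytes)}."""
    counters = {}
    # The first two lines of /proc/net/dev are column headers.
    for line in text.splitlines()[2:]:
        if ":" not in line:
            continue
        name, data = line.split(":", 1)
        fields = data.split()
        # Field 0 is received bytes, field 8 is transmitted bytes.
        counters[name.strip()] = (int(fields[0]), int(fields[8]))
    return counters


def throughput(before, after, seconds):
    """Return {interface: (rx_bytes_per_s, tx_bytes_per_s)} between two samples."""
    rates = {}
    for name, (rx1, tx1) in after.items():
        if name in before:
            rx0, tx0 = before[name]
            rates[name] = ((rx1 - rx0) / seconds, (tx1 - tx0) / seconds)
    return rates


def sample_rates(interval=5.0, path="/proc/net/dev"):
    """Take two samples `interval` seconds apart and return the rates."""
    with open(path) as handle:
        before = parse_net_dev(handle.read())
    time.sleep(interval)
    with open(path) as handle:
        after = parse_net_dev(handle.read())
    return throughput(before, after, interval)
```

Calling sample_rates() on each node during a heavy resync and comparing the
figures for the two bonded interfaces with the 1G link speed would show
whether the pipe is actually full.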
>> >> >>
>> >> >> Regards,
>> >> >> Ulrich
>> >> >>
>> >> >> > Selected messages (de-identified) from pacemaker.log that illustrate
>> >> >> > suspicion re DC confusion are below. The update_dc and
>> >> >> > abort_transition_graph re deletion of lrm seem to always precede the
>> >> >> > demotion, and a demotion seems to always follow (when not already
>> >> >> > demoted).
>> >> >> >
>> >> >> > Jan 18 16:52:17 [21938] node02.example.com       crmd:     info: do_dc_takeover:        Taking over DC status for this partition
>> >> >> > Jan 18 16:52:17 [21938] node02.example.com       crmd:     info: update_dc:     Set DC to node02.example.com (3.0.14)
>> >> >> > Jan 18 16:52:17 [21938] node02.example.com       crmd:     info: abort_transition_graph:        Transition aborted by deletion of lrm[@id='1']: Resource state removal | cib=0.89.327 source=abort_unless_down:357 path=/cib/status/node_state[@id='1']/lrm[@id='1'] complete=true
>> >> >> > Jan 18 16:52:19 [21937] node02.example.com    pengine:     info: master_color:  ms_drbd_ourApp: Promoted 0 instances of a possible 1 to master
>> >> >> > Jan 18 16:52:19 [21937] node02.example.com    pengine:   notice: LogAction:      * Demote     drbd_ourApp:1     (            Master -> Slave node02.example.com )
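
To check across the whole pacemaker.log whether a demotion really always
follows a DC change, the pattern in the excerpt above can be searched for
mechanically. The following is a minimal sketch, assuming the log has one
entry per line in the layout quoted above (timestamp, [pid], host, daemon,
level, function, message). The function names, the 30-second window and the
default year are assumptions chosen for illustration.

```python
import re
from datetime import datetime, timedelta

# Matches entries such as:
# Jan 18 16:52:17 [21938] node02.example.com  crmd:  info: update_dc:  Set DC to ...
LINE_RE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s+\d\d:\d\d:\d\d)\s+\[\d+\]\s+(?P<host>\S+)\s+"
    r"(?P<daemon>\w+):\s+(?P<level>\w+):\s+(?P<func>\w+):\s+(?P<msg>.*)$"
)


def parse_line(line, year=2021):
    """Return (timestamp, function, message), or None if the line does not match."""
    match = LINE_RE.match(line)
    if match is None:
        return None
    # The log has no year, so one is supplied by the caller.
    stamp = " ".join(match.group("stamp").split())
    when = datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")
    return when, match.group("func"), match.group("msg").strip()


def demotions_after_dc_change(lines, window_seconds=30, year=2021):
    """List (dc_change_time, demote_time, message) for each Demote action
    logged within window_seconds after an update_dc 'Set DC to' entry."""
    window = timedelta(seconds=window_seconds)
    last_dc_change = None
    hits = []
    for line in lines:
        parsed = parse_line(line, year)
        if parsed is None:
            continue
        when, func, msg = parsed
        if func == "update_dc" and msg.startswith("Set DC to"):
            last_dc_change = when
        elif func == "LogAction" and "Demote" in msg:
            if last_dc_change is not None and when - last_dc_change <= window:
                hits.append((last_dc_change, when, msg))
    return hits
```

Running demotions_after_dc_change(open("pacemaker.log")) and comparing the
number of hits with the total number of Demote lines would show whether any
demotion happened without a preceding DC change.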
>> >> >>
>> >> >>
>> >> >>
>> >> >>
>> >> >> _______________________________________________
>> >> >> Manage your subscription:
>> >> >> https://lists.clusterlabs.org/mailman/listinfo/users
>> >> >>
>> >> >> ClusterLabs home: https://www.clusterlabs.org/
>> >> >>
>> >>
>> >>
>> >>
>> >>
>>
>>
>>
>>
>