[ClusterLabs] Antw: Re: Antw: Re: Antw: [EXT] DRBD ms resource keeps getting demoted

Ulrich Windl Ulrich.Windl at rz.uni-regensburg.de
Mon Jan 25 02:06:50 EST 2021


>>> Stuart Massey <djangoschef at gmail.com> wrote on 22.01.2021 at 14:08 in message
<CABQ68NTGDmxVo_uVLXg0HYtLgsMRGUCvCssa3eRGQfOv+CJ9zQ at mail.gmail.com>:
> Hi Ulrich,
> Thank you for your response.
> It makes sense that this would be happening on the failing secondary/slave
> node, in which case we might expect drbd to be restarted on the slave (the
> service entirely, since it is already demoted). I don't understand how it
> would affect the master, unless the failing secondary is causing some issue
> with drbd on the primary that makes the monitor on the master time out for
> some reason. So far that does not seem to be the case: the failing node has
> now been in maintenance mode for a couple of days with drbd still running as
> secondary, so if drbd failures on the secondary were causing the monitor on
> the master/primary to time out, we should still be seeing that; we are not.
> The master has yet to demote the drbd resource since we put the failing node
> in maintenance.

When you are in maintenance mode, monitor operations won't run AFAIK.
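For example (pcs 0.9.x syntax as shipped with CentOS 7; the node name is only an
illustration, and it is worth double-checking against pcs(8) on your version),
maintenance can be set cluster-wide or per node, and verified, roughly like this:

    pcs property set maintenance-mode=true      # whole cluster unmanaged
    pcs node maintenance node01.example.com     # or: only this node unmanaged
    pcs property show maintenance-mode          # confirm the current setting

While maintenance is active, Pacemaker leaves the affected resources unmanaged
and, as far as I know, also pauses their recurring monitors, so no new monitor
results (or failures) are recorded for them.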

> We will watch for a bit longer.
> Thanks again
> 
> On Thu, Jan 21, 2021 at 2:23 AM Ulrich Windl <
> Ulrich.Windl at rz.uni-regensburg.de> wrote:
> 
>> >>> Stuart Massey <stuart.e.massey at gmail.com> wrote on 20.01.2021 at 03:41
>> in message
>> <CAJfrB75UPUmZJPjXCoACRDGoG-BqDcJHff5c_OmVBFya53D-dw at mail.gmail.com>:
>> > Strahil,
>> > That is very kind of you, thanks.
>> > I see that in your (feature set 3.4.1) cib, drbd is in a <clone> with some
>> > meta_attributes and operations having to do with promotion, while in our
>> > (feature set 3.0.14) cib, drbd is in a <master> which does not have those
>> > (maybe since promotion is implicit).
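(For reference, the two forms usually just come from different pcs generations;
roughly like the following, treating the resource names below as the ones from
your logs rather than your exact configuration:

    # pcs 0.9 / feature set 3.0.x: wraps the primitive in a <master> element
    pcs resource master ms_drbd_ourApp drbd_ourApp \
        master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 notify=true

    # pcs 0.10 / feature set 3.4.x: wraps it in a <clone> with promotable=true
    pcs resource promotable drbd_ourApp notify=true

Functionally they should be equivalent; the newer form just spells out the
promotion-related meta_attributes that the old <master> element left implicit.)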
>> > Our cluster has been working quite well for some time, too. I wonder what
>> > would happen if you could hang the OS on one of your nodes? If a VM, maybe
>>
>> Unless some other fencing mechanism (like a watchdog timeout) kicks in, the
>> monitor operation is the only thing that can detect a problem (from the
>> cluster's view): the monitor operation would time out. Then the cluster
>> would try to restart the resource (stop, then start). If the stop also
>> times out, the node will be fenced.
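(If you want to see which timeouts currently apply to the DRBD resource,
something like the following should show them; the resource name is taken from
the logs in this thread and the syntax is pcs 0.9.x, so please check pcs(8)
before trusting it:

    pcs resource show ms_drbd_ourApp   # configured op monitor/stop timeouts
    crm_mon -1f                        # one-shot status including fail counts

If the stop timeout looks too tight for a loaded node, raising it would look
roughly like this, with the value being purely illustrative:

    pcs resource update drbd_ourApp op stop interval=0s timeout=120s
)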
>>
>> > the constrained secondary could be starved by setting disk IOPS to
>> > something really low. Of course, you are using different versions of just
>> > about everything, as we're on CentOS 7.
>> > Regards,
>> > Stuart
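(If the node you want to hang/starve is a libvirt/KVM guest, one crude way to
simulate a very slow disk is an I/O throttle on its backing device; the domain
and target names here are made up, so substitute your own:

    virsh domblklist testnode                    # find the disk target, e.g. vda
    virsh blkdeviotune testnode vda --total-iops-sec 20 --live

Setting --total-iops-sec back to 0 removes the limit again.)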
>> >
>> > On Tue, Jan 19, 2021 at 6:20 PM Strahil Nikolov <hunter86_bg at yahoo.com>
>> > wrote:
>> >
>> >> I have just built a test cluster (centOS 8.3) for testing DRBD and it
>> >> works quite fine.
>> >> Actually I followed my notes from
>> >> https://forums.centos.org/viewtopic.php?t=65539 with the exception of
>> >> point 8 due to the "promotable" stuff.
>> >>
>> >> I'm attaching the output of 'pcs cluster cib file' and I hope it helps
>> >> you fix your issue.
>> >>
>> >> Best Regards,
>> >> Strahil Nikolov
>> >>
>> >>
>> >> On 19.01.2021 at 09:32 -0500 (Tue), Stuart Massey wrote:
>> >>
>> >> Ulrich,
>> >> Thank you for that observation. We share that concern.
>> >> We have four 1G NICs active, bonded in pairs. One bonded pair serves the
>> >> "public" (to the intranet) IPs, and the other bonded pair is private to
>> >> the cluster, used for drbd replication. HA will, I hope, be using the
>> >> "public" IP, since that is the route to the IP addresses resolved for the
>> >> host names; that will certainly be the only route to the quorum device. I
>> >> can say that this cluster has run reasonably well for quite some time
>> >> with this configuration prior to the recently developed hardware issues
>> >> on one of the nodes.
>> >> Regards,
>> >> Stuart
>> >>
>> >> On Tue, Jan 19, 2021 at 2:49 AM Ulrich Windl <
>> >> Ulrich.Windl at rz.uni-regensburg.de> wrote:
>> >>
>> >> >>> Stuart Massey <djangoschef at gmail.com> wrote on 19.01.2021 at 04:46
>> >> in message
>> >> <CABQ68NQuTyYXcYgwcUpg5TxxaJjwhSp+c6GCOKfOwGyRQSAAjQ at mail.gmail.com>:
>> >> > So, we have a 2-node cluster with a quorum device. One of the nodes
>> >> > (node1) is having some trouble, so we have added constraints to prevent
>> >> > any resources migrating to it, but have not put it in standby, so that
>> >> > drbd in secondary on that node stays in sync. The problems it is having
>> >> > lead to OS lockups that eventually resolve themselves - but that causes
>> >> > it to be temporarily dropped from the cluster by the current master
>> >> > (node2). Sometimes when node1 rejoins, then node2 will demote the drbd
>> >> > ms resource. That causes all resources that depend on it to be stopped,
>> >> > leading to a service outage. They are then restarted on node2, since
>> >> > they can't run on node1 (due to constraints).
>> >> > We are having a hard time understanding why this happens. It seems like
>> >> > there may be some sort of DC contention happening. Does anyone have any
>> >> > idea how we might prevent this from happening?
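(For completeness, the constraints described above would typically look
something like the following in pcs; the resource and node names are
placeholders based on this thread, not your actual configuration:

    # keep ordinary resources off the failing node
    pcs constraint location someResource avoids node01.example.com
    # let the DRBD slave keep running there, but never promote it there
    pcs constraint location ms_drbd_ourApp rule role=master score=-INFINITY \#uname eq node01.example.com

The second form keeps the secondary in sync while still ruling out promotion on
node1.)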
>> >>
>> >> I think if you are routing high-volume DRBD traffic through "the same
>> >> pipe" as the cluster communication, cluster communication may fail if
>> >> the pipe is saturated.
>> >> I'm not happy with that, but it seems to be that way.
>> >>
>> >> Maybe running a combination of iftop and iotop could help you understand
>> >> what's going on...
>> >>
>> >> Regards,
>> >> Ulrich
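(Concretely, something along these lines, run on the current primary while the
other node is acting up; the interface names are only examples, use whatever
your bonds are actually called:

    iftop -i bond0      # traffic on the bond carrying corosync
    iftop -i bond1      # traffic on the bond carrying DRBD replication
    iotop -o -P         # only processes currently doing disk I/O

That should show whether cluster traffic and DRBD replication end up competing
for the same link while the secondary is struggling.)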
>> >>
>> >> > Selected messages (de-identified) from pacemaker.log that illustrate
>> >> > suspicion re DC confusion are below. The update_dc and
>> >> > abort_transition_graph re deletion of lrm seem to always precede the
>> >> > demotion, and a demotion seems to always follow (when not already
>> >> > demoted).
>> >> >
>> >> > Jan 18 16:52:17 [21938] node02.example.com       crmd:     info:
>> >> > do_dc_takeover:        Taking over DC status for this partition
>> >> > Jan 18 16:52:17 [21938] node02.example.com       crmd:     info:
>> >> > update_dc:
>> >> >     Set DC to node02.example.com (3.0.14)
>> >> > Jan 18 16:52:17 [21938] node02.example.com       crmd:     info:
>> >> > abort_transition_graph:        Transition aborted by deletion of
>> >> > lrm[@id='1']: Resource state removal | cib=0.89.327
>> >> > source=abort_unless_down:357
>> >> > path=/cib/status/node_state[@id='1']/lrm[@id='1'] complete=true
>> >> > Jan 18 16:52:19 [21937] node02.example.com    pengine:     info:
>> >> > master_color:  ms_drbd_ourApp: Promoted 0 instances of a possible 1 to
>> >> > master
>> >> > Jan 18 16:52:19 [21937] node02.example.com    pengine:   notice:
>> >> > LogAction:
>> >> >      * Demote     drbd_ourApp:1     (            Master -> Slave
>> >> > node02.example.com )
>> >>
>> >>
>> >>
>> >>
>>
>>
>>
>> _______________________________________________
>> Manage your subscription:
>> https://lists.clusterlabs.org/mailman/listinfo/users 
>>
>> ClusterLabs home: https://www.clusterlabs.org/ 
>>





More information about the Users mailing list