[ClusterLabs] Antw: Re: Antw: [EXT] DRBD ms resource keeps getting demoted
Strahil Nikolov
hunter86_bg at yahoo.com
Sat Jan 23 08:31:11 EST 2021
Sorry for the top posting. My iSCSILogicalUnit resource is blocking failover on
"standby" (I think it's a bug in the resource agent), yet without it DRBD
fails over properly:
[root@drbd1 ~]# pcs resource show DRBD
 Resource: DRBD (class=ocf provider=linbit type=drbd)
  Attributes: drbd_resource=drbd0
  Operations: demote interval=0s timeout=90 (DRBD-demote-interval-0s)
              monitor interval=30 role=Slave (DRBD-monitor-interval-30)
              monitor interval=15 role=Master (DRBD-monitor-interval-15)
              notify interval=0s timeout=90 (DRBD-notify-interval-0s)
              promote interval=0s timeout=90 (DRBD-promote-interval-0s)
              reload interval=0s timeout=30 (DRBD-reload-interval-0s)
              start interval=0s timeout=240 (DRBD-start-interval-0s)
              stop interval=0s timeout=100 (DRBD-stop-interval-0s)

[root@drbd1 ~]# pcs resource show DRBD-CLONE
 Master: DRBD-CLONE
  Meta Attrs: clone-max=3 clone-node-max=1 master-max=1 master-node-max=1 notify=true
  Resource: DRBD (class=ocf provider=linbit type=drbd)
   Attributes: drbd_resource=drbd0
   Operations: demote interval=0s timeout=90 (DRBD-demote-interval-0s)
               monitor interval=30 role=Slave (DRBD-monitor-interval-30)
               monitor interval=15 role=Master (DRBD-monitor-interval-15)
               notify interval=0s timeout=90 (DRBD-notify-interval-0s)
               promote interval=0s timeout=90 (DRBD-promote-interval-0s)
               reload interval=0s timeout=30 (DRBD-reload-interval-0s)
               start interval=0s timeout=240 (DRBD-start-interval-0s)
               stop interval=0s timeout=100 (DRBD-stop-interval-0s)
Best Regards,
Strahil Nikolov
On Thu, 21.01.2021 at 23:30 -0500, Stuart Massey wrote:
> Hi Ulrich,
> Thank you for your response.
> It makes sense that this would be happening on the failing,
> secondary/slave node, in which case we might expect drbd to be
> restarted (entirely, since it is already demoted) on the slave. I
> don't see how it would affect the master, unless the failing
> secondary is causing some issue with drbd on the primary that causes
> the monitor on the master to time out for some reason. This does not
> (so far) seem to be the case, as the failing node has now been in
> maintenance mode for a couple of days with drbd still running as
> secondary, so if drbd failures on the secondary were causing the
> monitor on the Master/Primary to time out, we should still be seeing
> that; we are not. The master has yet to demote the drbd resource
> since we put the failing node in maintenance.
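> (For reference, per-node maintenance can be toggled roughly like this - the
> node name is only a placeholder:
>
>     pcs node maintenance node01.example.com     # pacemaker stops managing/monitoring resources on that node
>     pcs node unmaintenance node01.example.com   # hand control back to pacemaker
> )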
> We will watch for a bit longer.
> Thanks again
>
>
>
> On Thu, Jan 21, 2021, 2:23 AM Ulrich Windl <Ulrich.Windl at rz.uni-regensburg.de> wrote:
> > >>> Stuart Massey <stuart.e.massey at gmail.com> wrote on 20.01.2021 at 03:41 in message
> > <CAJfrB75UPUmZJPjXCoACRDGoG-BqDcJHff5c_OmVBFya53D-dw at mail.gmail.com>:
> >
> > > Strahil,
> > > That is very kind of you, thanks.
> > > I see that in your (feature set 3.4.1) cib, drbd is in a <clone> with some
> > > meta_attributes and operations having to do with promotion, while in our
> > > (feature set 3.0.14) cib, drbd is in a <master> which does not have those
> > > (maybe since promotion is implicit).
> > > Our cluster has been working quite well for some time, too. I wonder what
> > > would happen if you could hang the os in one of your nodes? If a VM, maybe
> >
> >
> >
> > Unless some other fencing mechanism (like watchdog timeout) kicks in, the
> > monitor operation is the only thing that can detect a problem (from the
> > cluster's view): the monitor operation would time out. Then the cluster
> > would try to restart the resource (stop, then start). If stop also times
> > out, the node will be fenced.
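> > A quick way to sanity-check that chain (only a sketch, using the resource
> > name shown earlier in this thread):
> >
> >     pcs property show stonith-enabled   # fencing must be enabled for the last step
> >     pcs stonith show                    # ...and a fence device must actually exist
> >     pcs resource show DRBD              # the monitor/stop timeouts that drive the restart/fence decisions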
> >
> >
> >
> > > the constrained secondary could be starved by setting disk IOPs to
> > > something really low. Of course, you are using different versions of just
> > > about everything, as we're on centos7.
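> > > For a libvirt/KVM guest that could be approximated with something like the
> > > following (domain and disk names are made up, adjust to the real VM):
> > >
> > >     virsh blkdeviotune drbd-test-vm vda --total-iops-sec 20   # throttle the virtual disk hard
> > >     virsh blkdeviotune drbd-test-vm vda --total-iops-sec 0    # 0 removes the limit again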
> >
> > > Regards,
> >
> > > Stuart
> >
> > >
> >
> > > On Tue, Jan 19, 2021 at 6:20 PM Strahil Nikolov <hunter86_bg at yahoo.com> wrote:
> > >
> > >> I have just built a test cluster (centOS 8.3) for testing DRBD and it
> > >> works quite fine.
> > >> Actually I followed my notes from
> > >> https://forums.centos.org/viewtopic.php?t=65539 with the exception of
> > >> point 8 due to the "promotable" stuff.
> > >>
> > >> I'm attaching the output of 'pcs cluster cib file' and I hope it helps you
> > >> fix your issue.
> > >>
> > >> Best Regards,
> > >> Strahil Nikolov
> > >>
> >
> > >> On Tue, 19.01.2021 at 09:32 -0500, Stuart Massey wrote:
> >
> > >>
> >
> > >> Ulrich,
> > >> Thank you for that observation. We share that concern.
> > >> We have 4 ea 1G nics active, bonded in pairs. One bonded pair serves the
> > >> "public" (to the intranet) IPs, and the other bonded pair is private to the
> > >> cluster, used for drbd replication. HA will, I hope, be using the "public"
> > >> IP, since that is the route to the IP addresses resolved for the host
> > >> names; that will certainly be the only route to the quorum device. I can
> > >> say that this cluster has run reasonably well for quite some time with this
> > >> configuration prior to the recently developed hardware issues on one of the
> > >> nodes.
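> > >> (To double-check which of the two bonds the cluster traffic actually uses,
> > >> something along these lines should show it - paths and key names are the
> > >> usual defaults, not verified here:
> > >>
> > >>     corosync-cfgtool -s                               # ring status and the address each ring is bound to
> > >>     grep -E 'ring0_addr|bindnetaddr' /etc/corosync/corosync.conf
> > >> )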
> >
> > >> Regards,
> >
> > >> Stuart
> >
> > >>
> >
> > >> On Tue, Jan 19, 2021 at 2:49 AM Ulrich Windl <Ulrich.Windl at rz.uni-regensburg.de> wrote:
> > >>
> > >> >>> Stuart Massey <djangoschef at gmail.com> wrote on 19.01.2021 at 04:46 in message
> > >> <CABQ68NQuTyYXcYgwcUpg5TxxaJjwhSp+c6GCOKfOwGyRQSAAjQ at mail.gmail.com>:
> >
> > >> > So, we have a 2-node cluster with a quorum device. One of the nodes (node1)
> > >> > is having some trouble, so we have added constraints to prevent any
> > >> > resources migrating to it, but have not put it in standby, so that drbd in
> > >> > secondary on that node stays in sync. The problems it is having lead to OS
> > >> > lockups that eventually resolve themselves - but that causes it to be
> > >> > temporarily dropped from the cluster by the current master (node2).
> > >> > Sometimes when node1 rejoins, then node2 will demote the drbd ms resource.
> > >> > That causes all resources that depend on it to be stopped, leading to a
> > >> > service outage. They are then restarted on node2, since they can't run on
> > >> > node1 (due to constraints).
> > >> > We are having a hard time understanding why this happens. It seems like
> > >> > there may be some sort of DC contention happening. Does anyone have any
> > >> > idea how we might prevent this from happening?
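> > >> > (For context: "constraints to prevent any resources migrating to it" would
> > >> > typically be -INFINITY location rules on the dependent resources, roughly
> > >> > like the following - resource and node names here are only placeholders:
> > >> >
> > >> >     pcs constraint location ourApp-group avoids node1.example.com
> > >> >     pcs resource ban ourApp-ip node1.example.com    # equivalent, per resource
> > >> > )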
> >
> > >>
> >
> > >> I think if you are routing high-volume DRBD traffic through "the same
> > >> pipe" as the cluster communication, cluster communication may fail if the
> > >> pipe is saturated.
> > >> I'm not happy with that, but it seems to be that way.
> > >>
> > >> Maybe running a combination of iftop and iotop could help you understand
> > >> what's going on...
> >
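> > >> For example (the interface name is just a guess at the replication bond):
> > >>
> > >>     iftop -i bond1     # per-connection bandwidth on the DRBD/replication link
> > >>     iotop -o           # only processes/threads that are currently doing I/O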
> > >>
> >
> > >> Regards,
> >
> > >> Ulrich
> >
> > >>
> >
> > >> > Selected messages (de-identified) from pacemaker.log that illustrate
> > >> > suspicion re DC confusion are below. The update_dc and
> > >> > abort_transition_graph re deletion of lrm seem to always precede the
> > >> > demotion, and a demotion seems to always follow (when not already demoted).
> > >> >
> > >> > Jan 18 16:52:17 [21938] node02.example.com crmd: info: do_dc_takeover: Taking over DC status for this partition
> > >> > Jan 18 16:52:17 [21938] node02.example.com crmd: info: update_dc: Set DC to node02.example.com (3.0.14)
> > >> > Jan 18 16:52:17 [21938] node02.example.com crmd: info: abort_transition_graph: Transition aborted by deletion of lrm[@id='1']: Resource state removal | cib=0.89.327 source=abort_unless_down:357 path=/cib/status/node_state[@id='1']/lrm[@id='1'] complete=true
> > >> > Jan 18 16:52:19 [21937] node02.example.com pengine: info: master_color: ms_drbd_ourApp: Promoted 0 instances of a possible 1 to master
> > >> > Jan 18 16:52:19 [21937] node02.example.com pengine: notice: LogAction: * Demote drbd_ourApp:1 ( Master -> Slave node02.example.com )
> > >>
>
>
> _______________________________________________
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users
>
> ClusterLabs home: https://www.clusterlabs.org/