[ClusterLabs] Antw: [EXT] DRBD ms resource keeps getting demoted

Tue Jan 19 18:20:35 EST 2021

I have just built a test cluster (CentOS 8.3) for testing DRBD and it
works quite well. Actually, I followed my notes from
https://forums.centos.org/viewtopic.php?t=65539 with the exception of
point 8, due to the "promotable" stuff.
I'm attaching the output of 'pcs cluster cib file' and I hope it helps
you fix your issue.
Best Regards,
Strahil Nikolov

At 09:32 -0500 on 19.01.2021 (Tue), Stuart Massey wrote:
> Ulrich,
> Thank you for that observation. We share that concern.
> We have 4 ea 1G nics active, bonded in pairs. One bonded pair serves
> the "public" (to the intranet) IPs, and the other bonded pair is
> private to the cluster, used for drbd replication. HA will, I hope,
> be using the "public" IP, since that is the route to the IP addresses
> resolved for the host names; that will certainly be the only route to
> the quorum device. I can say that this cluster has run reasonably
> well for quite some time with this configuration prior to the
> recently developed hardware issues on one of the nodes.
> Regards,
> Stuart
> 
> On Tue, Jan 19, 2021 at 2:49 AM Ulrich Windl <
> Ulrich.Windl at rz.uni-regensburg.de> wrote:
> > >>> Stuart Massey <djangoschef at gmail.com> wrote on 19.01.2021 at
> > 04:46 in message
> > <CABQ68NQuTyYXcYgwcUpg5TxxaJjwhSp+c6GCOKfOwGyRQSAAjQ at mail.gmail.com>:
> > 
> > > So, we have a 2-node cluster with a quorum device. One of the nodes
> > > (node1) is having some trouble, so we have added constraints to
> > > prevent any resources migrating to it, but have not put it in
> > > standby, so that drbd in secondary on that node stays in sync. The
> > > problems it is having lead to OS lockups that eventually resolve
> > > themselves - but that causes it to be temporarily dropped from the
> > > cluster by the current master (node2). Sometimes when node1 rejoins,
> > > then node2 will demote the drbd ms resource. That causes all
> > > resources that depend on it to be stopped, leading to a service
> > > outage. They are then restarted on node2, since they can't run on
> > > node1 (due to constraints).
> > > We are having a hard time understanding why this happens. It seems
> > > like there may be some sort of DC contention happening. Does anyone
> > > have any idea how we might prevent this from happening?
> > 
> > I think if you are routing high-volume DRBD traffic through "the
> > same pipe" as the cluster communication, cluster communication may
> > fail if the pipe is saturated.
> > 
> > I'm not happy with that, but it seems to be that way.
> > 
> > Maybe running a combination of iftop and iotop could help you
> > understand what's going on...
> > 
> > Regards,
> > 
> > Ulrich
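
The "same pipe" concern can be checked roughly without iftop by sampling
the kernel's per-interface byte counters. The following is only a minimal
sketch, assuming a Linux node where /proc/net/dev is readable; it is not a
replacement for iftop or iotop, and the function names are made up for
illustration.

```python
import time


def parse_net_dev(text):
    """Return {interface: (rx_bytes, tx_bytes)} from /proc/net/dev content."""
    counters = {}
    for line in text.splitlines()[2:]:  # the first two lines are headers
        if ":" not in line:
            continue
        name, data = line.split(":", 1)
        fields = data.split()
        # Field 0 is received bytes, field 8 is transmitted bytes.
        counters[name.strip()] = (int(fields[0]), int(fields[8]))
    return counters


def throughput_mbit(before, after, seconds):
    """Per-interface (rx, tx) rate in Mbit/s between two counter samples."""
    rates = {}
    for name, (rx1, tx1) in after.items():
        if name not in before:
            continue
        rx0, tx0 = before[name]
        rates[name] = ((rx1 - rx0) * 8 / seconds / 1e6,
                       (tx1 - tx0) * 8 / seconds / 1e6)
    return rates


def sample(interval=1.0, path="/proc/net/dev"):
    """Take two samples `interval` seconds apart and return the rates."""
    with open(path) as f:
        first = parse_net_dev(f.read())
    time.sleep(interval)
    with open(path) as f:
        second = parse_net_dev(f.read())
    return throughput_mbit(first, second, interval)
```

Calling sample(5.0) on each node during a resync and comparing the rate
of the bond that carries cluster communication against its nominal
capacity gives a first hint of whether that link is anywhere near
saturation.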
> > 
> > 
> > 
> > > Selected messages (de-identified) from pacemaker.log that illustrate
> > > suspicion re DC confusion are below. The update_dc and
> > > abort_transition_graph re deletion of lrm seem to always precede the
> > > demotion, and a demotion seems to always follow (when not already
> > > demoted).
> > > 
> > > Jan 18 16:52:17 [21938] node02.example.com       crmd:     info: do_dc_takeover:        Taking over DC status for this partition
> > > Jan 18 16:52:17 [21938] node02.example.com       crmd:     info: update_dc:     Set DC to node02.example.com (3.0.14)
> > > Jan 18 16:52:17 [21938] node02.example.com       crmd:     info: abort_transition_graph:        Transition aborted by deletion of lrm[@id='1']: Resource state removal | cib=0.89.327 source=abort_unless_down:357 path=/cib/status/node_state[@id='1']/lrm[@id='1'] complete=true
> > > Jan 18 16:52:19 [21937] node02.example.com    pengine:     info: master_color:  ms_drbd_ourApp: Promoted 0 instances of a possible 1 to master
> > > Jan 18 16:52:19 [21937] node02.example.com    pengine:   notice: LogAction:      * Demote     drbd_ourApp:1     (            Master -> Slave node02.example.com )
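
To check whether a demotion really always follows a DC takeover, the
pattern in the excerpt above can be searched for mechanically. This is a
minimal sketch, assuming unwrapped log lines in the same format as the
excerpt; the 10-second window, the default year (the log lines carry
none), and the function names are assumptions made for illustration, and
parsing the month abbreviation assumes an English locale.

```python
import re
from datetime import datetime, timedelta

# Matches lines shaped like the excerpt, e.g.
# "Jan 18 16:52:17 [21938] node02.example.com crmd: info: update_dc: ..."
LINE_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s+\d\d:\d\d:\d\d)\s+\[\d+\]\s+(?P<host>\S+)\s+"
    r"(?P<daemon>\w+):\s+(?P<level>\w+):\s+(?P<func>\w+):\s*(?P<msg>.*)$"
)


def parse(line, year=2021):
    """Return (timestamp, function, message), or None if the line differs."""
    m = LINE_RE.match(line.strip())
    if not m:
        return None
    ts = datetime.strptime(f"{year} {m['ts']}", "%Y %b %d %H:%M:%S")
    return ts, m["func"], m["msg"]


def demotes_after_dc_takeover(lines, window=timedelta(seconds=10)):
    """Yield (takeover_time, demote_time, message) for each Demote action
    logged within `window` of the most recent do_dc_takeover."""
    last_takeover = None
    for line in lines:
        rec = parse(line)
        if rec is None:
            continue
        ts, func, msg = rec
        if func == "do_dc_takeover":
            last_takeover = ts
        elif func == "LogAction" and "Demote" in msg:
            if last_takeover is not None and ts - last_takeover <= window:
                yield last_takeover, ts, msg.strip()
```

Feeding it an open pacemaker.log file object, for example
list(demotes_after_dc_takeover(open(path))) with path pointing at the
log, lists every takeover that was followed by a demotion. Demotions
that do not appear in the result had no takeover shortly before them.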
> > 
> > _______________________________________________
> > 
> > Manage your subscription:
> > 
> > https://lists.clusterlabs.org/mailman/listinfo/users
> > 
> > 
> > 
> > ClusterLabs home: https://www.clusterlabs.org/
> > 
> 
-------------- next part --------------
A non-text attachment was scrubbed...
Name: drbd_cib_el83.xml
Type: application/xml
Size: 11983 bytes
Desc: not available
URL: <http://lists.clusterlabs.org/pipermail/users/attachments/20210120/14d93d06/attachment-0001.wsdl>