[ClusterLabs] Antw: Re: Antw: [EXT] DRBD ms resource keeps getting demoted

Ulrich Windl Ulrich.Windl at rz.uni-regensburg.de
Thu Jan 21 02:23:05 EST 2021


>>> Stuart Massey <stuart.e.massey at gmail.com> wrote on 20.01.2021 at 03:41 in
message <CAJfrB75UPUmZJPjXCoACRDGoG-BqDcJHff5c_OmVBFya53D-dw at mail.gmail.com>:
> Strahil,
> That is very kind of you, thanks.
> I see that in your (feature set 3.4.1) cib, drbd is in a <clone> with some
> meta_attributes and operations having to do with promotion, while in our
> (feature set 3.0.14) cib, drbd is in a <master>, which does not have those
> (maybe since promotion is implicit).
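Indeed, newer Pacemaker dropped the separate <master> element; the same thing is
expressed as a <clone> carrying promotable meta attributes. Very roughly, the pcs
commands that produce the two forms look like this (the resource names are only
placeholders, not taken from either cib):

    # CentOS 7 / pcs 0.9.x: wraps an existing primitive in a <master> element
    pcs resource master ms_drbd_r0 drbd_r0 \
        master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 notify=true

    # CentOS 8 / pcs 0.10.x: wraps it in a promotable <clone> instead
    pcs resource promotable drbd_r0 \
        promoted-max=1 promoted-node-max=1 clone-max=2 clone-node-max=1 notify=true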
> Our cluster has been working quite well for some time, too. I wonder what
> would happen if you could hang the OS on one of your nodes? If a VM, maybe

Unless some other fencing mechanism (like a watchdog timeout) kicks in, the
monitor operation is the only thing that can detect a problem (from the
cluster's view): the monitor operation would time out. The cluster would then
try to restart the resource (stop, then start). If the stop also times out, the
node will be fenced.
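
What governs that sequence are the per-operation timeouts and on-fail settings.
Purely as an illustration (the values here are made up; drbd_ourApp is the
primitive name that shows up in the logs further down):

    # illustrative values only; a promotable/ms resource normally carries two
    # monitor operations, one per role, with different intervals
    pcs resource update drbd_ourApp \
        op monitor interval=20s role=Master timeout=30s \
        op monitor interval=30s role=Slave timeout=30s \
        op stop timeout=100s on-fail=fence \
        op start timeout=240s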

> the constrained secondary could be starved by setting disk IOPS to
> something really low. Of course, you are using different versions of just
> about everything, as we're on CentOS 7.
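
(If the node being tested is a libvirt VM, something along these lines could
throttle its disk I/O; the domain name "node1-test" and disk target "vda" are
made up:)

    # limit the VM's disk to ~50 IOPS while it is running
    virsh blkdeviotune node1-test vda --total-iops-sec 50 --live
    # set the limit back to 0 (unlimited) afterwards
    virsh blkdeviotune node1-test vda --total-iops-sec 0 --live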
> Regards,
> Stuart
> 
> On Tue, Jan 19, 2021 at 6:20 PM Strahil Nikolov <hunter86_bg at yahoo.com>
> wrote:
> 
>> I have just built a test cluster (centOS 8.3) for testing DRBD and it
>> works quite fine.
>> Actually I followed my notes from
>> https://forums.centos.org/viewtopic.php?t=65539 with the exception of
>> point 8 due to the "promotable" stuff.
>>
>> I'm attaching the output of 'pcs cluster cib file' and I hope it helps you
>> fix your issue.
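
(In case it helps when comparing: 'pcs cluster cib <filename>' writes the live
cib to that file, and crm_verify can sanity-check it, e.g.:)

    pcs cluster cib cluster.xml            # dump the running cib to cluster.xml
    crm_verify --xml-file cluster.xml -V   # check that file for configuration errors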
>>
>> Best Regards,
>> Strahil Nikolov
>>
>>
>> At 09:32 -0500 on 19.01.2021 (Tue), Stuart Massey wrote:
>>
>> Ulrich,
>> Thank you for that observation. We share that concern.
>> We have four 1G NICs active, bonded in pairs. One bonded pair serves the
>> "public" (to the intranet) IPs, and the other bonded pair is private to the
>> cluster, used for drbd replication. HA will, I hope, be using the "public"
>> IP, since that is the route to the IP addresses resolved for the host
>> names; that will certainly be the only route to the quorum device. I can
>> say that this cluster has run reasonably well for quite some time with this
>> configuration prior to the recently developed hardware issues on one of the
>> nodes.
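
(One quick way to confirm that split is to compare what corosync and DRBD
actually bind to; "r0" below is just a placeholder resource name:)

    corosync-cfgtool -s              # ring status and the address each ring uses
    drbdadm dump r0 | grep address   # the replication endpoints for the resource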
>> Regards,
>> Stuart
>>
>> On Tue, Jan 19, 2021 at 2:49 AM Ulrich Windl <
>> Ulrich.Windl at rz.uni-regensburg.de> wrote:
>>
>> >>> Stuart Massey <djangoschef at gmail.com> wrote on 19.01.2021 at 04:46 in
>> message <CABQ68NQuTyYXcYgwcUpg5TxxaJjwhSp+c6GCOKfOwGyRQSAAjQ at mail.gmail.com>:
>> > So, we have a 2-node cluster with a quorum device. One of the nodes (node1)
>> > is having some trouble, so we have added constraints to prevent any
>> > resources migrating to it, but have not put it in standby, so that drbd in
>> > secondary on that node stays in sync. The problems it is having lead to OS
>> > lockups that eventually resolve themselves - but that causes it to be
>> > temporarily dropped from the cluster by the current master (node2).
>> > Sometimes when node1 rejoins, node2 will demote the drbd ms resource.
>> > That causes all resources that depend on it to be stopped, leading to a
>> > service outage. They are then restarted on node2, since they can't run on
>> > node1 (due to constraints).
>> > We are having a hard time understanding why this happens. It seems like
>> > there may be some sort of DC contention happening. Does anyone have any
>> > idea how we might prevent this from happening?
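
(For reference, constraints of roughly this shape would keep everything except
the DRBD secondary off the ailing node; "ourApp_group" and "node01.example.com"
are stand-ins here, only ms_drbd_ourApp appears in the logs below:)

    # keep the dependent resources off node1 entirely
    pcs constraint location ourApp_group avoids node01.example.com
    # keep the Master role off node1 while still allowing the secondary to run
    pcs constraint location ms_drbd_ourApp rule role=master score=-INFINITY \
        '#uname' eq node01.example.com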
>>
>> I think if you are routing high-volume DRBD traffic through "the same
>> pipe" as the cluster communication, cluster communication may fail if the
>> pipe is saturated.
>> I'm not happy with that, but it seems to be that way.
>>
>> Maybe running a combination of iftop and iotop could help you understand
>> what's going on...
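
For example ("bond1" is just a stand-in for whichever bond carries the
replication traffic):

    iftop -nP -i bond1   # per-connection bandwidth on that bond, ports shown, no DNS lookups
    iotop -o -d 5        # only processes actually doing I/O, refreshed every 5 seconds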
>>
>> Regards,
>> Ulrich
>>
>> > Selected messages (de-identified) from pacemaker.log that illustrate the
>> > suspicion of DC confusion are below. The update_dc and
>> > abort_transition_graph (deletion of lrm) messages always seem to precede
>> > the demotion, and a demotion seems to always follow (when not already
>> > demoted).
>> >
>> > Jan 18 16:52:17 [21938] node02.example.com       crmd:     info: do_dc_takeover:        Taking over DC status for this partition
>> > Jan 18 16:52:17 [21938] node02.example.com       crmd:     info: update_dc:     Set DC to node02.example.com (3.0.14)
>> > Jan 18 16:52:17 [21938] node02.example.com       crmd:     info: abort_transition_graph:        Transition aborted by deletion of lrm[@id='1']: Resource state removal | cib=0.89.327 source=abort_unless_down:357 path=/cib/status/node_state[@id='1']/lrm[@id='1'] complete=true
>> > Jan 18 16:52:19 [21937] node02.example.com    pengine:     info: master_color:  ms_drbd_ourApp: Promoted 0 instances of a possible 1 to master
>> > Jan 18 16:52:19 [21937] node02.example.com    pengine:   notice: LogAction:     * Demote     drbd_ourApp:1     ( Master -> Slave node02.example.com )
>>
>>
>>
>>
>> _______________________________________________
>> Manage your subscription:
>> https://lists.clusterlabs.org/mailman/listinfo/users 
>>
>> ClusterLabs home: https://www.clusterlabs.org/ 
>>





More information about the Users mailing list