<div dir="ltr">Can you share the cluster configuration (e.g., `pcs config` or the CIB)? And are there any additional LogAction messages after that one (e.g., Promote for node01)?<br></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Mon, Jan 18, 2021 at 7:47 PM Stuart Massey <<a href="mailto:djangoschef@gmail.com">djangoschef@gmail.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div dir="ltr">So, we have a 2-node cluster with a quorum device. One of the nodes (node1) is having some trouble, so we have added constraints to prevent any resources migrating to it, but have not put it in standby, so that drbd in secondary on that node stays in sync. The problems it is having lead to OS lockups that eventually resolve themselves - but that causes it to be temporarily dropped from the cluster by the current master (node2). <div>Sometimes when node1 rejoins, then node2 will demote the drbd ms resource. That causes all resources that depend on it to be stopped, leading to a service outage. They are then restarted on node2, since they can't run on node1 (due to constraints).</div><div>We are having a hard time understanding why this happens. It seems like there may be some sort of DC contention happening. Does anyone have any idea how we might prevent this from happening?</div><div>Selected messages (de-identified) from pacemaker.log that illustrate suspicion re DC confusion are below. 
The update_dc and abort_transition_graph re deletion of lrm seem to always precede the demotion, and a demotion seems to always follow (when not already demoted).</div><div><br></div><div><div>Jan 18 16:52:17 [21938] <a href="http://node02.example.com" target="_blank">node02.example.com</a> crmd: info: do_dc_takeover: Taking over DC status for this partition</div><div>Jan 18 16:52:17 [21938] <a href="http://node02.example.com" target="_blank">node02.example.com</a> crmd: info: update_dc: Set DC to <a href="http://node02.example.com" target="_blank">node02.example.com</a> (3.0.14)</div><div>Jan 18 16:52:17 [21938] <a href="http://node02.example.com" target="_blank">node02.example.com</a> crmd: info: abort_transition_graph: Transition aborted by deletion of lrm[@id='1']: Resource state removal | cib=0.89.327 source=abort_unless_down:357 path=/cib/status/node_state[@id='1']/lrm[@id='1'] complete=true</div><div>Jan 18 16:52:19 [21937] <a href="http://node02.example.com" target="_blank">node02.example.com</a> pengine: info: master_color: ms_drbd_ourApp: Promoted 0 instances of a possible 1 to master</div><div>Jan 18 16:52:19 [21937] <a href="http://node02.example.com" target="_blank">node02.example.com</a> pengine: notice: LogAction: * Demote drbd_ourApp:1 ( Master -> Slave <a href="http://node02.example.com" target="_blank">node02.example.com</a> ) </div></div><div><br></div></div></div>
</blockquote></div><br clear="all"><br>-- <br><div dir="ltr" class="gmail_signature"><div dir="ltr"><div><div dir="ltr"><div><div dir="ltr"><div><div dir="ltr"><div><div dir="ltr"><div><div dir="ltr"><div><div dir="ltr"><div><div>Regards,<br><br></div>Reid Wahl, RHCA<br></div><div>Senior Software Maintenance Engineer, Red Hat<br></div>CEE - Platform Support Delivery - ClusterHA</div></div></div></div></div></div></div></div></div></div></div></div></div></div>