<div dir="ltr">I saw Thomas's post from last week and it sounded very similar to what we saw, but I wasn't sure if the heartbeat/corosync difference made this a different issue.  I'm trying to reproduce it and gather the log/config info.<div><br></div><div>Thanks again,</div><div>Chris</div></div><div class="gmail_extra"><br><div class="gmail_quote">On Mon, Aug 3, 2015 at 11:27 AM, emmanuel segura <span dir="ltr"><<a href="mailto:emi2fast@gmail.com" target="_blank">emi2fast@gmail.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">From what I see, he is using heartbeat.<br>

<div><div class="h5"><br>

2015-08-03 17:14 GMT+02:00 Thomas Meagher <<a href="mailto:thomas.meagher@hds.com">thomas.meagher@hds.com</a>>:<br>

><br>

> Sounds similar to the issue I described here last week.  We also had two<br>

> nodes, and lost network connection between the two nodes while one was<br>

> starting up after a fence.  Although we had stonith resources configured,<br>

> those resources were never called, and the cluster was considered active on<br>

> both nodes throughout the network split.  We were able to reproduce this<br>

> issue in our lab; it seems there is a window during corosync startup where<br>

> if a node joins the cluster and then leaves before Pacemaker stonith<br>

> resources have started, it will not be fenced.  This issue may be isolated<br>

> to two-node systems, as normally a single node that is separated from the<br>

> cluster will have lost quorum, which is not the case with two_node.<br>

><br>

> Are you running with "two_node" in corosync.conf?<br>

> Are you running with "wait_for_all"? (It's on by default with "two_node")<br>

><br>

> ________________________________<br>

> From: Chris Walker [<a href="mailto:christopher.walker@gmail.com">christopher.walker@gmail.com</a>]<br>

> Sent: Sunday, August 02, 2015 23:02<br>

> To: <a href="mailto:pacemaker@oss.clusterlabs.org">pacemaker@oss.clusterlabs.org</a><br>

> Subject: [Pacemaker] Node lost early in HA startup --> no STONITH<br>

><br>

> Hello,<br>

><br>

> We recently had an unfortunate sequence on our two-node cluster (nodes n02<br>

> and n03) that can be summarized as:<br>

> 1.  n03 became pathologically busy and was STONITHed by n02<br>

> 2.  The heavy load migrated to n02, which also became pathologically busy<br>

> 3.  n03 was rebooted<br>

> 4.  During the startup of HA on n03, n02 was initially seen by n03:<br>

><br>

> Jul 26 15:23:43 n03 crmd: [143569]: info: crm_update_peer_proc: n02.ais is<br>

> now online<br>

><br>

> 5.  But later during the startup sequence (after DC election and CIB sync)<br>

> we see n02 die (n02 is really wrapped around the axle, many stuck threads,<br>

> etc)<br>

><br>

> Jul 26 15:27:44 n03 heartbeat: [143544]: WARN: node n02: is dead<br>

> ...<br>

> Jul 26 15:27:45 n03 crmd: [143569]: info: ais_status_callback: status: n02<br>

> is now lost (was member)<br>

><br>

> Our deadtime is 240 seconds, so n02 must have become unresponsive almost<br>

> immediately after n03 reported it up at 15:23:43.<br>

><br>

> 6.  The troubling aspect of this incident is that even though there are<br>

> multiple STONITH resources configured for n03, none of them was engaged and<br>

> n03 then mounted filesystems that were also active on n02.<br>

><br>

> I'm wondering whether the fact that no STONITH resources were started by<br>

> this time explains why n02 was not STONITHed.  Shortly after n02 is declared<br>

> dead we see STONITH resources begin starting, e.g.,<br>

><br>

> Jul 26 15:27:47 n03 pengine: [152499]: notice: LogActions: Start<br>

> n03-3-ipmi-stonith (n03)<br>

><br>

> Is it the case that, because there were no active STONITH resources when n02<br>

> was declared dead, no STONITH action was taken against this node?  Is there a<br>

> fix/workaround for this scenario (we're using heartbeat 3.0.5 and pacemaker<br>

> 1.1.6 (RHEL6.2))?<br>

><br>

> Thanks very much!<br>

> Chris<br>

><br>

</div></div>> _______________________________________________<br>

> Pacemaker mailing list: <a href="mailto:Pacemaker@oss.clusterlabs.org">Pacemaker@oss.clusterlabs.org</a><br>

> <a href="http://oss.clusterlabs.org/mailman/listinfo/pacemaker" rel="noreferrer" target="_blank">http://oss.clusterlabs.org/mailman/listinfo/pacemaker</a><br>

<span class="im HOEnZb">><br>

> Project Home: <a href="http://www.clusterlabs.org" rel="noreferrer" target="_blank">http://www.clusterlabs.org</a><br>

> Getting started: <a href="http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf" rel="noreferrer" target="_blank">http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf</a><br>

> Bugs: <a href="http://bugs.clusterlabs.org" rel="noreferrer" target="_blank">http://bugs.clusterlabs.org</a><br>

><br>

<br>

<br>

<br>

</span><span class="HOEnZb"><font color="#888888">--<br>

  .~.<br>

  /V\<br>

 //  \\<br>

/(   )\<br>

^`~'^<br>

<br>


</font></span><div class="HOEnZb"><div class="h5"><br>


</div></div></blockquote></div><br></div>
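<div dir="ltr"><br>For what it's worth, the timing in step 5 of the original report can be sanity-checked with a few lines of Python. This is only a rough sketch: it assumes heartbeat declares a node dead once deadtime has passed with no heartbeat received, that the syslog year is 2015, and that month names parse in an English locale. The helper names (parse, last_heartbeat_estimate) are made up for illustration.<br>
<pre>
```python
from datetime import datetime, timedelta

# Timestamps copied from the n03 log lines quoted in the original report.
SEEN_ONLINE = "Jul 26 15:23:43"    # crm_update_peer_proc: n02.ais is now online
DECLARED_DEAD = "Jul 26 15:27:44"  # heartbeat: node n02: is dead
DEADTIME = timedelta(seconds=240)  # deadtime stated in the report


def parse(stamp, year=2015):
    # Syslog timestamps omit the year; 2015 is an assumption from the thread date.
    return datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")


def last_heartbeat_estimate(declared_dead, deadtime):
    # Assumption: a node is declared dead once deadtime elapses with no
    # heartbeat, so the last heartbeat arrived about deadtime earlier.
    return declared_dead - deadtime


last_seen = last_heartbeat_estimate(parse(DECLARED_DEAD), DEADTIME)
gap = last_seen - parse(SEEN_ONLINE)

# Estimated time of the last heartbeat, and seconds after n02 was seen online.
print(last_seen.strftime("%H:%M:%S"), gap.total_seconds())
```
</pre>
Under those assumptions the last heartbeat from n02 would have arrived at about 15:23:44, one second after n03 logged it as online, which matches the "almost immediately" reading.</div>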