<div dir="ltr">Jan, thank you for the answer. I still have a few questions.<div><br><div>Also, FYI, this happened on a cluster with just one node.</div><div>The cluster is designed to be a two-node cluster, with the possibility to work even with only one node.<br><div><br></div><div>&gt;&gt; <span style="font-size:12.8px">and if corosync was not scheduled for more than token timeout &quot;Process pause detected for ...&quot; message is displayed and new membership is formed.</span></div><div><span style="font-size:12.8px">That means the other nodes will call STONITH to fence that node, right? </span></div><div><span style="font-size:12.8px">However, it didn&#39;t happen that way, because there was no other &quot;online&quot; node in the cluster.</span></div><div><span style="font-size:12.8px">And because I didn&#39;t configure my fencing device to accept a node-list, it was called by the cluster.</span></div><div><span style="font-size:12.8px">And the call was successful because of the logic that the other node not being connected to the physical STONITH device equals a &quot;successful STONITH&quot;.</span></div><div><span style="font-size:12.8px">And then Pacemaker&#39;s logic worked out to shut itself down (&quot;crit:&quot; message in the log).</span></div><div><span style="font-size:12.8px"><br></span></div><div><span style="font-size:12.8px">&gt;&gt; </span><span style="font-size:12.8px">There is really no help. 
It&#39;s best to make sure corosync is scheduled regularly.</span></div><div><span style="font-size:12.8px">This may sound silly, but how can I do that?</span></div><div><br></div><div>I ran:</div><div># grep -n --color &quot;Process pause detected for\|Corosync main process was not scheduled for\|invoked oom-killer\|kernel: Out of memory: Kill process\|kernel: Killed process\|RA:\|error:\|fence\|STONITH&quot; B5-2U-205-LS.log &gt;B5-2U-205-LS.log.cluster_excerpt<br></div><div><br></div><div>and attached the result to this message.</div><div><br></div><div><br></div><div>I see a complete disaster in the syslog.</div><div>Correct me if I am wrong, but here I will try to analyze what happened to the cluster:</div><div><br></div><div>At some point, when the system had already been running under really high load for about 16 hours, Corosync started to report that &quot;Corosync main process was not scheduled for ...&quot;.</div><div>This means that Corosync wasn&#39;t scheduled by the OS often enough, so it couldn&#39;t detect membership changes (token timeout).  
</div><div>Then, after a few such messages, which appeared almost back to back, the monitor operation of a few resources failed.</div><div>Question: in the log, entries from the Resource Agent are shown first, then &quot;lrmd&quot; reports a timeout problem, like this:</div><div><br></div><div>Jan 29 07:00:19 B5-2U-205-LS diskHelper(sm0dh)[18835]: WARNING: RA: [monitor] : got rc=1<br></div><div><div>Jan 29 07:00:32 B5-2U-205-LS lrmd[3012]: notice: operation_finished: sm0dh_monitor_30000:18803:stderr [ Failed to get properties: Connection timed out ]</div></div><div><br></div><div>Does it mean that the monitor failed because of a timeout?</div><div><br></div><div>Two minutes later Corosync started to report another message: &quot;Process pause detected for ...&quot;</div><div>Which means that C<span style="font-size:12.8px">orosync was not scheduled for more than a token timeout.</span></div><div><span style="font-size:12.8px">Then, five minutes later, I have this line in the log:</span></div><div><span style="font-size:12.8px"><br></span></div><div><span style="font-size:12.8px">Jan 29 07:05:54 B5-2U-205-LS kernel: stonithd invoked oom-killer: gfp_mask=0x201da, order=0, oom_score_adj=0</span><br></div><div><span style="font-size:12.8px"><br></span></div></div><div><span style="font-size:12.8px">Which I assume means &quot;stonithd tried to allocate some memory, and the kernel decided to run oom-killer, because there was not enough available memory&quot;. Am I right here?</span></div><div><span style="font-size:12.8px">Why was stonithd active at that time? </span></div><div><span style="font-size:12.8px">Was it because of the </span>&quot;Process pause detected for ...&quot; message from Corosync? </div><div>What was stonithd actually trying to do?</div><div><br></div><div>Then oom-killer kills one of the heavy processes.</div><div>Then systemd-journal requests memory (?) 
and another heavy resource goes down, killed by oom-killer.</div><div><br></div><div>Both killed resources were under Pacemaker&#39;s control.</div><div>Other processes managed by Pacemaker report monitor timeouts (?).</div><div>Then one of them times out on the &quot;stop&quot; operation, so Pacemaker requests that the node be STONITHed.</div><div>There is only one node in the cluster, and the only running resource is not properly configured (it doesn&#39;t have &quot;node-list&quot;) to kill the node on which it runs.</div><div>And because there is no other node physically connected to the fencing device, STONITH reports success.</div><div>Pacemaker&#39;s internal check kicks in (thank you guys!) and Pacemaker shuts itself down.</div><div><br></div><div><br></div><div>Please correct me if I am wrong in this log analysis. I just want to level up my understanding of what is happening here.</div><div>As a general question, did all of this happen because of the following:</div><div><br></div><div>For some reason Corosync started to experience a lack of processor time (scheduling).</div><div>That is why monitor operations started to time out.</div><div>Then, after the &quot;Process pause detected for ...&quot; message, I assume the node should be STONITHed by the other node, but there is no other node, so what should happen in that case?</div><div>Then, for some reason, &quot;stonithd&quot; triggered &quot;oom-killer&quot;, which killed one of the managed resources. 
Why?</div><div>Monitor operations time out for all resources continuously.</div><div>Eventually the &quot;stop&quot; operation times out for one of the resources, which is why Pacemaker shuts down.</div><div><br></div><div>Please correct me in case I am wrong anywhere in my assumptions.</div><div><br></div><div>Thank you for spending your precious time reading all this =)</div><div>Hoping for some help here =)</div><div><span style="font-size:12.8px"><br></span></div></div></div><div class="gmail_extra"><br clear="all"><div><div class="gmail_signature"><div dir="ltr"><div><div dir="ltr">Thank you,<div>Kostia</div></div></div></div></div></div>

<br><div class="gmail_quote">On Wed, Feb 17, 2016 at 6:47 PM, Jan Friesse <span dir="ltr">&lt;<a href="mailto:jfriesse@redhat.com" target="_blank">jfriesse@redhat.com</a>&gt;</span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Kostiantyn Ponomarenko wrote:<span class=""><br>

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

Thank you for the suggestion.<br>

The OS is Debian 8. All packages are built by myself.<br>

libqb-0.17.2<br>

corosync-2.3.5<br>

cluster-glue-1.0.12<br>

pacemaker-1.1.13<br>

<br>

It is really important for me to understand what is happening with the<br>

cluster under the high load.<br>

</blockquote>

<br></span>

For Corosync it&#39;s really simple. Corosync has to be scheduled by the OS regularly (more often than its current token timeout) to be able to detect membership changes and send/receive messages (cpg). If it&#39;s not scheduled, membership is not up to date, and when it&#39;s finally scheduled, it logs a &quot;process was not scheduled for ... ms&quot; message (a warning for the user), and if corosync was not scheduled for more than token timeout &quot;Process pause detected for ...&quot; message is displayed and new membership is formed. Other nodes (if scheduled regularly) see the irregularly scheduled node as dead.<span class=""><br>

<br>

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

So I would appreciate any help here =)<br>

</blockquote>

<br></span>

There is really no help. It&#39;s best to make sure corosync is scheduled regularly.<span class=""><br>

<br>

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

<br>

<br>

Thank you,<br>

Kostia<br>

<br>

On Wed, Feb 17, 2016 at 5:02 PM, Greg Woods &lt;<a href="mailto:woods@ucar.edu" target="_blank">woods@ucar.edu</a>&gt; wrote:<br>

<br>

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

<br>

On Wed, Feb 17, 2016 at 3:30 AM, Kostiantyn Ponomarenko &lt;<br>

<a href="mailto:konstantin.ponomarenko@gmail.com" target="_blank">konstantin.ponomarenko@gmail.com</a>&gt; wrote:<br>

<br>

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

Jan 29 07:00:43 B5-2U-205-LS corosync[2742]: [MAIN  ] Corosync main<br>

process was not scheduled for 12483.7363 ms (threshold is 800.0000 ms).<br>

Consider token timeout increase.<br>

</blockquote>

<br>

<br>

I was having this problem as well. You don&#39;t say which version of corosync<br>

you are running or on what OS, but on CentOS 7, there is an available<br>

</blockquote></blockquote>

<br></span>

This update sets round-robin realtime scheduling for corosync by default. The same can be achieved without the update by editing /etc/sysconfig/corosync and changing the COROSYNC_OPTIONS line to something like COROSYNC_OPTIONS=&quot;-r&quot;<br>

<br>

Regards,<br>

  Honza<div class="HOEnZb"><div class="h5"><br>

<br>

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

update that looks like it might address this (it has to do with<br>

scheduling). We haven&#39;t gotten around to actually applying it yet because<br>

it will require some down time on production services (we do have a few<br>

node-locked VMs in our cluster), and it only happens when the system is<br>

under very high load, so I can&#39;t say for sure the update will fix the<br>

issue, but it might be worth looking into.<br>

<br>

--Greg<br>

<br>

<br>

_______________________________________________<br>

Users mailing list: <a href="mailto:Users@clusterlabs.org" target="_blank">Users@clusterlabs.org</a><br>

<a href="http://clusterlabs.org/mailman/listinfo/users" rel="noreferrer" target="_blank">http://clusterlabs.org/mailman/listinfo/users</a><br>

<br>

Project Home: <a href="http://www.clusterlabs.org" rel="noreferrer" target="_blank">http://www.clusterlabs.org</a><br>

Getting started: <a href="http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf" rel="noreferrer" target="_blank">http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf</a><br>

Bugs: <a href="http://bugs.clusterlabs.org" rel="noreferrer" target="_blank">http://bugs.clusterlabs.org</a><br>

<br>

<br>

</blockquote>

<br>

<br>

<br>

<br>

</blockquote>

<br>

<br>

</div></div></blockquote></div><br></div>