<div dir="ltr">I am running it on CentOS 6.6. I am killing the "pacemakerd" process using kill -9.<div><br></div><div>Hmm, is stonith used for failure detection as well? I thought it was used to disable malfunctioning nodes.</div></div><div class="gmail_extra"><br><div class="gmail_quote">On Fri, Apr 7, 2017 at 7:58 AM, Ken Gaillot <span dir="ltr">&lt;<a href="mailto:kgaillot@redhat.com" target="_blank">kgaillot@redhat.com</a>&gt;</span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div class="HOEnZb"><div class="h5">On 04/05/2017 05:16 PM, neeraj ch wrote:<br>

> Hello All,<br>

><br>

> I noticed something on our pacemaker test cluster. The cluster is<br>

> configured to manage an underlying database using a master/slave primitive.<br>

><br>

> I ran a kill on the pacemaker process, but all the other nodes kept showing<br>

> the node online. I went on to kill the underlying database on the same<br>

> node, which would have been detected had pacemaker on the node been<br>

> online. The cluster did not detect that the database on the node had<br>

> failed, so the failover never occurred.<br>

><br>

> I went on to kill corosync on the same node and the cluster now marked<br>

> the node as stopped and proceeded to elect a new master.<br>

><br>

><br>

> In a separate test, I killed the pacemaker process on the cluster DC, and<br>

> the cluster showed no change. I went on to change the CIB on a different<br>

> node. The CIB modify command timed out. Once that occurred, the node<br>

> didn't fail over even when I turned off corosync on the cluster DC. The<br>

> cluster didn't recover after this mishap.<br>

><br>

> Is this expected behavior? Is there a solution for when the OOM killer<br>

> decides to kill the pacemaker process?<br>

><br>

> I run pacemaker 1.1.14, with corosync 1.4. I have stonith disabled and<br>

> quorum enabled.<br>

><br>

> Thank you,<br>

><br>

> nwarriorch<br>

<br>

</div></div>What exactly are you doing to kill pacemaker? There are multiple<br>

pacemaker processes, and they have different recovery methods.<br>

<br>

Also, what OS/version are you running? If it has systemd, that can play<br>

a role in recovery as well.<br>

<br>

Having stonith disabled is a big part of what you're seeing. When a node<br>

fails, stonith is the only way the rest of the cluster can be sure the<br>

node is unable to cause trouble, so it can recover services elsewhere.<br>

<br>

<br>


</blockquote></div><br></div>