<div dir="ltr">I am running it on centos 6.6. I am killing the "pacemakerd" process using kill -9. <div><br></div><div>hmm, stonith is used for detection as well? I thought it was used to disable malfunctioning nodes. </div></div><div class="gmail_extra"><br><div class="gmail_quote">On Fri, Apr 7, 2017 at 7:58 AM, Ken Gaillot <span dir="ltr"><<a href="mailto:kgaillot@redhat.com" target="_blank">kgaillot@redhat.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div class="HOEnZb"><div class="h5">On 04/05/2017 05:16 PM, neeraj ch wrote:<br>
>> Hello All,
>>
>> I noticed something on our pacemaker test cluster. The cluster is
>> configured to manage an underlying database using a master/slave primitive.
>>
>> I ran a kill on the pacemaker process, and all the other nodes kept
>> showing the node online. I went on to kill the underlying database on the
>> same node, which would have been detected had the pacemaker on the node
>> been online. The cluster did not detect that the database on the node had
>> failed, and the failover never occurred.
>>
>> I went on to kill corosync on the same node, and the cluster then marked
>> the node as stopped and proceeded to elect a new master.
>>
>>
>> In a separate test, I killed the pacemaker process on the cluster DC;
>> the cluster showed no change. I went on to change the CIB on a different
>> node. The CIB modify command timed out. Once that occurred, the node
>> didn't fail over even when I turned off corosync on the cluster DC. The
>> cluster didn't recover after this mishap.
>>
>> Is this expected behavior? Is there a solution for when the OOM killer
>> decides to kill the pacemaker process?
>>
>> I run pacemaker 1.1.14 with corosync 1.4. I have stonith disabled and
>> quorum enabled.
>>
>> Thank you,
>>
>> nwarriorch
>
> What exactly are you doing to kill pacemaker? There are multiple
> pacemaker processes, and they have different recovery methods.
>
> Also, what OS/version are you running? If it has systemd, that can play
> a role in recovery as well.
>
> Having stonith disabled is a big part of what you're seeing. When a node
> fails, stonith is the only way the rest of the cluster can be sure the
> node is unable to cause trouble, so it can recover services elsewhere.
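
On the stonith point: my understanding so far is that corosync membership
changes and the resource monitor operations do the detecting, and stonith
only comes in to shut a misbehaving node out, hence my question above. If
enabling it is the fix, a rough sketch of what that would look like; the
fence_ipmilan agent, the IP, and the credentials below are placeholders
rather than our real hardware:

  # pcs syntax (crmsh equivalents exist); all values below are placeholders
  pcs stonith create fence-node1 fence_ipmilan \
      ipaddr=10.0.0.1 login=admin passwd=secret \
      pcmk_host_list=node1
  pcs property set stonith-enabled=true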
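
And to be precise about what is being killed: only the parent pacemakerd
process gets the SIGKILL in that first step. The sequence looks roughly like
the sketch below (the ps/pidof/crm_mon invocations are illustrative, not an
exact transcript):

  # on the node under test (CentOS 6.6, no systemd)
  ps axf | grep pacemaker        # shows pacemakerd with its children: cib,
                                 # stonithd, lrmd, attrd, pengine, crmd
  kill -9 $(pidof pacemakerd)    # SIGKILL the parent only; the orphaned
                                 # child daemons keep running under init
  # from another node
  crm_mon -1                     # the killed node still shows Online, since
                                 # corosync membership has not changed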