[ClusterLabs] cluster does not detect kill on pacemaker process ?

Ken Gaillot kgaillot at redhat.com
Fri Apr 7 16:58:50 CEST 2017


On 04/05/2017 05:16 PM, neeraj ch wrote:
> Hello All, 
> 
> I noticed something on our pacemaker test cluster. The cluster is
> configured to manage an underlying database using a master/slave
> primitive.
> 
> I ran a kill on the pacemaker process, but all the other nodes kept
> showing the node as online. I then killed the underlying database on the
> same node, which would have been detected had pacemaker on that node
> been running. The cluster did not detect that the database on the node
> had failed, and the failover never occurred.
> 
> I went on to kill corosync on the same node and the cluster now marked
> the node as stopped and proceeded to elect a new master. 
> 
> 
> In a separate test, I killed the pacemaker process on the cluster DC,
> and the cluster showed no change. I then changed the CIB from a
> different node, and the CIB modify command timed out. After that, the
> node didn't fail over even when I turned off corosync on the cluster DC.
> The cluster didn't recover after this mishap.
> 
> Is this expected behavior? Is there a solution for when OOM decides to
> kill the pacemaker process? 
> 
> I run pacemaker 1.1.14, with corosync 1.4. I have stonith disabled and
> quorum enabled. 
> 
> Thank you,
> 
> nwarriorch

What exactly are you doing to kill pacemaker? There are multiple
pacemaker processes, and they have different recovery methods.
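
To see which of those processes are still alive after a kill, one option
is to compare the process list against the daemon names Pacemaker 1.1
uses (pacemakerd, cib, stonithd, lrmd, attrd, pengine, crmd). The
following is only a sketch: the helper names find_pacemaker_daemons and
missing_daemons are made up for illustration, and it assumes a Linux
`ps` that accepts `-o pid= -o comm=`.

```python
import subprocess

# Daemon names used by Pacemaker 1.1.x (they were renamed in 2.0).
PACEMAKER_DAEMONS = {
    "pacemakerd", "cib", "stonithd", "lrmd", "attrd", "pengine", "crmd",
}


def find_pacemaker_daemons(ps_output):
    """Map daemon name -> list of PIDs, given `ps -e -o pid= -o comm=` output."""
    found = {}
    for line in ps_output.splitlines():
        parts = line.split(None, 1)
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        pid, comm = parts[0], parts[1].strip()
        if pid.isdigit() and comm in PACEMAKER_DAEMONS:
            found.setdefault(comm, []).append(int(pid))
    return found


def missing_daemons(found):
    """Return the expected daemons that do not appear in the process list."""
    return sorted(PACEMAKER_DAEMONS - set(found))


if __name__ == "__main__":
    # Hypothetical usage: run on the node where a process was killed.
    out = subprocess.run(
        ["ps", "-e", "-o", "pid=", "-o", "comm="],
        capture_output=True, text=True, check=True,
    ).stdout
    running = find_pacemaker_daemons(out)
    print("running:", running)
    print("missing:", missing_daemons(running))
```

Running this before and after the kill shows whether only one daemon
disappeared or whether pacemakerd itself went away, which matters
because the recovery path differs between the two cases.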

Also, what OS/version are you running? If it has systemd, that can play
a role in recovery as well.
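
On a systemd host, whether a killed pacemakerd gets respawned depends on
the unit's Restart= setting. The sketch below assumes the unit is named
pacemaker.service and reads the setting with `systemctl show`; the
function names are illustrative only.

```python
import subprocess


def parse_restart_policy(show_output):
    """Extract the Restart= value from `systemctl show -p Restart` output."""
    for line in show_output.splitlines():
        key, sep, value = line.partition("=")
        if sep and key.strip() == "Restart":
            return value.strip()
    return None


def respawns_after_sigkill(policy):
    """True if systemd restarts a service whose main process died from an
    unclean signal such as SIGKILL under the given Restart= policy."""
    return policy in {"always", "on-failure", "on-abnormal", "on-abort"}


if __name__ == "__main__":
    # Assumption: the Pacemaker unit is called pacemaker.service.
    out = subprocess.run(
        ["systemctl", "show", "pacemaker.service", "--property=Restart"],
        capture_output=True, text=True, check=True,
    ).stdout
    policy = parse_restart_policy(out)
    print("Restart policy:", policy)
    print("respawn after SIGKILL:", respawns_after_sigkill(policy))
```

Note that systemd treats SIGTERM as a clean exit, so a plain `kill`
and a `kill -9` can lead to different recovery behavior.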

Having stonith disabled is a big part of what you're seeing. When a node
fails, stonith is the only way the rest of the cluster can be sure the
node is unable to cause trouble, so it can recover services elsewhere.
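
To confirm what the cluster currently has configured, the
stonith-enabled property can be queried with crm_attribute. This is a
minimal sketch under the assumption that crm_attribute is on the PATH
and the local cib daemon is reachable; the function names are
hypothetical.

```python
import subprocess

# Strings Pacemaker interprets as boolean true (case-insensitive).
TRUE_VALUES = {"true", "on", "yes", "y", "1"}


def stonith_enabled(value):
    """Interpret a stonith-enabled value.

    An unset property (None or empty) falls back to Pacemaker's
    default, which is enabled.
    """
    if value is None or not value.strip():
        return True
    return value.strip().lower() in TRUE_VALUES


def query_stonith_enabled():
    """Ask the live cluster for stonith-enabled via crm_attribute."""
    result = subprocess.run(
        ["crm_attribute", "--type", "crm_config",
         "--name", "stonith-enabled", "--query", "--quiet"],
        capture_output=True, text=True,
    )
    # Assumption: a non-zero exit status means the property is not set.
    value = result.stdout if result.returncode == 0 else None
    return stonith_enabled(value)


if __name__ == "__main__":
    print("stonith enabled:", query_stonith_enabled())
```

If this reports False, the cluster has no way to fence a misbehaving
node, which is consistent with the stalled recovery described above.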



