[ClusterLabs] cluster does not detect kill on pacemaker process ?
neeraj ch
nwarriorch at gmail.com
Sat Apr 8 00:20:38 CEST 2017
I am running it on CentOS 6.6. I am killing the "pacemakerd" process using
kill -9.
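
For reference, this is roughly what I run on the node (the pid lookup via
pidof is just illustrative, it may as well be ps piped through grep):

    # find the main pacemakerd process and kill it without letting it
    # clean up, to simulate an OOM kill
    pidof pacemakerd
    kill -9 $(pidof pacemakerd)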
Hmm, is stonith used for failure detection as well? I thought it was only
used to disable (fence) malfunctioning nodes.
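
For completeness, this is roughly how the relevant properties are set on the
test cluster (reconstructed from memory, not a verbatim dump; the
no-quorum-policy=stop value is my assumption of what "quorum enabled" means
here, and I am assuming pcs is available rather than the crm shell):

    # show the current cluster properties
    pcs property
    # the two settings mentioned below in the thread
    pcs property set stonith-enabled=false
    pcs property set no-quorum-policy=stop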
On Fri, Apr 7, 2017 at 7:58 AM, Ken Gaillot <kgaillot at redhat.com> wrote:
> On 04/05/2017 05:16 PM, neeraj ch wrote:
> > Hello All,
> >
> > I noticed something on our pacemaker test cluster. The cluster is
> > configured to manage an underlying database using a master/slave primitive.
> >
> > I ran a kill on the pacemaker process, and all the other nodes kept
> > showing the node as online. I went on to kill the underlying database on
> > the same node, which would have been detected had pacemaker on that node
> > been running. The cluster did not detect that the database on the node
> > had failed, and the failover never occurred.
> >
> > I went on to kill corosync on the same node, and the cluster then marked
> > the node as stopped and proceeded to elect a new master.
> >
> >
> > In a separate test, I killed the pacemaker process on the cluster DC, and
> > the cluster showed no change. I went on to modify the CIB from a different
> > node, and the CIB modify command timed out. Once that occurred, the node
> > didn't fail over even when I turned off corosync on the cluster DC. The
> > cluster didn't recover after this mishap.
> >
> > Is this expected behavior? Is there a solution for when the OOM killer
> > decides to kill the pacemaker process?
> >
> > I run Pacemaker 1.1.14 with Corosync 1.4. I have stonith disabled and
> > quorum enabled.
> >
> > Thank you,
> >
> > nwarriorch
>
> What exactly are you doing to kill pacemaker? There are multiple
> pacemaker processes, and they have different recovery methods.
>
> Also, what OS/version are you running? If it has systemd, that can play
> a role in recovery as well.
>
> Having stonith disabled is a big part of what you're seeing. When a node
> fails, stonith is the only way the rest of the cluster can be sure the
> node is unable to cause trouble, so it can recover services elsewhere.
>
>
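
On the question above about which process: on this node the pacemaker-related
processes look roughly like this (daemon names as in a Pacemaker 1.1 install;
the exact list may differ):

    # list the pacemaker daemons running on the node
    ps -e -o pid,comm | egrep 'pacemakerd|cib|stonithd|lrmd|attrd|pengine|crmd'

Since CentOS 6 has no systemd, my understanding is that nothing restarts
pacemakerd itself once it is killed, even though pacemakerd normally respawns
its child daemons.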
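
And on the stonith point: if fencing is what makes this kind of failure
recoverable, I assume the minimal change would be something like the
following (fence_ipmilan, the node name, address, and credentials are only
placeholders; the right agent depends on our hardware):

    # configure a fencing device and turn stonith back on
    pcs stonith create fence-node1 fence_ipmilan \
        pcmk_host_list="node1" ipaddr="192.0.2.10" login="admin" passwd="secret"
    pcs property set stonith-enabled=true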