[ClusterLabs] cluster does not detect kill on pacemaker process ?
Ken Gaillot
kgaillot at redhat.com
Fri Apr 7 22:15:55 EDT 2017
On 04/07/2017 05:20 PM, neeraj ch wrote:
> I am running it on centos 6.6. I am killing the "pacemakerd" process
> using kill -9.
pacemakerd is a supervisor process that watches the other processes, and
respawns them if they die. It is not really responsible for anything in
the cluster directly. So killing it does not disrupt the cluster in any
way; it just prevents automatic recovery if one of the other daemons dies.
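For example, on a throwaway test node (rough sketch only; the child daemon
names below are the 1.1-series ones, and you need the cluster CLI tools
installed to run crm_mon):

    # kill only the supervisor
    kill -9 $(pgrep -x pacemakerd)
    # the child daemons keep running ...
    pgrep -l -f 'cib|crmd|lrmd|stonithd|pengine|attrd'
    # ... and the cluster still reports the node online
    crm_mon -1

The only thing you lose is the respawn-on-failure protection for those
children until pacemakerd is started again.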
When systemd is in use, systemd will restart pacemakerd if it dies, but
CentOS 6 does not have systemd (CentOS 7 does).
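If you are on a systemd-based release, a quick way to check (assuming the
packaged unit file is in use):

    # CentOS 7 / RHEL 7 and later; not applicable to CentOS 6
    systemctl show pacemaker.service --property=Restart

A Restart= value other than "no" is what gives you automatic respawn of
pacemakerd itself; on CentOS 6 nothing restarts it automatically unless you
add your own supervision.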
> hmm, stonith is used for detection as well? I thought it was used to
> disable malfunctioning nodes.
If you kill pacemakerd, that doesn't cause any harm to the cluster, so
that would not involve stonith.
If you kill crmd or corosync instead, that would cause the node to leave
the cluster -- it would be considered a malfunctioning node. The rest of
the cluster would then use stonith to disable that node, so it could
safely recover its services elsewhere.
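A rough way to watch that from a surviving node (illustrative only; it
assumes stonith is actually enabled and configured, unlike the setup you
described, and <nodename> is a placeholder for the node you killed):

    # on the node under test: kill the membership layer
    kill -9 $(pidof corosync)
    # on another node: the killed node shows up as UNCLEAN/offline, then fenced
    crm_mon -1
    # fencing history for that node (if your stonith_admin supports it)
    stonith_admin --history <nodename>

Without stonith, the rest of the cluster has no safe way to be sure the node
is really down before moving its resources.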
> On Fri, Apr 7, 2017 at 7:58 AM, Ken Gaillot <kgaillot at redhat.com> wrote:
>
> On 04/05/2017 05:16 PM, neeraj ch wrote:
> > Hello All,
> >
> > I noticed something on our pacemaker test cluster. The cluster is
> > configured to manage an underlying database using master slave
> > primitive.
> >
> > I ran a kill on the pacemaker process, all the other nodes kept
> > showing
> > the node online. I went on to kill the underlying database on the same
> > node which would have been detected had the pacemaker on the node been
> > online. The cluster did not detect that the database on the node has
> > failed, the failover never occurred.
> >
> > I went on to kill corosync on the same node and the cluster now marked
> > the node as stopped and proceeded to elect a new master.
> >
> >
> > In a separate test. I killed the pacemaker process on the cluster DC,
> > the cluster showed no change. I went on to change CIB on a different
> > node. The CIB modify command timed out. Once that occurred, the node
> > didn't failover even when I turned off corosync on cluster DC. The
> > cluster didn't recover after this mishap.
> >
> > Is this expected behavior? Is there a solution for when OOM decides to
> > kill the pacemaker process?
> >
> > I run pacemaker 1.1.14, with corosync 1.4. I have stonith disabled and
> > quorum enabled.
> >
> > Thank you,
> >
> > nwarriorch
>
> What exactly are you doing to kill pacemaker? There are multiple
> pacemaker processes, and they have different recovery methods.
>
> Also, what OS/version are you running? If it has systemd, that can play
> a role in recovery as well.
>
> Having stonith disabled is a big part of what you're seeing. When a node
> fails, stonith is the only way the rest of the cluster can be sure the
> node is unable to cause trouble, so it can recover services elsewhere.