[ClusterLabs] Understanding the behavior of pacemaker crash
Ken Gaillot
kgaillot at redhat.com
Fri Sep 28 09:50:42 EDT 2018
On Fri, 2018-09-28 at 15:26 +0530, Prasad Nagaraj wrote:
> Hi Ken - Only if I turn off corosync on the node [where I crashed
> pacemaker] do the other nodes detect this and mark the node as
> OFFLINE.
> Do you have any other guidance or insights into this?
Yes, corosync is the cluster membership layer -- if corosync is
successfully running, then the node is a member of the cluster.
Pacemaker's crmd provides a higher level of membership; typically, with
corosync but no crmd, the node shows up as "pending" in status. However,
I am not sure how it worked with the old corosync plugin.
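For example (a rough sketch -- corosync-objctl is the corosync 1.x tool
and the exact object keys vary by version, so treat the grep as
illustrative), you can compare the two layers on any node:

# corosync-level membership: who the membership layer thinks is a member
corosync-objctl | grep -i member
# pacemaker-level view: a member running corosync but no crmd typically
# shows up as "pending" rather than online
crm_node -l
crm_mon -1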
>
> Thanks
> Prasad
>
> On Thu, Sep 27, 2018 at 9:33 PM Prasad Nagaraj
> <prasad.nagaraj76 at gmail.com> wrote:
> > Hi Ken - Thanks for the response. Pacemaker is still not running on
> > that node, so I am still wondering what the issue could be. Are there
> > any other configurations or logs I should share to help understand
> > this?
> >
> > Thanks!
> >
> > On Thu, Sep 27, 2018 at 8:08 PM Ken Gaillot <kgaillot at redhat.com>
> > wrote:
> > > On Thu, 2018-09-27 at 13:45 +0530, Prasad Nagaraj wrote:
> > > > Hello - I was trying to understand the behavior of the cluster when
> > > > pacemaker crashes on one of the nodes, so I hard-killed pacemakerd
> > > > and its related processes.
> > > >
> > > > ----------------------------------------------------------------------
> > > > [root at SG-mysqlold-907 azureuser]# ps -ef | grep pacemaker
> > > > root 74022 1 0 07:53 pts/0 00:00:00 pacemakerd
> > > > 189       74028 74022  0 07:53 ?      00:00:00 /usr/libexec/pacemaker/cib
> > > > root      74029 74022  0 07:53 ?      00:00:00 /usr/libexec/pacemaker/stonithd
> > > > root      74030 74022  0 07:53 ?      00:00:00 /usr/libexec/pacemaker/lrmd
> > > > 189       74031 74022  0 07:53 ?      00:00:00 /usr/libexec/pacemaker/attrd
> > > > 189       74032 74022  0 07:53 ?      00:00:00 /usr/libexec/pacemaker/pengine
> > > > 189       74033 74022  0 07:53 ?      00:00:00 /usr/libexec/pacemaker/crmd
> > > >
> > > > root      75228 50092  0 07:54 pts/0  00:00:00 grep pacemaker
> > > > [root at SG-mysqlold-907 azureuser]# kill -9 74022
> > > >
> > > > [root at SG-mysqlold-907 azureuser]# ps -ef | grep pacemaker
> > > > root      74030     1  0 07:53 ?      00:00:00 /usr/libexec/pacemaker/lrmd
> > > > 189       74032     1  0 07:53 ?      00:00:00 /usr/libexec/pacemaker/pengine
> > > >
> > > > root      75303 50092  0 07:55 pts/0  00:00:00 grep pacemaker
> > > > [root at SG-mysqlold-907 azureuser]# kill -9 74030
> > > > [root at SG-mysqlold-907 azureuser]# kill -9 74032
> > > > [root at SG-mysqlold-907 azureuser]# ps -ef | grep pacemaker
> > > > root      75332 50092  0 07:55 pts/0  00:00:00 grep pacemaker
> > > >
> > > > [root at SG-mysqlold-907 azureuser]# crm status
> > > > ERROR: status: crm_mon (rc=107): Connection to cluster failed:
> > > > Transport endpoint is not connected
> > > > ----------------------------------------------------------------------
> > > >
> > > > However, this does not seem to be having any effect on the
> > > cluster
> > > > status from other nodes
> > > > ----------------------------------------------------------------------
> > > >
> > > > [root at SG-mysqlold-909 azureuser]# crm status
> > > > Last updated: Thu Sep 27 07:56:17 2018
> > > > Last change: Thu Sep 27 07:53:43 2018 by root via crm_attribute on SG-mysqlold-909
> > > > Stack: classic openais (with plugin)
> > > > Current DC: SG-mysqlold-908 (version 1.1.14-8.el6_8.1-70404b0) - partition with quorum
> > > > 3 nodes and 3 resources configured, 3 expected votes
> > > >
> > > > Online: [ SG-mysqlold-907 SG-mysqlold-908 SG-mysqlold-909 ]
> > >
> > > It most definitely would make the node offline, and if fencing were
> > > configured, the rest of the cluster would fence the node to make sure
> > > it's safely down.
> > >
> > > I see you're using the old corosync 1 plugin. I suspect what happened
> > > in this case is that corosync noticed the plugin died and restarted it
> > > quickly enough that it had rejoined by the time you checked the status
> > > elsewhere.
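One way to test that theory on SG-mysqlold-907 (again just a sketch; the
log location depends on your corosync logging directive, and
/var/log/cluster/corosync.log is only a common default):

# PIDs different from the ones you killed would indicate a respawn
ps -ef | grep pacemaker
grep -iE "pacemaker|respawn" /var/log/cluster/corosync.log | tail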
> > >
> > > >
> > > > Full list of resources:
> > > >
> > > > Master/Slave Set: ms_mysql [p_mysql]
> > > > Masters: [ SG-mysqlold-909 ]
> > > > Slaves: [ SG-mysqlold-907 SG-mysqlold-908 ]
> > > >
> > > >
> > > > [root at SG-mysqlold-908 azureuser]# crm status
> > > > Last updated: Thu Sep 27 07:56:08 2018
> > > > Last change: Thu Sep 27 07:53:43 2018 by root via crm_attribute on SG-mysqlold-909
> > > > Stack: classic openais (with plugin)
> > > > Current DC: SG-mysqlold-908 (version 1.1.14-8.el6_8.1-70404b0) - partition with quorum
> > > > 3 nodes and 3 resources configured, 3 expected votes
> > > >
> > > > Online: [ SG-mysqlold-907 SG-mysqlold-908 SG-mysqlold-909 ]
> > > >
> > > > Full list of resources:
> > > >
> > > > Master/Slave Set: ms_mysql [p_mysql]
> > > > Masters: [ SG-mysqlold-909 ]
> > > > Slaves: [ SG-mysqlold-907 SG-mysqlold-908 ]
> > > >
> > > > ----------------------------------------------------------------------
> > > >
> > > > I am a bit surprised that the other nodes are not able to detect
> > > > that pacemaker is down on one of the nodes - SG-mysqlold-907.
> > > >
> > > > Even if I kill pacemaker on the node which is the DC, I observe the
> > > > same behavior, with the rest of the nodes not detecting that the DC
> > > > is down.
> > > >
> > > > Could someone explain what the expected behavior is in these cases?
> > > >
> > > > I am using corosync 1.4.7 and pacemaker 1.1.14
> > > >
> > > > Thanks in advance
> > > > Prasad
> > > >
--
Ken Gaillot <kgaillot at redhat.com>