[ClusterLabs] Understanding the behavior of pacemaker crash
Ken Gaillot
kgaillot at redhat.com
Fri Sep 28 09:50:42 EDT 2018
On Fri, 2018-09-28 at 15:26 +0530, Prasad Nagaraj wrote:
> Hi Ken - Only if I turn off corosync on the node [where I crashed
> pacemaker] do the other nodes detect this and mark the node as
> OFFLINE.
> Do you have any other guidance or insights into this?
Yes, corosync is the cluster membership layer -- if corosync is
successfully running, then the node is a member of the cluster.
Pacemaker's crmd provides a higher level of membership; typically, with
corosync but no crmd, the node shows up as "pending" in status. However,
I am not sure how it worked with the old corosync plugin.
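For example (a rough sketch -- corosync-objctl is the corosync 1.x tool
and the exact object keys vary by version, so treat the grep as
illustrative), you can compare the two layers on any node:

# corosync-level membership: who the membership layer thinks is a member
corosync-objctl | grep -i member
# pacemaker-level view: a member running corosync but no crmd typically
# shows up as "pending" rather than online
crm_node -l
crm_mon -1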
>
> Thanks
> Prasad
>
> On Thu, Sep 27, 2018 at 9:33 PM Prasad Nagaraj
> <prasad.nagaraj76 at gmail.com> wrote:
> > Hi Ken - Thanks for the response. Pacemaker is still not running on
> > that node, so I am still wondering what the issue could be. Are there
> > any other configurations or logs I should share to help understand
> > this?
> >
> > Thanks!
> >
> > On Thu, Sep 27, 2018 at 8:08 PM Ken Gaillot <kgaillot at redhat.com>
> > wrote:
> > > On Thu, 2018-09-27 at 13:45 +0530, Prasad Nagaraj wrote:
> > > > Hello - I was trying to understand the behavior of the cluster when
> > > > pacemaker crashes on one of the nodes, so I hard-killed pacemakerd
> > > > and its related processes.
> > > >
> > > > ----------------------------------------------------------------------
> > > > [root at SG-mysqlold-907 azureuser]# ps -ef | grep pacemaker
> > > > root 74022 1 0 07:53 pts/0 00:00:00 pacemakerd
> > > > 189       74028 74022  0 07:53 ?      00:00:00 /usr/libexec/pacemaker/cib
> > > > root      74029 74022  0 07:53 ?      00:00:00 /usr/libexec/pacemaker/stonithd
> > > > root      74030 74022  0 07:53 ?      00:00:00 /usr/libexec/pacemaker/lrmd
> > > > 189       74031 74022  0 07:53 ?      00:00:00 /usr/libexec/pacemaker/attrd
> > > > 189       74032 74022  0 07:53 ?      00:00:00 /usr/libexec/pacemaker/pengine
> > > > 189       74033 74022  0 07:53 ?      00:00:00 /usr/libexec/pacemaker/crmd
> > > >
> > > > root      75228 50092  0 07:54 pts/0  00:00:00 grep pacemaker
> > > > [root at SG-mysqlold-907 azureuser]# kill -9 74022
> > > >
> > > > [root at SG-mysqlold-907 azureuser]# ps -ef | grep pacemaker
> > > > root      74030     1  0 07:53 ?      00:00:00 /usr/libexec/pacemaker/lrmd
> > > > 189       74032     1  0 07:53 ?      00:00:00 /usr/libexec/pacemaker/pengine
> > > >
> > > > root      75303 50092  0 07:55 pts/0  00:00:00 grep pacemaker
> > > > [root at SG-mysqlold-907 azureuser]# kill -9 74030
> > > > [root at SG-mysqlold-907 azureuser]# kill -9 74032
> > > > [root at SG-mysqlold-907 azureuser]# ps -ef | grep pacemaker
> > > > root      75332 50092  0 07:55 pts/0  00:00:00 grep pacemaker
> > > >
> > > > [root at SG-mysqlold-907 azureuser]# crm status
> > > > ERROR: status: crm_mon (rc=107): Connection to cluster failed:
> > > > Transport endpoint is not connected
> > > > ----------------------------------------------------------------------
> > > >
> > > > However, this does not seem to be having any effect on the
> > > cluster
> > > > status from other nodes
> > > > ----------------------------------------------------------------------
> > > >
> > > > [root at SG-mysqlold-909 azureuser]# crm status
> > > > Last updated: Thu Sep 27 07:56:17 2018
> > > > Last change: Thu Sep 27 07:53:43 2018 by root via crm_attribute on SG-mysqlold-909
> > > > Stack: classic openais (with plugin)
> > > > Current DC: SG-mysqlold-908 (version 1.1.14-8.el6_8.1-70404b0) - partition with quorum
> > > > 3 nodes and 3 resources configured, 3 expected votes
> > > >
> > > > Online: [ SG-mysqlold-907 SG-mysqlold-908 SG-mysqlold-909 ]
> > >
> > > It most definitely would make the node offline, and if fencing were
> > > configured, the rest of the cluster would fence the node to make sure
> > > it's safely down.
> > >
> > > I see you're using the old corosync 1 plugin. I suspect what happened
> > > in this case is that corosync noticed the plugin died and restarted it
> > > quickly enough that it had rejoined by the time you checked the status
> > > elsewhere.
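One way to test that theory on SG-mysqlold-907 (again just a sketch; the
log location depends on your corosync logging directive, and
/var/log/cluster/corosync.log is only a common default):

# PIDs different from the ones you killed would indicate a respawn
ps -ef | grep pacemaker
grep -iE "pacemaker|respawn" /var/log/cluster/corosync.log | tail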
> > >
> > > >
> > > > Full list of resources:
> > > >
> > > > Master/Slave Set: ms_mysql [p_mysql]
> > > > Masters: [ SG-mysqlold-909 ]
> > > > Slaves: [ SG-mysqlold-907 SG-mysqlold-908 ]
> > > >
> > > >
> > > > [root at SG-mysqlold-908 azureuser]# crm status
> > > > Last updated: Thu Sep 27 07:56:08 2018
> > > > Last change: Thu Sep 27 07:53:43 2018 by root via crm_attribute on SG-mysqlold-909
> > > > Stack: classic openais (with plugin)
> > > > Current DC: SG-mysqlold-908 (version 1.1.14-8.el6_8.1-70404b0) - partition with quorum
> > > > 3 nodes and 3 resources configured, 3 expected votes
> > > >
> > > > Online: [ SG-mysqlold-907 SG-mysqlold-908 SG-mysqlold-909 ]
> > > >
> > > > Full list of resources:
> > > >
> > > > Master/Slave Set: ms_mysql [p_mysql]
> > > > Masters: [ SG-mysqlold-909 ]
> > > > Slaves: [ SG-mysqlold-907 SG-mysqlold-908 ]
> > > >
> > > > ----------------------------------------------------------------------
> > > >
> > > > I am a bit surprised that the other nodes are not able to detect
> > > > that pacemaker is down on one of the nodes - SG-mysqlold-907.
> > > >
> > > > Even if I kill pacemaker on the node which is the DC, I observe the
> > > > same behavior, with the rest of the nodes not detecting that the DC
> > > > is down.
> > > >
> > > > Could someone explain what the expected behavior is in these cases?
> > > >
> > > > I am using corosync 1.4.7 and pacemaker 1.1.14
> > > >
> > > > Thanks in advance
> > > > Prasad
> > > >
--
Ken Gaillot <kgaillot at redhat.com>