[ClusterLabs] Understanding the behavior of pacemaker crash

Prasad Nagaraj prasad.nagaraj76 at gmail.com
Thu Sep 27 16:03:41 UTC 2018


Hi Ken - Thanks for the response. Pacemaker is still not running on that
node, so I am still wondering what the issue could be. Are there any other
configurations or logs I should share to help understand this further?

Thanks!
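
A sketch of the diagnostics that are usually worth collecting and sharing in
a case like this; the log paths below assume a default CentOS 6 / corosync
1.x setup and may differ on these machines:

----------------------------------------------------------------------
# Is pacemakerd running again, and since when? A recent start time
# would mean it was respawned rather than left dead after the kill.
ps -o pid,lstart,cmd -C pacemakerd

# Membership and resource state as seen from the affected node
crm_mon -1

# Corosync ring status on the affected node
corosync-cfgtool -s

# Full cluster configuration (CIB)
cibadmin --query > cib-SG-mysqlold-907.xml

# Log entries around the time of the kill (paths are a guess)
grep -i pacemaker /var/log/cluster/corosync.log
grep -i pacemaker /var/log/messages
----------------------------------------------------------------------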

On Thu, Sep 27, 2018 at 8:08 PM Ken Gaillot <kgaillot at redhat.com> wrote:

> On Thu, 2018-09-27 at 13:45 +0530, Prasad Nagaraj wrote:
> > Hello - I was trying to understand the behavior of the cluster when
> > Pacemaker crashes on one of the nodes, so I hard-killed pacemakerd
> > and its related processes.
> >
> > ----------------------------------------------------------------------
> > [root at SG-mysqlold-907 azureuser]# ps -ef | grep pacemaker
> > root      74022      1  0 07:53 pts/0    00:00:00 pacemakerd
> > 189       74028  74022  0 07:53 ?        00:00:00
> > /usr/libexec/pacemaker/cib
> > root      74029  74022  0 07:53 ?        00:00:00
> > /usr/libexec/pacemaker/stonithd
> > root      74030  74022  0 07:53 ?        00:00:00
> > /usr/libexec/pacemaker/lrmd
> > 189       74031  74022  0 07:53 ?        00:00:00
> > /usr/libexec/pacemaker/attrd
> > 189       74032  74022  0 07:53 ?        00:00:00
> > /usr/libexec/pacemaker/pengine
> > 189       74033  74022  0 07:53 ?        00:00:00
> > /usr/libexec/pacemaker/crmd
> >
> > root      75228  50092  0 07:54 pts/0    00:00:00 grep pacemaker
> > [root at SG-mysqlold-907 azureuser]# kill -9 74022
> >
> > [root at SG-mysqlold-907 azureuser]# ps -ef | grep pacemaker
> > root      74030      1  0 07:53 ?        00:00:00
> > /usr/libexec/pacemaker/lrmd
> > 189       74032      1  0 07:53 ?        00:00:00
> > /usr/libexec/pacemaker/pengine
> >
> > root      75303  50092  0 07:55 pts/0    00:00:00 grep pacemaker
> > [root at SG-mysqlold-907 azureuser]# kill -9 74030
> > [root at SG-mysqlold-907 azureuser]# kill -9 74032
> > [root at SG-mysqlold-907 azureuser]# ps -ef | grep pacemaker
> > root      75332  50092  0 07:55 pts/0    00:00:00 grep pacemaker
> >
> > [root at SG-mysqlold-907 azureuser]# crm satus
> > ERROR: status: crm_mon (rc=107): Connection to cluster failed:
> > Transport endpoint is not connected
> > ----------------------------------------------------------------------
> >
> > However, this does not seem to have any effect on the cluster
> > status as seen from the other nodes:
> > ----------------------------------------------------------------------
> >
> > [root at SG-mysqlold-909 azureuser]# crm status
> > Last updated: Thu Sep 27 07:56:17 2018          Last change: Thu Sep
> > 27 07:53:43 2018 by root via crm_attribute on SG-mysqlold-909
> > Stack: classic openais (with plugin)
> > Current DC: SG-mysqlold-908 (version 1.1.14-8.el6_8.1-70404b0) -
> > partition with quorum
> > 3 nodes and 3 resources configured, 3 expected votes
> >
> > Online: [ SG-mysqlold-907 SG-mysqlold-908 SG-mysqlold-909 ]
>
> It most definitely would make the node offline, and if fencing were
> configured, the rest of the cluster would fence the node to make sure
> it's safely down.
>
> I see you're using the old corosync 1 plugin. I suspect what happened
> in this case is that corosync noticed the plugin died and restarted it
> quickly enough that it had rejoined by the time you checked the status
> elsewhere.
>
> >
> > Full list of resources:
> >
> >  Master/Slave Set: ms_mysql [p_mysql]
> >      Masters: [ SG-mysqlold-909 ]
> >      Slaves: [ SG-mysqlold-907 SG-mysqlold-908 ]
> >
> >
> > [root at SG-mysqlold-908 azureuser]# crm status
> > Last updated: Thu Sep 27 07:56:08 2018          Last change: Thu Sep
> > 27 07:53:43 2018 by root via crm_attribute on SG-mysqlold-909
> > Stack: classic openais (with plugin)
> > Current DC: SG-mysqlold-908 (version 1.1.14-8.el6_8.1-70404b0) -
> > partition with quorum
> > 3 nodes and 3 resources configured, 3 expected votes
> >
> > Online: [ SG-mysqlold-907 SG-mysqlold-908 SG-mysqlold-909 ]
> >
> > Full list of resources:
> >
> >  Master/Slave Set: ms_mysql [p_mysql]
> >      Masters: [ SG-mysqlold-909 ]
> >      Slaves: [ SG-mysqlold-907 SG-mysqlold-908 ]
> >
> > ----------------------------------------------------------------------
> >
> > I am a bit surprised that the other nodes are not able to detect that
> > Pacemaker is down on one of the nodes (SG-mysqlold-907).
> >
> > Even if I kill Pacemaker on the node that is the DC, I observe the
> > same behavior: the rest of the nodes do not detect that the DC is down.
> >
> > Could someone explain what the expected behavior is in these cases?
> >
> > I am using corosync 1.4.7 and pacemaker 1.1.14
> >
> > Thanks in advance
> > Prasad
> >
> --
> Ken Gaillot <kgaillot at redhat.com>
>
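
Regarding the note above that, with fencing configured, the rest of the
cluster would fence a node whose Pacemaker died: a minimal sketch of what
enabling fencing could look like via crmsh, assuming IPMI-capable management
boards and using made-up addresses and credentials:

----------------------------------------------------------------------
# Hypothetical fence device for SG-mysqlold-907; the address and
# credentials are placeholders, and similar primitives would be needed
# for the other two nodes.
crm configure primitive fence-907 stonith:fence_ipmilan \
    params pcmk_host_list="SG-mysqlold-907" ipaddr="10.0.0.17" \
           login="admin" passwd="secret" lanplus="1" \
    op monitor interval="60s"

# Turn fencing on cluster-wide
crm configure property stonith-enabled="true"
----------------------------------------------------------------------

With something like this in place, killing the Pacemaker processes on a node
(and that node failing to rejoin) should lead the DC to schedule a fence of
that node rather than continue to report it online.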