[ClusterLabs] Re: Understanding the behavior of pacemaker crash
Ulrich Windl
Ulrich.Windl at rz.uni-regensburg.de
Mon Oct 1 03:00:39 EDT 2018
>>> Ken Gaillot <kgaillot at redhat.com> wrote on 28.09.2018 at 15:50 in message
<1538142642.4679.1.camel at redhat.com>:
> On Fri, 2018-09-28 at 15:26 +0530, Prasad Nagaraj wrote:
>> Hi Ken - Only if I turn off corosync on the node [ where I crashed
>> pacemaker] other nodes are able to detect and put the node as
>> OFFLINE.
>> Do you have any other guidance or insights into this ?
>
> Yes, corosync is the cluster membership layer -- if corosync is
> successfully running, then the node is a member of the cluster.
> Pacemaker's crmd provides a higher level of membership; typically, with
> corosync but no crmd, the node shows up as "pending" in status. However,
> I am not sure how it worked with the old corosync plugin.
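>
> For example (a rough sketch; the exact output differs between stacks and
> versions), you can compare what each layer reports:
>
>   corosync-cfgtool -s    # corosync ring status (membership layer)
>   crm_node -l            # node list as seen by the Pacemaker layer
>   crm_mon -1             # one-shot cluster status; a node with corosync
>                          # running but no crmd would typically show "pending"
>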
Maybe crmd should "feed a watchdog with tranquilizers" (meaning that if it stops doing so, the watchdog will fire and reset the node). ;-)
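
Incidentally, that is more or less what sbd does when paired with a hardware
watchdog. A minimal sketch (assuming sbd is available for your stack; the file
location, option names and whether watchdog-only mode is supported all depend
on distribution and cluster stack version):

  # /etc/sysconfig/sbd  (illustrative values only)
  SBD_WATCHDOG_DEV=/dev/watchdog
  SBD_WATCHDOG_TIMEOUT=5

  # tell Pacemaker to treat the watchdog as a self-fencing device
  crm configure property stonith-watchdog-timeout=10s
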
Regards,
Ulrich
>
>>
>> Thanks
>> Prasad
>>
>> On Thu, Sep 27, 2018 at 9:33 PM Prasad Nagaraj
>> <prasad.nagaraj76 at gmail.com> wrote:
>> > Hi Ken - Thanks for the response. Pacemaker is still not running on
>> > that node, so I am still wondering what the issue could be. Are there
>> > any other configurations or logs I should share to help understand
>> > this?
>> >
>> > Thanks!
>> >
>> > On Thu, Sep 27, 2018 at 8:08 PM Ken Gaillot <kgaillot at redhat.com>
>> > wrote:
>> > > On Thu, 2018-09-27 at 13:45 +0530, Prasad Nagaraj wrote:
>> > > > Hello - I was trying to understand the behavior of the cluster when
>> > > > pacemaker crashes on one of the nodes. So I hard-killed pacemakerd
>> > > > and its related processes.
>> > > >
>> > > > ---------------------------------------------------------------------
>> > > > [root at SG-mysqlold-907 azureuser]# ps -ef | grep pacemaker
>> > > > root   74022     1  0 07:53 pts/0  00:00:00 pacemakerd
>> > > > 189    74028 74022  0 07:53 ?      00:00:00 /usr/libexec/pacemaker/cib
>> > > > root   74029 74022  0 07:53 ?      00:00:00 /usr/libexec/pacemaker/stonithd
>> > > > root   74030 74022  0 07:53 ?      00:00:00 /usr/libexec/pacemaker/lrmd
>> > > > 189    74031 74022  0 07:53 ?      00:00:00 /usr/libexec/pacemaker/attrd
>> > > > 189    74032 74022  0 07:53 ?      00:00:00 /usr/libexec/pacemaker/pengine
>> > > > 189    74033 74022  0 07:53 ?      00:00:00 /usr/libexec/pacemaker/crmd
>> > > >
>> > > > root   75228 50092  0 07:54 pts/0  00:00:00 grep pacemaker
>> > > > [root at SG-mysqlold-907 azureuser]# kill -9 74022
>> > > >
>> > > > [root at SG-mysqlold-907 azureuser]# ps -ef | grep pacemaker
>> > > > root   74030     1  0 07:53 ?      00:00:00 /usr/libexec/pacemaker/lrmd
>> > > > 189    74032     1  0 07:53 ?      00:00:00 /usr/libexec/pacemaker/pengine
>> > > >
>> > > > root   75303 50092  0 07:55 pts/0  00:00:00 grep pacemaker
>> > > > [root at SG-mysqlold-907 azureuser]# kill -9 74030
>> > > > [root at SG-mysqlold-907 azureuser]# kill -9 74032
>> > > > [root at SG-mysqlold-907 azureuser]# ps -ef | grep pacemaker
>> > > > root   75332 50092  0 07:55 pts/0  00:00:00 grep pacemaker
>> > > >
>> > > > [root at SG-mysqlold-907 azureuser]# crm status
>> > > > ERROR: status: crm_mon (rc=107): Connection to cluster failed:
>> > > > Transport endpoint is not connected
>> > > > ---------------------------------------------------------------------
>> > > >
>> > > > However, this does not seem to have any effect on the cluster
>> > > > status as seen from the other nodes:
>> > > > ---------------------------------------------------------------------
>> > > >
>> > > > [root at SG-mysqlold-909 azureuser]# crm status
>> > > > Last updated: Thu Sep 27 07:56:17 2018
>> > > > Last change: Thu Sep 27 07:53:43 2018 by root via crm_attribute on SG-mysqlold-909
>> > > > Stack: classic openais (with plugin)
>> > > > Current DC: SG-mysqlold-908 (version 1.1.14-8.el6_8.1-70404b0) - partition with quorum
>> > > > 3 nodes and 3 resources configured, 3 expected votes
>> > > >
>> > > > Online: [ SG-mysqlold-907 SG-mysqlold-908 SG-mysqlold-909 ]
>> > >
>> > > It most definitely would make the node offline, and if fencing were
>> > > configured, the rest of the cluster would fence the node to make sure
>> > > it's safely down.
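>> > >
>> > > For illustration only (the fence agent and every parameter below are
>> > > placeholders; use whatever fence device actually exists in your
>> > > environment), a fencing setup via the crm shell could look roughly like:
>> > >
>> > >   crm configure primitive fence-907 stonith:fence_ipmilan \
>> > >     params ipaddr=10.0.0.7 login=admin passwd=secret \
>> > >            pcmk_host_list=SG-mysqlold-907 \
>> > >     op monitor interval=60s
>> > >   crm configure property stonith-enabled=true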
>> > >
>> > > I see you're using the old corosync 1 plugin. I suspect what happened
>> > > in this case is that corosync noticed the plugin died and restarted it
>> > > quickly enough that it had rejoined by the time you checked the status
>> > > elsewhere.
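>> > >
>> > > One way to check that theory (a rough sketch; the log location below
>> > > is only a guess for this setup) is to compare the daemons' PIDs and
>> > > start times before and after the kill, and to look for messages about
>> > > the child processes being restarted:
>> > >
>> > >   ps -o pid,ppid,lstart,args -C crmd,cib,attrd
>> > >   grep -i respawn /var/log/cluster/corosync.log
>> > >
>> > > If the PIDs changed and the start times are around the time of the
>> > > kill, the daemons were restarted rather than having survived.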
>> > >
>> > > >
>> > > > Full list of resources:
>> > > >
>> > > > Master/Slave Set: ms_mysql [p_mysql]
>> > > > Masters: [ SG-mysqlold-909 ]
>> > > > Slaves: [ SG-mysqlold-907 SG-mysqlold-908 ]
>> > > >
>> > > >
>> > > > [root at SG-mysqlold-908 azureuser]# crm status
>> > > > Last updated: Thu Sep 27 07:56:08 2018
>> > > > Last change: Thu Sep 27 07:53:43 2018 by root via crm_attribute on SG-mysqlold-909
>> > > > Stack: classic openais (with plugin)
>> > > > Current DC: SG-mysqlold-908 (version 1.1.14-8.el6_8.1-70404b0) - partition with quorum
>> > > > 3 nodes and 3 resources configured, 3 expected votes
>> > > >
>> > > > Online: [ SG-mysqlold-907 SG-mysqlold-908 SG-mysqlold-909 ]
>> > > >
>> > > > Full list of resources:
>> > > >
>> > > > Master/Slave Set: ms_mysql [p_mysql]
>> > > > Masters: [ SG-mysqlold-909 ]
>> > > > Slaves: [ SG-mysqlold-907 SG-mysqlold-908 ]
>> > > >
>> > > > ---------------------------------------------------------------------
>> > > >
>> > > > I am a bit surprised that the other nodes are not able to detect
>> > > > that pacemaker is down on one of the nodes - SG-mysqlold-907.
>> > > >
>> > > > Even if I kill pacemaker on the node that is the DC, I observe the
>> > > > same behavior, with the rest of the nodes not detecting that the DC
>> > > > is down.
>> > > >
>> > > > Could someone explain what the expected behavior is in these cases?
>> > > >
>> > > > I am using corosync 1.4.7 and pacemaker 1.1.14
>> > > >
>> > > > Thanks in advance
>> > > > Prasad
>> > > >
> --
> Ken Gaillot <kgaillot at redhat.com>