[ClusterLabs] Re: Understanding the behavior of pacemaker crash
Ulrich Windl
Ulrich.Windl at rz.uni-regensburg.de
Mon Oct 1 03:00:39 EDT 2018
>>> Ken Gaillot <kgaillot at redhat.com> wrote on 28.09.2018 at 15:50 in message
<1538142642.4679.1.camel at redhat.com>:
> On Fri, 2018-09-28 at 15:26 +0530, Prasad Nagaraj wrote:
>> Hi Ken - Only if I turn off corosync on the node [ where I crashed
>> pacemaker] other nodes are able to detect and put the node as
>> OFFLINE.
>> Do you have any other guidance or insights into this ?
>
> Yes, corosync is the cluster membership layer -- if corosync is
> successfully running, then the node is a member of the cluster.
> Pacemaker's crmd provides a higher level of membership; typically, with
> corosync but no crmd, the node shows up as "pending" in status. However,
> I am not sure how it worked with the old corosync plugin.
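>
> For example (a rough sketch; the exact output differs between stacks and
> versions), you can compare what each layer reports:
>
>   corosync-cfgtool -s    # corosync ring status (membership layer)
>   crm_node -l            # node list as seen by the Pacemaker layer
>   crm_mon -1             # one-shot cluster status; a node with corosync
>                          # running but no crmd would typically show "pending"
>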
Maybe crmd should "feed a watchdog with tranquilizers" (meaning that if it stops doing so, the watchdog will fire and reset the node). ;-)
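
Incidentally, that is more or less what sbd does when paired with a hardware
watchdog. A minimal sketch (assuming sbd is available for your stack; the file
location, option names and whether watchdog-only mode is supported all depend
on distribution and cluster stack version):

  # /etc/sysconfig/sbd  (illustrative values only)
  SBD_WATCHDOG_DEV=/dev/watchdog
  SBD_WATCHDOG_TIMEOUT=5

  # tell Pacemaker to treat the watchdog as a self-fencing device
  crm configure property stonith-watchdog-timeout=10s
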
Regards,
Ulrich
>
>>
>> Thanks
>> Prasad
>>
>> On Thu, Sep 27, 2018 at 9:33 PM Prasad Nagaraj
>> <prasad.nagaraj76 at gmail.com> wrote:
>> > Hi Ken - Thanks for the response. Pacemaker is still not running on
>> > that node, so I am still wondering what the issue could be. Are there
>> > any other configurations or logs I should share to help understand
>> > this?
>> >
>> > Thanks!
>> >
>> > On Thu, Sep 27, 2018 at 8:08 PM Ken Gaillot <kgaillot at redhat.com>
>> > wrote:
>> > > On Thu, 2018-09-27 at 13:45 +0530, Prasad Nagaraj wrote:
>> > > > Hello - I was trying to understand the behavior of the cluster when
>> > > > pacemaker crashes on one of the nodes. So I hard-killed pacemakerd
>> > > > and its related processes.
>> > > >
>> > > > ---------------------------------------------------------------------
>> > > > [root at SG-mysqlold-907 azureuser]# ps -ef | grep pacemaker
>> > > > root   74022     1  0 07:53 pts/0  00:00:00 pacemakerd
>> > > > 189    74028 74022  0 07:53 ?      00:00:00 /usr/libexec/pacemaker/cib
>> > > > root   74029 74022  0 07:53 ?      00:00:00 /usr/libexec/pacemaker/stonithd
>> > > > root   74030 74022  0 07:53 ?      00:00:00 /usr/libexec/pacemaker/lrmd
>> > > > 189    74031 74022  0 07:53 ?      00:00:00 /usr/libexec/pacemaker/attrd
>> > > > 189    74032 74022  0 07:53 ?      00:00:00 /usr/libexec/pacemaker/pengine
>> > > > 189    74033 74022  0 07:53 ?      00:00:00 /usr/libexec/pacemaker/crmd
>> > > >
>> > > > root   75228 50092  0 07:54 pts/0  00:00:00 grep pacemaker
>> > > > [root at SG-mysqlold-907 azureuser]# kill -9 74022
>> > > >
>> > > > [root at SG-mysqlold-907 azureuser]# ps -ef | grep pacemaker
>> > > > root   74030     1  0 07:53 ?      00:00:00 /usr/libexec/pacemaker/lrmd
>> > > > 189    74032     1  0 07:53 ?      00:00:00 /usr/libexec/pacemaker/pengine
>> > > >
>> > > > root   75303 50092  0 07:55 pts/0  00:00:00 grep pacemaker
>> > > > [root at SG-mysqlold-907 azureuser]# kill -9 74030
>> > > > [root at SG-mysqlold-907 azureuser]# kill -9 74032
>> > > > [root at SG-mysqlold-907 azureuser]# ps -ef | grep pacemaker
>> > > > root   75332 50092  0 07:55 pts/0  00:00:00 grep pacemaker
>> > > >
>> > > > [root at SG-mysqlold-907 azureuser]# crm status
>> > > > ERROR: status: crm_mon (rc=107): Connection to cluster failed:
>> > > > Transport endpoint is not connected
>> > > > ---------------------------------------------------------------------
>> > > >
>> > > > However, this does not seem to have any effect on the cluster
>> > > > status as seen from the other nodes:
>> > > > ---------------------------------------------------------------------
>> > > >
>> > > > [root at SG-mysqlold-909 azureuser]# crm status
>> > > > Last updated: Thu Sep 27 07:56:17 2018
>> > > > Last change: Thu Sep 27 07:53:43 2018 by root via crm_attribute on SG-mysqlold-909
>> > > > Stack: classic openais (with plugin)
>> > > > Current DC: SG-mysqlold-908 (version 1.1.14-8.el6_8.1-70404b0) - partition with quorum
>> > > > 3 nodes and 3 resources configured, 3 expected votes
>> > > >
>> > > > Online: [ SG-mysqlold-907 SG-mysqlold-908 SG-mysqlold-909 ]
>> > >
>> > > It most definitely would make the node offline, and if fencing were
>> > > configured, the rest of the cluster would fence the node to make sure
>> > > it's safely down.
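>> > >
>> > > For illustration only (the fence agent and every parameter below are
>> > > placeholders; use whatever fence device actually exists in your
>> > > environment), a fencing setup via the crm shell could look roughly like:
>> > >
>> > >   crm configure primitive fence-907 stonith:fence_ipmilan \
>> > >     params ipaddr=10.0.0.7 login=admin passwd=secret \
>> > >            pcmk_host_list=SG-mysqlold-907 \
>> > >     op monitor interval=60s
>> > >   crm configure property stonith-enabled=true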
>> > >
>> > > I see you're using the old corosync 1 plugin. I suspect what happened
>> > > in this case is that corosync noticed the plugin died and restarted it
>> > > quickly enough that it had rejoined by the time you checked the status
>> > > elsewhere.
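>> > >
>> > > One way to check that theory (a rough sketch; the log location below
>> > > is only a guess for this setup) is to compare the daemons' PIDs and
>> > > start times before and after the kill, and to look for messages about
>> > > the child processes being restarted:
>> > >
>> > >   ps -o pid,ppid,lstart,args -C crmd,cib,attrd
>> > >   grep -i respawn /var/log/cluster/corosync.log
>> > >
>> > > If the PIDs changed and the start times are around the time of the
>> > > kill, the daemons were restarted rather than having survived.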
>> > >
>> > > >
>> > > > Full list of resources:
>> > > >
>> > > > Master/Slave Set: ms_mysql [p_mysql]
>> > > > Masters: [ SG-mysqlold-909 ]
>> > > > Slaves: [ SG-mysqlold-907 SG-mysqlold-908 ]
>> > > >
>> > > >
>> > > > [root at SG-mysqlold-908 azureuser]# crm status
>> > > > Last updated: Thu Sep 27 07:56:08 2018
>> > > > Last change: Thu Sep 27 07:53:43 2018 by root via crm_attribute on SG-mysqlold-909
>> > > > Stack: classic openais (with plugin)
>> > > > Current DC: SG-mysqlold-908 (version 1.1.14-8.el6_8.1-70404b0) - partition with quorum
>> > > > 3 nodes and 3 resources configured, 3 expected votes
>> > > >
>> > > > Online: [ SG-mysqlold-907 SG-mysqlold-908 SG-mysqlold-909 ]
>> > > >
>> > > > Full list of resources:
>> > > >
>> > > > Master/Slave Set: ms_mysql [p_mysql]
>> > > > Masters: [ SG-mysqlold-909 ]
>> > > > Slaves: [ SG-mysqlold-907 SG-mysqlold-908 ]
>> > > >
>> > > > ---------------------------------------------------------------------
>> > > >
>> > > > I am a bit surprised that the other nodes are not able to detect
>> > > > that pacemaker is down on one of the nodes - SG-mysqlold-907.
>> > > >
>> > > > Even if I kill pacemaker on the node that is the DC, I observe the
>> > > > same behavior, with the rest of the nodes not detecting that the DC
>> > > > is down.
>> > > >
>> > > > Could someone explain what the expected behavior is in these cases?
>> > > >
>> > > > I am using corosync 1.4.7 and pacemaker 1.1.14
>> > > >
>> > > > Thanks in advance
>> > > > Prasad
>> > > >
> --
> Ken Gaillot <kgaillot at redhat.com>