[ClusterLabs] Understanding the behavior of pacemaker crash
Prasad Nagaraj
prasad.nagaraj76 at gmail.com
Fri Sep 28 05:56:21 EDT 2018
Hi Ken - Only if I turn off corosync on the node [where I crashed
pacemaker] do the other nodes detect this and mark the node as OFFLINE.
Do you have any other guidance or insights into this?
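
A minimal sketch of that check, assuming the test node is SG-mysqlold-907
and you watch from any surviving peer (the command names are standard; the
hostnames are just the ones from this thread):

    # On the node under test: stop the membership layer itself,
    # not just the Pacemaker daemons.
    [root at SG-mysqlold-907 ~]# service corosync stop

    # From any other node: the stopped node should now be reported
    # as OFFLINE (and be fenced, if STONITH is configured).
    [root at SG-mysqlold-908 ~]# crm_mon -1 | grep -i offline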
Thanks
Prasad
On Thu, Sep 27, 2018 at 9:33 PM Prasad Nagaraj <prasad.nagaraj76 at gmail.com>
wrote:
> Hi Ken - Thanks for the response. Pacemaker is still not running on that
> node, so I am still wondering what the issue could be. Are there any other
> configurations or logs I should share to help understand this?
>
> Thanks!
>
> On Thu, Sep 27, 2018 at 8:08 PM Ken Gaillot <kgaillot at redhat.com> wrote:
>
>> On Thu, 2018-09-27 at 13:45 +0530, Prasad Nagaraj wrote:
>> > Hello - I was trying to understand the behavior of the cluster when
>> > pacemaker crashes on one of the nodes, so I hard-killed pacemakerd
>> > and its related processes.
>> >
>> > --------------------------------------------------------------------------
>> > [root at SG-mysqlold-907 azureuser]# ps -ef | grep pacemaker
>> > root     74022     1  0 07:53 pts/0    00:00:00 pacemakerd
>> > 189      74028 74022  0 07:53 ?        00:00:00 /usr/libexec/pacemaker/cib
>> > root     74029 74022  0 07:53 ?        00:00:00 /usr/libexec/pacemaker/stonithd
>> > root     74030 74022  0 07:53 ?        00:00:00 /usr/libexec/pacemaker/lrmd
>> > 189      74031 74022  0 07:53 ?        00:00:00 /usr/libexec/pacemaker/attrd
>> > 189      74032 74022  0 07:53 ?        00:00:00 /usr/libexec/pacemaker/pengine
>> > 189      74033 74022  0 07:53 ?        00:00:00 /usr/libexec/pacemaker/crmd
>> > root     75228 50092  0 07:54 pts/0    00:00:00 grep pacemaker
>> >
>> > [root at SG-mysqlold-907 azureuser]# kill -9 74022
>> >
>> > [root at SG-mysqlold-907 azureuser]# ps -ef | grep pacemaker
>> > root     74030     1  0 07:53 ?        00:00:00 /usr/libexec/pacemaker/lrmd
>> > 189      74032     1  0 07:53 ?        00:00:00 /usr/libexec/pacemaker/pengine
>> > root     75303 50092  0 07:55 pts/0    00:00:00 grep pacemaker
>> >
>> > [root at SG-mysqlold-907 azureuser]# kill -9 74030
>> > [root at SG-mysqlold-907 azureuser]# kill -9 74032
>> > [root at SG-mysqlold-907 azureuser]# ps -ef | grep pacemaker
>> > root     75332 50092  0 07:55 pts/0    00:00:00 grep pacemaker
>> >
>> > [root at SG-mysqlold-907 azureuser]# crm status
>> > ERROR: status: crm_mon (rc=107): Connection to cluster failed: Transport endpoint is not connected
>> > --------------------------------------------------------------------------
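
For what it's worth, a rough one-line equivalent of the PID-by-PID kill
above, assuming the stock pacemaker 1.1 daemon names, would be:

    # Hard-kill pacemakerd and all of its child daemons in one step;
    # the daemon list matches the processes shown in the ps output above.
    killall -9 pacemakerd cib stonithd lrmd attrd pengine crmd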
>> >
>> > However, this does not seem to have any effect on the cluster status
>> > as seen from the other nodes:
>> > --------------------------------------------------------------------------
>> >
>> > [root at SG-mysqlold-909 azureuser]# crm status
>> > Last updated: Thu Sep 27 07:56:17 2018    Last change: Thu Sep 27 07:53:43 2018 by root via crm_attribute on SG-mysqlold-909
>> > Stack: classic openais (with plugin)
>> > Current DC: SG-mysqlold-908 (version 1.1.14-8.el6_8.1-70404b0) - partition with quorum
>> > 3 nodes and 3 resources configured, 3 expected votes
>> >
>> > Online: [ SG-mysqlold-907 SG-mysqlold-908 SG-mysqlold-909 ]
>>
>> It most definitely would make the node offline, and if fencing were
>> configured, the rest of the cluster would fence the node to make sure
>> it's safely down.
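
A minimal sketch of what enabling fencing could look like with crmsh, using
fence_ipmilan purely as an example; the resource name, address, and
credentials below are placeholders, not taken from this cluster:

    # Hypothetical IPMI fencing device for the killed node; all parameter
    # values here are placeholders and must match the real BMC.
    crm configure primitive fence-907 stonith:fence_ipmilan \
        params pcmk_host_list="SG-mysqlold-907" ipaddr="192.0.2.17" \
               login="admin" passwd="secret" lanplus="1" \
        op monitor interval="60s"
    # Turn fencing on cluster-wide.
    crm configure property stonith-enabled="true"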
>>
>> I see you're using the old corosync 1 plugin. I suspect what happened
>> in this case is that corosync noticed the plugin died and restarted it
>> quickly enough that it had rejoined by the time you checked the status
>> elsewhere.
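
One way to confirm that theory, assuming the usual RHEL 6 log locations
(which may differ on other setups), is to check whether the daemons came
back with new PIDs and to look for restart messages around the time of the
kill:

    # On the killed node: if the daemons were respawned, they will be
    # running again with new PIDs and recent start times.
    ps -o pid,lstart,args -C pacemakerd,cib,stonithd,lrmd,attrd,pengine,crmd

    # Restart/respawn messages around the time of the kill.
    grep -i pacemaker /var/log/messages | tail -50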
>>
>> >
>> > Full list of resources:
>> >
>> > Master/Slave Set: ms_mysql [p_mysql]
>> > Masters: [ SG-mysqlold-909 ]
>> > Slaves: [ SG-mysqlold-907 SG-mysqlold-908 ]
>> >
>> >
>> > [root at SG-mysqlold-908 azureuser]# crm status
>> > Last updated: Thu Sep 27 07:56:08 2018    Last change: Thu Sep 27 07:53:43 2018 by root via crm_attribute on SG-mysqlold-909
>> > Stack: classic openais (with plugin)
>> > Current DC: SG-mysqlold-908 (version 1.1.14-8.el6_8.1-70404b0) - partition with quorum
>> > 3 nodes and 3 resources configured, 3 expected votes
>> >
>> > Online: [ SG-mysqlold-907 SG-mysqlold-908 SG-mysqlold-909 ]
>> >
>> > Full list of resources:
>> >
>> > Master/Slave Set: ms_mysql [p_mysql]
>> > Masters: [ SG-mysqlold-909 ]
>> > Slaves: [ SG-mysqlold-907 SG-mysqlold-908 ]
>> >
>> > --------------------------------------------------------------------------
>> >
>> > I am a bit surprised that the other nodes are not able to detect that
>> > pacemaker is down on one of the nodes - SG-mysqlold-907.
>> >
>> > Even if I kill pacemaker on the node which is the DC, I observe the
>> > same behavior, with the rest of the nodes not detecting that the DC
>> > is down.
>> >
>> > Could someone explain what the expected behavior is in these cases?
>> >
>> > I am using corosync 1.4.7 and pacemaker 1.1.14
>> >
>> > Thanks in advance
>> > Prasad
>> >
>> --
>> Ken Gaillot <kgaillot at redhat.com>
>> _______________________________________________
>> Users mailing list: Users at clusterlabs.org
>> https://lists.clusterlabs.org/mailman/listinfo/users
>>
>> Project Home: http://www.clusterlabs.org
>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>> Bugs: http://bugs.clusterlabs.org
>>
>