[ClusterLabs] Understanding the behavior of pacemaker crash

Thu Sep 27 10:38:10 EDT 2018

On Thu, 2018-09-27 at 13:45 +0530, Prasad Nagaraj wrote:
> Hello - I was trying to understand the behavior of the cluster when
> pacemaker crashes on one of the nodes. So I hard-killed pacemakerd
> and its related processes.
> 
> -------------------------------------------------------------------
> [root@SG-mysqlold-907 azureuser]# ps -ef | grep pacemaker
> root      74022      1  0 07:53 pts/0    00:00:00 pacemakerd
> 189       74028  74022  0 07:53 ?        00:00:00 /usr/libexec/pacemaker/cib
> root      74029  74022  0 07:53 ?        00:00:00 /usr/libexec/pacemaker/stonithd
> root      74030  74022  0 07:53 ?        00:00:00 /usr/libexec/pacemaker/lrmd
> 189       74031  74022  0 07:53 ?        00:00:00 /usr/libexec/pacemaker/attrd
> 189       74032  74022  0 07:53 ?        00:00:00 /usr/libexec/pacemaker/pengine
> 189       74033  74022  0 07:53 ?        00:00:00 /usr/libexec/pacemaker/crmd
> 
> root      75228  50092  0 07:54 pts/0    00:00:00 grep pacemaker
> [root@SG-mysqlold-907 azureuser]# kill -9 74022
> 
> [root@SG-mysqlold-907 azureuser]# ps -ef | grep pacemaker
> root      74030      1  0 07:53 ?        00:00:00 /usr/libexec/pacemaker/lrmd
> 189       74032      1  0 07:53 ?        00:00:00 /usr/libexec/pacemaker/pengine
> 
> root      75303  50092  0 07:55 pts/0    00:00:00 grep pacemaker
> [root@SG-mysqlold-907 azureuser]# kill -9 74030
> [root@SG-mysqlold-907 azureuser]# kill -9 74032
> [root@SG-mysqlold-907 azureuser]# ps -ef | grep pacemaker
> root      75332  50092  0 07:55 pts/0    00:00:00 grep pacemaker
> 
> [root@SG-mysqlold-907 azureuser]# crm status
> ERROR: status: crm_mon (rc=107): Connection to cluster failed: Transport endpoint is not connected
> -------------------------------------------------------------------
> 
> However, this does not seem to have any effect on the cluster
> status as seen from the other nodes:
> -------------------------------------------------------------------
> 
> [root@SG-mysqlold-909 azureuser]# crm status
> Last updated: Thu Sep 27 07:56:17 2018          Last change: Thu Sep 27 07:53:43 2018 by root via crm_attribute on SG-mysqlold-909
> Stack: classic openais (with plugin)
> Current DC: SG-mysqlold-908 (version 1.1.14-8.el6_8.1-70404b0) - partition with quorum
> 3 nodes and 3 resources configured, 3 expected votes
> 
> Online: [ SG-mysqlold-907 SG-mysqlold-908 SG-mysqlold-909 ]

Killing pacemaker that way most definitely would make the node
offline, and if fencing were configured, the rest of the cluster would
fence the node to make sure it's safely down.

I see you're using the old corosync 1 plugin. I suspect what happened
in this case is that corosync noticed the plugin died and restarted it
quickly enough that it had rejoined by the time you checked the status
elsewhere.
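
One way to check that guess is to compare the daemon PIDs on the node
before and after the kill: PIDs that appear only afterwards mean
something respawned the daemons. A minimal sketch, assuming pgrep is
available on the node; new_pids is a hypothetical helper name, not part
of pacemaker or corosync:

```shell
# Hypothetical helper: print every PID from the "after" snapshot that
# was not present in the "before" snapshot (i.e. a freshly started process).
new_pids() {
    # $1 = PIDs seen before the kill, $2 = PIDs seen afterwards.
    # The unquoted echo collapses newlines/tabs into single spaces.
    _before=" $(echo $1) "
    for _pid in $2; do
        case "$_before" in
            *" $_pid "*) ;;                  # PID already existed before
            *) printf '%s\n' "$_pid" ;;      # PID is new since the snapshot
        esac
    done
}

# Possible usage on the affected node (run as root):
#   before=$(pgrep -f pacemaker)
#   ... kill -9 the pacemaker processes ...
#   sleep 5
#   after=$(pgrep -f pacemaker)
#   new_pids "$before" "$after"
```

If this prints PIDs shortly after the kill, the daemons were restarted
on the node, which would explain why the other nodes still show it as
online. Unlike "ps -ef | grep pacemaker", pgrep does not list itself.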

> 
> Full list of resources:
> 
>  Master/Slave Set: ms_mysql [p_mysql]
>      Masters: [ SG-mysqlold-909 ]
>      Slaves: [ SG-mysqlold-907 SG-mysqlold-908 ]
> 
> 
> [root@SG-mysqlold-908 azureuser]# crm status
> Last updated: Thu Sep 27 07:56:08 2018          Last change: Thu Sep 27 07:53:43 2018 by root via crm_attribute on SG-mysqlold-909
> Stack: classic openais (with plugin)
> Current DC: SG-mysqlold-908 (version 1.1.14-8.el6_8.1-70404b0) - partition with quorum
> 3 nodes and 3 resources configured, 3 expected votes
> 
> Online: [ SG-mysqlold-907 SG-mysqlold-908 SG-mysqlold-909 ]
> 
> Full list of resources:
> 
>  Master/Slave Set: ms_mysql [p_mysql]
>      Masters: [ SG-mysqlold-909 ]
>      Slaves: [ SG-mysqlold-907 SG-mysqlold-908 ]
> 
> -------------------------------------------------------------------
> 
> I am a bit surprised that the other nodes are not able to detect
> that pacemaker is down on one of the nodes, SG-mysqlold-907.
> 
> Even if I kill pacemaker on the node that is the DC, I observe the
> same behavior: the rest of the nodes do not detect that the DC is
> down.
> 
> Could someone explain what the expected behavior is in these cases?
>  
> I am using corosync 1.4.7 and pacemaker 1.1.14
> 
> Thanks in advance
> Prasad
-- 
Ken Gaillot <kgaillot at redhat.com>