<div dir="ltr">Hi Ken - The other nodes detect the failure and mark the node OFFLINE only if I also stop corosync on the node where I crashed pacemaker.<div>Do you have any other guidance or insights into this?</div><div><br></div><div>Thanks</div><div>Prasad</div></div><br><div class="gmail_quote"><div dir="ltr">On Thu, Sep 27, 2018 at 9:33 PM Prasad Nagaraj <<a href="mailto:prasad.nagaraj76@gmail.com">prasad.nagaraj76@gmail.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr">Hi Ken - Thanks for the response. Pacemaker is still not running on that node, so I am still wondering what the issue could be. Are there any other configurations or logs I should share to help understand this?<div><br></div><div>Thanks!</div></div><br><div class="gmail_quote"><div dir="ltr">On Thu, Sep 27, 2018 at 8:08 PM Ken Gaillot <<a href="mailto:kgaillot@redhat.com" target="_blank">kgaillot@redhat.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">On Thu, 2018-09-27 at 13:45 +0530, Prasad Nagaraj wrote:<br>

> Hello - I was trying to understand the behavior of the cluster when<br>

> pacemaker crashes on one of the nodes. So I hard killed pacemakerd<br>

> and its related processes.<br>

> <br>

> -------------------------------------------------------------------<br>

> -------------------------------------<br>

> [root@SG-mysqlold-907 azureuser]# ps -ef | grep pacemaker<br>

> root      74022      1  0 07:53 pts/0    00:00:00 pacemakerd<br>

> 189       74028  74022  0 07:53 ?        00:00:00<br>

> /usr/libexec/pacemaker/cib<br>

> root      74029  74022  0 07:53 ?        00:00:00<br>

> /usr/libexec/pacemaker/stonithd<br>

> root      74030  74022  0 07:53 ?        00:00:00<br>

> /usr/libexec/pacemaker/lrmd<br>

> 189       74031  74022  0 07:53 ?        00:00:00<br>

> /usr/libexec/pacemaker/attrd<br>

> 189       74032  74022  0 07:53 ?        00:00:00<br>

> /usr/libexec/pacemaker/pengine<br>

> 189       74033  74022  0 07:53 ?        00:00:00<br>

> /usr/libexec/pacemaker/crmd<br>

> <br>

> root      75228  50092  0 07:54 pts/0    00:00:00 grep pacemaker<br>

> [root@SG-mysqlold-907 azureuser]# kill -9 74022<br>

> <br>

> [root@SG-mysqlold-907 azureuser]# ps -ef | grep pacemaker<br>

> root      74030      1  0 07:53 ?        00:00:00<br>

> /usr/libexec/pacemaker/lrmd<br>

> 189       74032      1  0 07:53 ?        00:00:00<br>

> /usr/libexec/pacemaker/pengine<br>

> <br>

> root      75303  50092  0 07:55 pts/0    00:00:00 grep pacemaker<br>

> [root@SG-mysqlold-907 azureuser]# kill -9 74030<br>

> [root@SG-mysqlold-907 azureuser]# kill -9 74032<br>

> [root@SG-mysqlold-907 azureuser]# ps -ef | grep pacemaker<br>

> root      75332  50092  0 07:55 pts/0    00:00:00 grep pacemaker<br>

> <br>

> [root@SG-mysqlold-907 azureuser]# crm satus<br>

> ERROR: status: crm_mon (rc=107): Connection to cluster failed:<br>

> Transport endpoint is not connected<br>

> -------------------------------------------------------------------<br>

> ----------------------------------------------------------<br>

> <br>

> However, this does not seem to have any effect on the cluster<br>

> status as seen from the other nodes:<br>

> -------------------------------------------------------------------<br>

> --------------------------------------------------------<br>

> <br>

> [root@SG-mysqlold-909 azureuser]# crm status<br>

> Last updated: Thu Sep 27 07:56:17 2018          Last change: Thu Sep<br>

> 27 07:53:43 2018 by root via crm_attribute on SG-mysqlold-909<br>

> Stack: classic openais (with plugin)<br>

> Current DC: SG-mysqlold-908 (version 1.1.14-8.el6_8.1-70404b0) -<br>

> partition with quorum<br>

> 3 nodes and 3 resources configured, 3 expected votes<br>

> <br>

> Online: [ SG-mysqlold-907 SG-mysqlold-908 SG-mysqlold-909 ]<br>

<br>

It most definitely would make the node offline, and if fencing were<br>

configured, the rest of the cluster would fence the node to make sure<br>

it's safely down.<br>
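<br>
One rough way to confirm what the other nodes see is to pipe the status<br>
output into a small filter. This is only a sketch: the function name<br>
node_is_online is made up for illustration, and it assumes the<br>
"Online: [ ... ]" line format shown in the output above.<br>

```shell
# Exit 0 if the node named in $1 appears on the "Online: [ ... ]" line of
# cluster status text read from stdin; exit 1 otherwise.
# Assumes node names on that line are separated by single spaces.
node_is_online() {
    awk -v node="$1" '
        $1 == "Online:" {
            # Pad both strings with spaces so only whole names match.
            if (index(" " $0 " ", " " node " ")) found = 1
        }
        END { exit (found ? 0 : 1) }
    '
}
```

For example, on SG-mysqlold-908 the output of "crm status" could be piped<br>
into "node_is_online SG-mysqlold-907", repeated once a second right after<br>
the kill, to catch a short window in which the node drops out.<br>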

<br>

I see you're using the old corosync 1 plugin. I suspect what happened<br>

in this case is that corosync noticed the plugin died and restarted it<br>

quickly enough that it had rejoined by the time you checked the status<br>

elsewhere.<br>
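<br>
To test that suspicion, the daemon PIDs could be compared before and<br>
shortly after the kill; new PIDs would mean something respawned them.<br>
A minimal sketch, assuming the usual one-line-per-process "ps -ef" layout<br>
with the command in the eighth column (list_pacemaker_daemons is a<br>
hypothetical helper name, not part of pacemaker):<br>

```shell
# Read `ps -ef` output on stdin and print "PID PPID COMMAND" for pacemakerd
# and for any daemon started from /usr/libexec/pacemaker/.
# Filtering on the command column skips the "grep pacemaker" line itself.
list_pacemaker_daemons() {
    awk '$8 == "pacemakerd" || $8 ~ /^\/usr\/libexec\/pacemaker\// {
        print $2, $3, $8
    }'
}
```

Running "ps -ef | list_pacemaker_daemons" once before the kill and again<br>
a few seconds later gives two lists that can be compared side by side.<br>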

<br>

> <br>

> Full list of resources:<br>

> <br>

>  Master/Slave Set: ms_mysql [p_mysql]<br>

>      Masters: [ SG-mysqlold-909 ]<br>

>      Slaves: [ SG-mysqlold-907 SG-mysqlold-908 ]<br>

> <br>

> <br>

> [root@SG-mysqlold-908 azureuser]# crm status<br>

> Last updated: Thu Sep 27 07:56:08 2018          Last change: Thu Sep<br>

> 27 07:53:43 2018 by root via crm_attribute on SG-mysqlold-909<br>

> Stack: classic openais (with plugin)<br>

> Current DC: SG-mysqlold-908 (version 1.1.14-8.el6_8.1-70404b0) -<br>

> partition with quorum<br>

> 3 nodes and 3 resources configured, 3 expected votes<br>

> <br>

> Online: [ SG-mysqlold-907 SG-mysqlold-908 SG-mysqlold-909 ]<br>

> <br>

> Full list of resources:<br>

> <br>

>  Master/Slave Set: ms_mysql [p_mysql]<br>

>      Masters: [ SG-mysqlold-909 ]<br>

>      Slaves: [ SG-mysqlold-907 SG-mysqlold-908 ]<br>

> <br>

> -------------------------------------------------------------------<br>

> ---------------------------------------------------<br>

> <br>

> I am a bit surprised that the other nodes are not able to detect that<br>

> pacemaker is down on one of the nodes - SG-mysqlold-907 <br>

> <br>

> Even if I kill pacemaker on the node that is the DC, I observe the<br>

> same behavior: the rest of the nodes do not detect that the DC is down.<br>

> <br>

> Could someone explain what the expected behavior is in these cases?<br>

>  <br>

> I am using corosync 1.4.7 and pacemaker 1.1.14<br>

> <br>

> Thanks in advance<br>

> Prasad<br>

> <br>

> _______________________________________________<br>

> Users mailing list: <a href="mailto:Users@clusterlabs.org" target="_blank">Users@clusterlabs.org</a><br>

> <a href="https://lists.clusterlabs.org/mailman/listinfo/users" rel="noreferrer" target="_blank">https://lists.clusterlabs.org/mailman/listinfo/users</a><br>

> <br>

> Project Home: <a href="http://www.clusterlabs.org" rel="noreferrer" target="_blank">http://www.clusterlabs.org</a><br>

> Getting started: <a href="http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf" rel="noreferrer" target="_blank">http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf</a><br>

> Bugs: <a href="http://bugs.clusterlabs.org" rel="noreferrer" target="_blank">http://bugs.clusterlabs.org</a><br>

-- <br>

Ken Gaillot <<a href="mailto:kgaillot@redhat.com" target="_blank">kgaillot@redhat.com</a>><br>


</blockquote></div>

</blockquote></div>