<html>

<head>

<meta http-equiv="Content-Type" content="text/html; charset=Windows-1252">

</head>

<body style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space; color: rgb(0, 0, 0); font-size: 14px; font-family: Calibri, sans-serif;">

<div>Hi,</div>

<div>We ran into some problems when we brought down the ethernet interface using “ifconfig eth0 down”.</div>

<div><br>

</div>

<div>Our cluster has the following configuration and resources:</div>

<ul>

<li>Two network interfaces: eth0 and lo (loopback)</li><li>3 nodes, one of which is in maintenance mode</li><li>no-quorum-policy=stop</li><li>stonith-enabled=false</li><li>PostgreSQL Master/Slave</li><li>vip master and vip replication IPs</li><li>The VIPs run on the node where the PostgreSQL Master is running</li></ul>

<div><br>

</div>

<div>The two test cases that we executed are as follows:</div>

<ul>

<li>Introduce a delay on the ethernet interface of the PostgreSQL PRIMARY node (command: tc qdisc add dev eth0 root netem delay 8000ms)</li><li>Run `ifconfig eth0 down` on the PostgreSQL PRIMARY node</li><li>We expected both of these test cases to simulate network problems in the cluster</li></ul>

<div><br>

</div>

<div>In the first case (ethernet interface delay) </div>

<ul>

<li>The cluster is divided into a “partition WITH quorum” and a “partition WITHOUT quorum”</li><li>The partition WITHOUT quorum shuts down all its services</li><li>The partition WITH quorum takes over the PostgreSQL PRIMARY role and the VIPs</li><li>Everything works as expected. Wow!</li></ul>
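
<div>For reference, the quorum arithmetic we had in mind for this case can be sketched roughly as below. This is only an illustration in Python, not anything Pacemaker or corosync runs; the function names are made up for this sketch, and it assumes one vote per node and that the node in maintenance mode still counts as a cluster member.</div>

<pre>
```python
def quorum_threshold(total_votes: int) -> int:
    """Smallest number of votes that forms a strict majority."""
    return total_votes // 2 + 1


def has_quorum(partition_votes: int, total_votes: int) -> bool:
    """True if this partition holds a strict majority of all votes."""
    return partition_votes >= quorum_threshold(total_votes)


def expected_action(partition_votes: int, total_votes: int) -> str:
    """What we expect a partition to do with no-quorum-policy=stop."""
    if has_quorum(partition_votes, total_votes):
        return "keep or take over resources"
    # Without quorum, the "stop" policy means all resources are stopped.
    return "stop all resources"


# Assumption: 3 nodes, one vote each (the maintenance node still votes).
TOTAL_VOTES = 3

isolated = expected_action(1, TOTAL_VOTES)  # the old PRIMARY, cut off alone
majority = expected_action(2, TOTAL_VOTES)  # the two nodes that still see each other

print("isolated PRIMARY node:", isolated)
print("remaining two nodes:", majority)
```
</pre>

<div>Under these assumptions the isolated node should stop its services and the two-node partition should take over, which matches what we saw in the delay test.</div>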

<div><br>

</div>

<div>In the second case (ethernet interface down)</div>

<ul>

<li>We see lots of errors like the following on the node:

<ul>

<li>Feb 12 14:09:48 corosync [MAIN  ] Totem is unable to form a cluster because of an operating system or network fault. The most common cause of this message is that the local firewall is configured improperly.</li><li>Feb 12 14:09:49 corosync [MAIN  ] Totem is unable to form a cluster because of an operating system or network fault. The most common cause of this message is that the local firewall is configured improperly.</li><li>Feb 12 14:09:51 corosync [MAIN  ] Totem is unable to form a cluster because of an operating system or network fault. The most common cause of this message is that the local firewall is configured improperly.</li></ul>

</li><li>But `crm_mon -Afr` (run from the node whose eth0 is down) always shows the cluster as fully formed:

<ul>

<li>It shows all the nodes as UP</li><li>It shows itself as the one running the PostgreSQL PRIMARY (as was the case before the ethernet interface was brought down)</li></ul>

</li><li>`crm_mon -Afr` on the OTHER nodes tells a different story:

<ul>

<li>They show the node whose eth0 is down as down</li><li>One of the other two nodes takes over as the PostgreSQL PRIMARY</li></ul>

</li><li>This leads to a split-brain situation, which was gracefully avoided in the test case where only a delay was introduced on the interface</li></ul>

<div><br>

</div>

<div>Questions:</div>

<ul>

<li>Is this a known issue with Pacemaker when the ethernet interface is brought down?</li><li>Is this an incorrect way of testing the cluster? There is some related information in this thread:

<a href="http://www.gossamer-threads.com/lists/linuxha/pacemaker/59738">http://www.gossamer-threads.com/lists/linuxha/pacemaker/59738</a>  </li></ul>

<div><br>

</div>

<div>Regards,</div>

<div>Deba</div>

<div><br>

</div>

</body>

</html>