[ClusterLabs] Pacemaker issue when ethernet interface is pulled down

Digimer lists at alteeve.ca
Sun Feb 14 19:11:33 UTC 2016


On 14/02/16 09:48 AM, Debabrata Pani wrote:
> Hi Emmanuel,
> 
> Thank you for the suggestion.
> If I am getting it right, Fencing can be configured to shutdown the node
> on which the ethernet interface has gone down.
> And that appears to be a correct suggestion.
> But I have a few queries still.

Fencing works regardless of why communication with a node is lost (eth
down, hung, caught on fire...). Think of it this way: "Fencing puts a node
that has entered an unknown state into a known state" (usually 'off').

> Queries:
> * Is the test case "put down the ethernet interface" not a valid one?

Corosync reacts oddly to that. It's better to use an iptables rule to
block traffic (or to crash the node with something like 'echo c >
/proc/sysrq-trigger').
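
A quick sketch of what I mean (assuming the default corosync UDP ports
5404/5405 that Emmanuel mentions below, and that the box uses iptables;
adjust for your own firewall setup):

  # on the node under test, drop corosync traffic in both directions
  iptables -A INPUT  -p udp --dport 5404 -j DROP
  iptables -A INPUT  -p udp --dport 5405 -j DROP
  iptables -A OUTPUT -p udp --dport 5404 -j DROP
  iptables -A OUTPUT -p udp --dport 5405 -j DROP

  # or simulate a hard failure instead (sysrq must be enabled)
  echo 1 > /proc/sys/kernel/sysrq
  echo c > /proc/sysrq-trigger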

> * Why is the node unable to detect that it is cut off from the cluster and
> shut the services down as per the "no-quorum-policy" configuration?

In HA, you have to assume that a lost node could be doing anything. You
can't expect it to be operating predictably (as is truly the case in the
real world... imagine bad RAM and what that does to a system). If a
system stops responding, you need an external mechanism to remove it
(IPMI, cutting the power via a switched PDU, etc.).
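
If you want a starting point, something along these lines with pcs and an
IPMI fence agent (node names, addresses and credentials here are
placeholders; exact fence_ipmilan option names vary by version):

  # one stonith device per node, pointing at that node's IPMI/BMC
  pcs stonith create fence-node1 fence_ipmilan \
      pcmk_host_list="node1" ipaddr="10.0.0.11" \
      login="admin" passwd="secret" lanplus=1 \
      op monitor interval=60s

  # and then actually turn fencing on
  pcs property set stonith-enabled=true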

> 
> Regards,
> Debabrata
> 
> On 14/02/16 19:31, "emmanuel segura" <emi2fast at gmail.com> wrote:
> 
>> Use fencing; after you have configured fencing, use iptables to test your
>> cluster. With iptables you can block ports 5404 and 5405.
>>
>> 2016-02-14 14:09 GMT+01:00 Debabrata Pani <Debabrata.Pani at mobileum.com>:
>>> Hi,
>>> We ran into some problems when we pulled down the ethernet interface using
>>> "ifconfig eth0 down"
>>>
>>> Our cluster has the following configurations and resources
>>>
>>> Two network interfaces: eth0 and lo (loopback)
>>> 3 nodes, with one node put in maintenance mode
>>> no-quorum-policy=stop
>>> stonith-enabled=false
>>> PostgreSQL Master/Slave
>>> VIP master and VIP replication IPs
>>> VIPs will run on the node where the PostgreSQL Master is running
>>>
>>>
>>> Two test cases that we executed are as follows
>>>
>>> Introduce delay on the ethernet interface of the PostgreSQL PRIMARY node
>>> (command: tc qdisc add dev eth0 root netem delay 8000ms)
>>> `ifconfig eth0 down` on the PostgreSQL PRIMARY node
>>> We expected both of these test cases to test for network problems in the
>>> cluster
>>>
>>>
>>> In the first case (ethernet interface delay)
>>>
>>> Cluster is divided into "partition WITH quorum" and "partition WITHOUT
>>> quorum"
>>> Partition WITHOUT quorum shuts down all the services
>>> Partition WITH quorum takes over as Postgresql PRIMARY and VIPs
>>> Everything as expected. Wow !
>>>
>>>
>>> In the second case (ethernet interface down)
>>>
>>> We see lots of errors like the following . On the node
>>>
>>> Feb 12 14:09:48 corosync [MAIN  ] Totem is unable to form a cluster
>>> because
>>> of an operating system or network fault. The most common cause of this
>>> message is that the local firewall is configured improperly.
>>> Feb 12 14:09:49 corosync [MAIN  ] Totem is unable to form a cluster
>>> because
>>> of an operating system or network fault. The most common cause of this
>>> message is that the local firewall is configured improperly.
>>> Feb 12 14:09:51 corosync [MAIN  ] Totem is unable to form a cluster
>>> because
>>> of an operating system or network fault. The most common cause of this
>>> message is that the local firewall is configured improperly.
>>>
>>> But `crm_mon -Afr` (run on the node whose eth0 is down) always shows the
>>> cluster to be fully formed.
>>>
>>> It shows all the nodes as UP
>>> It shows itself as the one running the PostgreSQL PRIMARY (as was the case
>>> before the ethernet interface was put down)
>>>
>>> `crm_mon -Afr` on the OTHER nodes shows a different story
>>>
>>> They show the node whose eth0 is down as offline
>>> One of the other two nodes takes over as the PostgreSQL PRIMARY
>>>
>>> This leads to a split-brain situation, which was gracefully avoided in the
>>> test case where only "delay is introduced into the interface"
>>>
>>>
>>> Questions:
>>>
>>> Is it a known issue with Pacemaker when the ethernet interface is pulled
>>> down?
>>> Is it an incorrect way of testing the cluster? There is some information
>>> regarding this in the following thread:
>>> http://www.gossamer-threads.com/lists/linuxha/pacemaker/59738
>>>
>>>
>>> Regards,
>>> Deba
>>>
>>>
>>
>>
>>
>> -- 
>>  .~.
>>  /V\
>> //  \\
>> /(   )\
>> ^`~'^
>>
> 
> 
> _______________________________________________
> Users mailing list: Users at clusterlabs.org
> http://clusterlabs.org/mailman/listinfo/users
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
> 


-- 
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without
access to education?



