[ClusterLabs] service network restart and corosync

Debabrata Pani Debabrata.Pani at mobileum.com
Tue Mar 29 07:49:47 UTC 2016


Hi Jan (Friesse),

I studied the issue mentioned in the GitHub URL.
It looks like the crash I am talking about is slightly different from the
one mentioned in the original issue. Maybe they are related, but I would
like to highlight my setup first, for clarity.

Three-node cluster; one node is in maintenance mode to prevent any
scheduling of resources on it.
=====
Stack: classic openais (with plugin)
Current DC: vm2cen66.mobileum.com - partition with quorum
Version: 1.1.11-97629de
3 Nodes configured, 3 expected votes
6 Resources configured


Node vm3cent66.mobileum.com: maintenance
Online: [ vm1cen66.mobileum.com vm2cen66.mobileum.com ]
====

I log in to vm1cen66 and do `ifdown eth0`.
On vm1cen66, I don't see any change in the crm_mon -Afr output.
It remains the same, as shown below:
====
Stack: classic openais (with plugin)
Current DC: vm2cen66.mobileum.com - partition with quorum
Version: 1.1.11-97629de
3 Nodes configured, 3 expected votes
6 Resources configured


Node vm3cent66.mobileum.com: maintenance
Online: [ vm1cen66.mobileum.com vm2cen66.mobileum.com ]
===
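
(Incidentally, the iptables approach suggested in the quoted discussion
below would presumably look something like this -- a sketch only, assuming
the default corosync mcastport of 5405:)
====
# block corosync totem traffic instead of taking the interface down
# (adjust the port to whatever mcastport is set in corosync.conf)
iptables -A INPUT -p udp --dport 5405 -j DROP
iptables -A OUTPUT -p udp --dport 5405 -j DROP
====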


But if we log in to the other nodes, vm2cen66 and vm3cent66, we can
correctly see that node vm1cen66 is offline.


But if we look into the corosync.log of vm1cen66, we see the following:

===
Mar 28 14:55:09 corosync [MAIN  ] Totem is unable to form a cluster
because of an operating system or network fault. The most common cause of
this message is that the local firewall is configured improperly.
pgsql(TestPostgresql)[28203]:   2016/03/28_14:55:10 INFO: Master does not
exist.
pgsql(TestPostgresql)[28203]:   2016/03/28_14:55:10 WARNING: My data is
out-of-date. status=DISCONNECT
Mar 28 14:55:11 corosync [MAIN  ] Totem is unable to form a cluster
because of an operating system or network fault. The most common cause of
this message is that the local firewall is configured improperly.
Mar 28 14:55:12 corosync [MAIN  ] Totem is unable to form a cluster
because of an operating system or network fault. The most common cause of
this message is that the local firewall is configured improperly.
======


The pgsql resource (the PostgreSQL resource agent) is running on this
particular node. I did a pgrep of the process and found it running. Not
attaching the logs for now.
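
(The check was along these lines -- a sketch, since the exact process
names depend on the installation:)
====
# confirm that the PostgreSQL server processes are still alive
pgrep -fl postgres
====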

The "crash" happens when the Ethernet interface is brought back up.
vm1cen66 is unable to reconnect to the cluster because corosync has
crashed, taking some Pacemaker processes along with it.
crm_mon also stops working (it was working previously, before the
interface was brought up).


I have to restart the corosync and pacemaker services to make the node
work again.


The main observation is that the node where the Ethernet interface is
down does not really "get" it. It assumes that the other nodes are still
online, although the logs do say that the interface is down.

Queries/Observations:
1- Node vm1cen66 should realise that the other nodes are offline, but it
does not.
2- From the discussion in the GitHub issue, it seems that in case of an
Ethernet failure we want the node to run as a single-node setup. Is that
so?
	2a. If that is the case, will it honour no-quorum-policy=ignore and stop
processes? (See the property sketch after this list.)
	2b. Or will it assume that it is a single-node cluster and decide
accordingly?
3- After taking the interface down, if we grep for the corosync port in
the netstat output, we see that the corosync process has now bound to the
loopback interface. Previously it was bound to the IP on eth0 (see the
netstat sketch after this list).
	Is this expected? As per the discussion it should be so. But the crash
did not happen immediately; it happens when we bring the Ethernet
interface back up.
	If corosync had crashed at that point, why were we still observing the
messages in corosync.log?
4- Is it possible to prevent the corosync crash that we witnessed when
the Ethernet interface is brought back up?
5- Will preventing the corosync crash really matter, given that node
vm1cen66 has now turned into a single-node cluster? Or will it
automatically re-bind to eth0 when the interface is brought back up?
	(I could not verify this because of the crash.)
6- What about the split-brain situation caused by Pacemaker not shutting
down the services on that single node?
	In a master-slave configuration this causes some confusion as to which
instance should be promoted to master after the node rejoins.
	As per the suggestion from the group, we need to configure STONITH for
this. Configuring STONITH seems to be the topmost priority in Pacemaker
clusters (see the fencing sketch after this list).
	But as far as I gather, do we need specialised hardware for this?
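
Regarding 2a: for reference, the property I mean is set like this (crm
shell syntax):
====
# the setting referenced in 2a above; "ignore" keeps resources running
# without quorum, while "stop" stops them in the quorumless partition
crm configure property no-quorum-policy=ignore
====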
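Regarding 3: the netstat check I mean is roughly the following (a sketch,
assuming the default mcastport of 5405):
====
# before ifdown eth0 the UDP socket is bound to the eth0 address;
# after ifdown it shows up bound to 127.0.0.1
netstat -anup | grep -E 'corosync|5405'
====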
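Regarding 6: for the record, a fencing primitive would presumably look
something like the sketch below. All names, addresses and credentials are
made up, and fence_ipmilan assumes IPMI-capable hardware (iLO/iDRAC/BMC);
other fence agents exist for virtual machines, so dedicated hardware is
only one option:
====
# made-up example values throughout -- adjust to the actual environment
crm configure primitive st-vm1 stonith:fence_ipmilan \
	params pcmk_host_list="vm1cen66.mobileum.com" \
	ipaddr="10.0.0.101" login="admin" passwd="secret"
crm configure property stonith-enabled=true
====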

Regards,
Debabrata Pani






On 03/03/16 13:46, "Jan Friesse" <jfriesse at redhat.com> wrote:

>
>> Hi,
>>
>> In our deployment, due to some requirement, we need to do a:
>> service network restart
>
>What is the exact reason for doing the network restart?
>
>>
>> Due to this, corosync crashes and the associated pacemaker processes
>> crash as well.
>>
>> As per the last comment on this issue,
>> -------
>> Corosync reacts oddly to that. It's better to use an iptables rule to
>> block traffic (or crash the node with something like 'echo c >
>> /proc/sysrq-trigge
>> --------
>>
>>
>>
>> But other network services, like Postgres, do not crash due to this
>> network service restart:
>> 	I can log in to psql and issue queries without any problem.
>>
>> In view of this, I would like to understand whether it is possible to
>> prevent a corosync (and corresponding Pacemaker) crash,
>> since Postgres somehow survives this restart.
>>
>> Any pointer to socket-level details of this behaviour will help me
>> understand (and explain to the stakeholders) the problem better.
>
>https://github.com/corosync/corosync/pull/32 should help.
>
>Regards,
>   Honza
>
>>
>> Regards,
>> Debabrata Pani
>>
>>
>>
>>
>>
>> _______________________________________________
>> Users mailing list: Users at clusterlabs.org
>> http://clusterlabs.org/mailman/listinfo/users
>>
>> Project Home: http://www.clusterlabs.org
>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>> Bugs: http://bugs.clusterlabs.org
>>
>
>
>_______________________________________________
>Users mailing list: Users at clusterlabs.org
>http://clusterlabs.org/mailman/listinfo/users
>
>Project Home: http://www.clusterlabs.org
>Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>Bugs: http://bugs.clusterlabs.org




