[ClusterLabs] service network restart and corosync

Jan Friesse jfriesse at redhat.com
Tue Mar 29 08:54:49 UTC 2016


> Hi (Jan Friesse)
>
> I studied the issue mentioned in the GitHub URL.
> It looks like the crash that I am talking about is slightly different from
> the one mentioned in the original issue. Maybe they are related, but I would
> like to highlight my setup for clarity.
>
> Three-node cluster, with one node in maintenance mode to prevent any
> scheduling of resources.
> =====
> Stack: classic openais (with plugin)

^^ I'm pretty sure you don't want to use plugin-based pcmk.
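For reference, the plugin-based stack is usually what you get from a corosync
1.x service block roughly like the one below (path and values are a typical
example, not necessarily your config). Setting ver: 1, or better moving to
corosync 2.x, runs pacemaker on its own instead of as a plugin:

    # /etc/corosync/service.d/pcmk (corosync 1.x)
    service {
        name: pacemaker
        ver: 0    # 0 = load pacemaker as a corosync plugin ("classic openais (with plugin)")
                  # 1 = start pacemaker separately (preferred)
    }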

> Current DC: vm2cen66.mobileum.com - partition with quorum
> Version: 1.1.11-97629de
> 3 Nodes configured, 3 expected votes
> 6 Resources configured
>
>
> Node vm3cent66.mobileum.com: maintenance
> Online: [ vm1cen66.mobileum.com vm2cen66.mobileum.com ]
> ====
>
> I login to vm1cen66 and do `ifdown eth0`
> In vm1cen66, I don't see any change in the crm_mon -Afr output.
> It remains the same, as shown below
> ====
> Stack: classic openais (with plugin)
> Current DC: vm2cen66.mobileum.com - partition with quorum
> Version: 1.1.11-97629de
> 3 Nodes configured, 3 expected votes
> 6 Resources configured
>
>
> Node vm3cent66.mobileum.com: maintenance
> Online: [ vm1cen66.mobileum.com vm2cen66.mobileum.com ]
> ===
>
>
> But if we log in to the other nodes, like vm2cen66 and vm3cent66, we can
> correctly see that the node vm1cen66 is offline.

That is expected

>
>
> But if we look into the corosync.log of vm1cen66 we see the following
>
> ===
> Mar 28 14:55:09 corosync [MAIN  ] Totem is unable to form a cluster
> because of an operating system or network fault. The most common cause of
> this message is that the local firewall is configured improperly.
> pgsql(TestPostgresql)[28203]:   2016/03/28_14:55:10 INFO: Master does not
> exist.
> pgsql(TestPostgresql)[28203]:   2016/03/28_14:55:10 WARNING: My data is
> out-of-date. status=DISCONNECT
> Mar 28 14:55:11 corosync [MAIN  ] Totem is unable to form a cluster
> because of an operating system or network fault. The most common cause of
> this message is that the local firewall is configured improperly.
> Mar 28 14:55:12 corosync [MAIN  ] Totem is unable to form a cluster
> because of an operating system or network fault. The most common cause of
> this message is that the local firewall is configured improperly.
> ======
>

This is a result of ifdown. Just don't do that.

What exact version of corosync are you using?
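(Something like "corosync -v" or "rpm -q corosync" should show it.)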

>
> The pgsql resource (the PostgreSQL resource agent) is running on this
> particular node. I did a pgrep of the process and found it running. Not
> attaching the logs for now.
>
> The "crash" happens when the ethernet interface is brought up. vm1cen66 is
> unable to reconnect to the cluster because corosync has crashed, taking
> some pacemaker processes along with it.
> crm_mon also stops working (it was working previously, before bringing the
> interface up).
>
>
> I have to restart the corosync and pacemaker services to make it work
> again.

That's why I keep saying don't do ifdown.
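If you need to simulate a network failure for testing, blocking the totem
traffic with iptables is much safer than ifdown. A rough sketch, assuming the
default mcastport 5405/udp (adjust to whatever your corosync.conf uses):

    # simulate the failure
    iptables -A INPUT  -p udp --dport 5405 -j DROP
    iptables -A OUTPUT -p udp --dport 5405 -j DROP
    # undo it later
    iptables -D INPUT  -p udp --dport 5405 -j DROP
    iptables -D OUTPUT -p udp --dport 5405 -j DROP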

>
>
> The main observation is that the node where the ethernet interface is
> down does not really "get" it. It assumes that the other nodes are still
> online, although the logs do say that the interface is down.
>
> Queries/Observations:
> 1- node vm1cen66 should realise that the other nodes are offline

That would be correct behavior, yes.

> 2- From the discussion in the github issue it seems that in case of
> ethernet failure we want it to run as a single-node setup. Is that so?

Not exactly. It should behave as if all the other nodes had gone down.

> 	2a. If that is the case, will it honour no-quorum-policy=ignore and stop
> processes?
> 	2b. Or will it assume that it is a single-node cluster and decide
> accordingly?
> 3- After taking the interface down, if we grep for the corosync port in the
> netstat output, we see that the corosync process is now bound to the
> loopback interface. Previously it was bound to the IP on eth0.
> 	Is this expected? As per the discussion it should be so. But the crash
> did not happen immediately. It happens when we bring the ethernet
> interface up.

This is expected.
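You can watch the rebinding with something like:

    netstat -anup | grep corosync     # or: ss -unap | grep corosync

Before the ifdown it shows the eth0 address, afterwards 127.0.0.1.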

> 	If corosync did crash, why were we still observing the logs in corosync.log?
> 4- Is it possible to prevent the corosync crash that we witnessed when the
> ethernet interface is brought up?

Nope. Just don't do ifdown.

> 5- Will preventing the corosync crash really matter, given that the node
> vm1cen66 has now converted into a single-node cluster? Or will it
> automatically re-bind to eth0 when the interface is brought up?
> 	(Could not verify because of the crash.)

It rebinds to eth0, sends wrong information to the other nodes and totally
destroys membership. Again, just don't do ifdown.

> 6- What about the split-brain situation due to pacemaker not shutting down
> the services on that single node?
> 	In a master-slave configuration this causes some confusion as to which
> instance should be made the master after the node joins back.
> 	As per the suggestion from the group, we need to configure stonith for
> it. Configuring stonith seems to be the topmost priority in pacemaker
> clusters.

It's not exactly the topmost priority, but it's an easy way to solve many
problems.

> 	But as far as I gather, we need specialised hardware for this?

I believe there are also SW-based stonith agents (even though they are not
that reliable, so not exactly recommended). Also, most servers have at
least IPMI.
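With IPMI, a fencing resource is typically something along these lines (crmsh
syntax; the address, credentials and host name below are placeholders, not
taken from your setup):

    crm configure primitive fence-vm1 stonith:fence_ipmilan \
        params ipaddr=192.168.1.101 login=admin passwd=secret \
               pcmk_host_list=vm1cen66.mobileum.com
    crm configure property stonith-enabled=true

One such resource per node, plus a location constraint so a node never runs
its own fencing device.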

And one last recommendation: don't do ifdown.

Regards,
   Honza

>
> Regards,
> Debabrata Pani
>
>
>
>
>
>
> On 03/03/16 13:46, "Jan Friesse" <jfriesse at redhat.com> wrote:
>
>>
>>> Hi,
>>>
>>> In our deployment, due to some requirement, we need to do a :
>>> service network restart
>>
>> What is exact reason for doing network restart?
>>
>>>
>>> Due to this, corosync crashes and the associated pacemaker processes
>>> crash as well.
>>>
>>> As per the last comment on this issue,
>>> -------
>>> Corosync reacts oddly to that. It's better to use an iptables rule to
>>> block traffic (or crash the node with something like 'echo c >
>>> /proc/sysrq-trigger').
>>> --------
>>>
>>>
>>>
>>> But other network services, like Postgres, do not crash due to this
>>> network service restart:
>>> 	I can log in to psql and issue queries without any problem.
>>>
>>> In view of this, I would like to understand whether it is possible to
>>> prevent a corosync (and corresponding Pacemaker) crash, since Postgres
>>> is somehow surviving this restart.
>>>
>>> Any pointer to socket-level details for this behaviour will help me
>>> understand (and explain to the stakeholders) the problem better.
>>
>> https://github.com/corosync/corosync/pull/32 should help.
>>
>> Regards,
>>    Honza
>>
>>>
>>> Regards,
>>> Debabrata Pani
>>>
>>>
>>>
>>>
>>>
> _______________________________________________
> Users mailing list: Users at clusterlabs.org
> http://clusterlabs.org/mailman/listinfo/users
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
>




