[Pacemaker] Corosync fails to start when NIC is absent

Kostiantyn Ponomarenko konstantin.ponomarenko at gmail.com
Tue Jan 20 04:01:46 EST 2015


Got it. Thank you =)
I was just thinking about the possibility of a NIC burning out.

Thank you,
Kostya

On Tue, Jan 20, 2015 at 10:50 AM, Jan Friesse <jfriesse at redhat.com> wrote:

> Kostiantyn,
>
>
> > One more thing to clarify.
> > You said "rebind can be avoided" - what does it mean?
>
> By that I mean that as long as you don't shut down the interface,
> everything will work as expected. Shutting down an interface is an
> administrator's decision; the system doesn't do it automagically :)
>
> Regards,
>   Honza
>
> >
> > Thank you,
> > Kostya
> >
> > On Wed, Jan 14, 2015 at 1:31 PM, Kostiantyn Ponomarenko <konstantin.ponomarenko at gmail.com> wrote:
> >
> >> Thank you. Now I am aware of it.
> >>
> >> Thank you,
> >> Kostya
> >>
> >> On Wed, Jan 14, 2015 at 12:59 PM, Jan Friesse <jfriesse at redhat.com> wrote:
> >>
> >>> Kostiantyn,
> >>>
> >>>> Honza,
> >>>>
> >>>> Thank you for helping me.
> >>>> So, there is no defined behavior in case one of the interfaces is
> >>>> not in the system?
> >>>
> >>> You are right. There is no defined behavior.
> >>>
> >>> Regards,
> >>>   Honza
> >>>
> >>>
> >>>>
> >>>>
> >>>> Thank you,
> >>>> Kostya
> >>>>
> >>>> On Tue, Jan 13, 2015 at 12:01 PM, Jan Friesse <jfriesse at redhat.com> wrote:
> >>>>
> >>>>> Kostiantyn,
> >>>>>
> >>>>>
> >>>>>> According to https://access.redhat.com/solutions/638843, the
> >>>>>> interface that is defined in corosync.conf must be present in the
> >>>>>> system (see the "ROOT CAUSE" section at the bottom of the article).
> >>>>>> To confirm that, I ran a couple of tests.
> >>>>>>
> >>>>>> Here is part of the corosync.conf file, in free-form (the original
> >>>>>> config file is also attached):
> >>>>>> ===============================
> >>>>>> rrp_mode: passive
> >>>>>> ring0_addr is defined in corosync.conf
> >>>>>> ring1_addr is defined in corosync.conf
> >>>>>> ===============================
> >>>>>>
> >>>>>> -------------------------------
> >>>>>>
> >>>>>> Two-node cluster
> >>>>>>
> >>>>>> -------------------------------
> >>>>>>
> >>>>>> Test #1:
> >>>>>> --------------------------------------------------
> >>>>>> IP for ring0 is not defined in the system:
> >>>>>> --------------------------------------------------
> >>>>>> Start Corosync simultaneously on both nodes.
> >>>>>> Corosync fails to start.
> >>>>>> From the logs:
> >>>>>> Jan 08 09:43:56 [2992] A6-402-2 corosync error [MAIN ] parse error in config: No interfaces defined
> >>>>>> Jan 08 09:43:56 [2992] A6-402-2 corosync error [MAIN ] Corosync Cluster Engine exiting with status 8 at main.c:1343.
> >>>>>> Result: Corosync and Pacemaker are not running.
> >>>>>>
> >>>>>> Test #2:
> >>>>>> --------------------------------------------------
> >>>>>> IP for ring1 is not defined in the system:
> >>>>>> --------------------------------------------------
> >>>>>> Start Corosync simultaneously on both nodes.
> >>>>>> Corosync starts.
> >>>>>> Start Pacemaker simultaneously on both nodes.
> >>>>>> Pacemaker fails to start.
> >>>>>> From the logs, the last entries written by corosync:
> >>>>>> Jan 8 16:31:29 daemon.err<27> corosync[3728]: [TOTEM ] Marking ringid 0 interface 169.254.1.3 FAULTY
> >>>>>> Jan 8 16:31:30 daemon.notice<29> corosync[3728]: [TOTEM ] Automatically recovered ring 0
> >>>>>> Result: Corosync and Pacemaker are not running.
> >>>>>>
> >>>>>>
> >>>>>> Test #3:
> >>>>>>
> >>>>>> "rrp_mode: active" leads to the same result, except that the
> >>>>>> Corosync and Pacemaker init scripts return status "running".
> >>>>>> But "vim /var/log/cluster/corosync.log" still shows a lot of
> >>>>>> errors like:
> >>>>>> Jan 08 16:30:47 [4067] A6-402-1 cib: error: pcmk_cpg_dispatch: Connection to the CPG API failed: Library error (2)
> >>>>>>
> >>>>>> Result: Corosync and Pacemaker report their status as "running",
> >>>>>> but "crm_mon" cannot connect to the cluster database, and half of
> >>>>>> Pacemaker's services are not running (including the Cluster
> >>>>>> Information Base (CIB)).
> >>>>>>
> >>>>>>
> >>>>>> -------------------------------
> >>>>>>
> >>>>>> For a single node mode
> >>>>>>
> >>>>>> -------------------------------
> >>>>>>
> >>>>>> IP for ring0 is not defined in the system:
> >>>>>>
> >>>>>> Corosync fails to start.
> >>>>>>
> >>>>>> IP for ring1 is not defined in the system:
> >>>>>>
> >>>>>> Corosync and Pacemaker are started.
> >>>>>>
> >>>>>> It is possible that the configuration will be applied
> >>>>>> successfully (50%),
> >>>>>>
> >>>>>> and it is possible that the cluster is not running any resources,
> >>>>>>
> >>>>>> and it is possible that the node cannot be put into standby mode
> >>>>>> (shows: communication error),
> >>>>>>
> >>>>>> and it is possible that the cluster is running all resources, but
> >>>>>> the applied configuration is not guaranteed to be fully loaded
> >>>>>> (some rules can be missing).
> >>>>>>
> >>>>>>
> >>>>>> -------------------------------
> >>>>>>
> >>>>>> Conclusions:
> >>>>>>
> >>>>>> -------------------------------
> >>>>>>
> >>>>>> It is possible that in some rare cases (see comments to the bug)
> >>>>>> the cluster will work, but in that case its working state is
> >>>>>> unstable and the cluster can stop working at any moment.
> >>>>>>
> >>>>>>
> >>>>>> So, is this correct? Do my assumptions make any sense? I didn't
> >>>>>> find any other explanation on the net ... .
> >>>>>
> >>>>> Corosync needs all interfaces during start and runtime. This doesn't
> >>>>> mean they must be connected (this would make corosync unusable for
> >>>>> physical NIC/switch or cable failure), but they must be up and have
> >>>>> the correct IP.
> >>>>>
> >>>>> When this is not the case, corosync rebinds to localhost and weird
> >>>>> things happen. Removal of this rebinding is a long-standing TODO,
> >>>>> but there are still more important bugs (especially because the
> >>>>> rebind can be avoided).
> >>>>>
> >>>>> Regards,
> >>>>>   Honza
> >>>>>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> Thank you,
> >>>>>> Kostya
> >>>>>>
> >>>>>> On Fri, Jan 9, 2015 at 11:10 AM, Kostiantyn Ponomarenko <konstantin.ponomarenko at gmail.com> wrote:
> >>>>>>
> >>>>>>> Hi guys,
> >>>>>>>
> >>>>>>> Corosync fails to start if there is no such network interface
> >>>>>>> configured in the system.
> >>>>>>> Even with "rrp_mode: passive" the problem is the same when at
> >>>>>>> least one network interface is not configured in the system.
> >>>>>>>
> >>>>>>> Is this the expected behavior?
> >>>>>>> I thought that when you use redundant rings, it is enough to have
> >>>>>>> at least one NIC configured in the system. Am I wrong?
> >>>>>>>
> >>>>>>> Thank you,
> >>>>>>> Kostya
> >>>>>>>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> _______________________________________________
> >>>>>> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
> >>>>>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> >>>>>>
> >>>>>> Project Home: http://www.clusterlabs.org
> >>>>>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> >>>>>> Bugs: http://bugs.clusterlabs.org
> >>>>>>
> >>>>>
> >>>>>
> >>>>
> >>>>
> >>>>
> >>>
> >>>
> >>
> >>
> >
> >
> >
>
>
>
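
Honza's point in the thread is that every ring address in corosync.conf must
be present on the node (interface up, correct IP) both at start and at
runtime. A minimal sketch of a pre-start check along those lines follows. It
is not part of Corosync or Pacemaker; the function names are hypothetical,
and it assumes the ring addresses are passed in by hand rather than parsed
from corosync.conf. It only tests whether the kernel accepts each address as
local by binding a UDP socket to it, so it does not inspect link state, and
settings such as net.ipv4.ip_nonlocal_bind on Linux would defeat it.

```python
import errno
import socket


def is_local_address(addr):
    """Return True if a UDP socket can be bound to addr on this host.

    Hypothetical helper: a failed bind with EADDRNOTAVAIL is taken to mean
    that no local interface currently holds the address.
    """
    family = socket.AF_INET6 if ":" in addr else socket.AF_INET
    with socket.socket(family, socket.SOCK_DGRAM) as sock:
        try:
            sock.bind((addr, 0))  # port 0: let the kernel pick a free port
        except OSError as exc:
            if exc.errno == errno.EADDRNOTAVAIL:
                return False
            raise  # any other error is unexpected; do not hide it
    return True


def missing_ring_addresses(ring_addrs, is_local=is_local_address):
    """Return the ring addresses that are not assigned on this node."""
    return [addr for addr in ring_addrs if not is_local(addr)]


def main(argv):
    """Print missing addresses and return a shell-style exit code."""
    missing = missing_ring_addresses(argv)
    for addr in missing:
        print("ring address not present on this node: " + addr)
    return 1 if missing else 0
```

One possible use is to call main() with the ring0 and ring1 addresses of the
node from an init-script wrapper and to skip starting corosync when it
returns a non-zero code, for example main(["169.254.1.3", "169.254.2.3"]),
where the first address is the one from the logs above and the second is
made up for illustration.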