[ClusterLabs] Three node cluster becomes completely fenced if one node leaves

Seth Reid sreid at vendini.com
Fri Mar 31 13:10:40 EDT 2017


I can confirm that doing an ifdown is not the source of my corosync issues.
My cluster is in another state, so I can't pull a cable, but I can down a
port on a switch. That had exactly the same effect as doing an ifdown: two
machines got fenced when only one should have been.
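For anyone else testing remotely, another way to simulate a link failure
without ifdown (and without switch access) is to firewall off corosync's
totem traffic. This is a sketch, not something from this thread; the
5404:5405 port range assumes corosync's default mcastport and should be
adjusted to match your totem configuration:

```shell
# Sketch: emulate a network partition for fencing tests without ifdown,
# by dropping corosync totem traffic (assumes default ports 5404/5405).
block_corosync() {
    echo "iptables -A INPUT  -p udp --dport 5404:5405 -j DROP"
    echo "iptables -A OUTPUT -p udp --dport 5404:5405 -j DROP"
}

unblock_corosync() {
    echo "iptables -D INPUT  -p udp --dport 5404:5405 -j DROP"
    echo "iptables -D OUTPUT -p udp --dport 5404:5405 -j DROP"
}

# Print the commands for review; pipe to 'sh' as root to actually apply them.
block_corosync
```

Unlike ifdown, this leaves the interface and its addresses in place, so
corosync sees lost peers rather than a vanished interface.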

-------
Seth Reid
System Operations Engineer
Vendini, Inc.
415.349.7736
sreid at vendini.com
www.vendini.com


On Fri, Mar 31, 2017 at 4:12 AM, Dejan Muhamedagic <dejanmm at fastmail.fm>
wrote:

> Hi,
>
> On Fri, Mar 31, 2017 at 02:39:02AM -0400, Digimer wrote:
> > On 31/03/17 02:32 AM, Jan Friesse wrote:
> > >> The original message has the logs from nodes 1 and 3. Node 2, the one
> > >> that
> > >> got fenced in this test, doesn't really show much. Here are the logs
> from
> > >> it:
> > >>
> > >> Mar 24 16:35:10 b014 ntpd[2318]: Deleting interface #5 enp6s0f0,
> > >> 192.168.100.14#123, interface stats: received=0, sent=0, dropped=0,
> > >> active_time=3253 secs
> > >> Mar 24 16:35:10 b014 ntpd[2318]: Deleting interface #7 enp6s0f0,
> > >> fe80::a236:9fff:fe8a:6500%6#123, interface stats: received=0, sent=0,
> > >> dropped=0, active_time=3253 secs
> > >> Mar 24 16:35:13 b014 corosync[2166]: notice  [TOTEM ] A processor
> failed,
> > >> forming new configuration.
> > >> Mar 24 16:35:13 b014 corosync[2166]:  [TOTEM ] A processor failed,
> > >> forming
> > >> new configuration.
> > >> Mar 24 16:35:13 b014 corosync[2166]: notice  [TOTEM ] The network
> > >> interface
> > >> is down.
> > >
> > > This is the problem. Corosync handles ifdown really badly. If this
> > > was not intentional, it may have been caused by NetworkManager. In
> > > that case, please install the equivalent of the
> > > NetworkManager-config-server package (it's actually just one file,
> > > 00-server.conf, so you can extract it from, for example, the Fedora
> > > package
> > > https://www.rpmfind.net/linux/RPM/fedora/devel/rawhide/x86_64/n/NetworkManager-config-server-1.8.0-0.1.fc27.noarch.html)
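> > > For reference, the contents of 00-server.conf are roughly the
> > > following (verify against the current package before deploying):
> > >
> > >   [main]
> > >   no-auto-default=*
> > >   ignore-carrier=*
> > >
> > > ignore-carrier=* appears to be the setting that matters here: it
> > > stops NetworkManager from tearing down statically configured
> > > interfaces when carrier is lost.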
> >
> > ifdown'ing corosync's interface happens a lot, intentionally or
> > otherwise.
>
> I'm not sure, but I think that it can happen only intentionally,
> i.e. through human intervention. If there's another problem with
> the interface, it doesn't disappear from the system.
>
> Thanks,
>
> Dejan
>
> > I think it is reasonable to expect corosync to handle this
> > properly. How hard would it be to make corosync resilient to this fault
> > case?
> >
> > --
> > Digimer
> > Papers and Projects: https://alteeve.com/w/
> > "I am, somehow, less interested in the weight and convolutions of
> > Einstein’s brain than in the near certainty that people of equal talent
> > have lived and died in cotton fields and sweatshops." - Stephen Jay Gould
> >
> > _______________________________________________
> > Users mailing list: Users at clusterlabs.org
> > http://lists.clusterlabs.org/mailman/listinfo/users
> >
> > Project Home: http://www.clusterlabs.org
> > Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> > Bugs: http://bugs.clusterlabs.org
>
>