[ClusterLabs] Three node cluster becomes completely fenced if one node leaves

Jan Friesse jfriesse at redhat.com
Fri Mar 31 06:54:44 UTC 2017


Digimer wrote:
> On 31/03/17 02:32 AM, Jan Friesse wrote:
>>> The original message has the logs from nodes 1 and 3. Node 2, the one
>>> that
>>> got fenced in this test, doesn't really show much. Here are the logs from
>>> it:
>>>
>>> Mar 24 16:35:10 b014 ntpd[2318]: Deleting interface #5 enp6s0f0,
>>> 192.168.100.14#123, interface stats: received=0, sent=0, dropped=0,
>>> active_time=3253 secs
>>> Mar 24 16:35:10 b014 ntpd[2318]: Deleting interface #7 enp6s0f0,
>>> fe80::a236:9fff:fe8a:6500%6#123, interface stats: received=0, sent=0,
>>> dropped=0, active_time=3253 secs
>>> Mar 24 16:35:13 b014 corosync[2166]: notice  [TOTEM ] A processor failed,
>>> forming new configuration.
>>> Mar 24 16:35:13 b014 corosync[2166]:  [TOTEM ] A processor failed,
>>> forming new configuration.
>>> Mar 24 16:35:13 b014 corosync[2166]: notice  [TOTEM ] The network
>>> interface is down.
>>
>> This is the problem. Corosync handles ifdown really badly. If this was
>> not intentional, it may have been caused by NetworkManager. In that case,
>> please install the equivalent of the NetworkManager-config-server package
>> (it's actually just one file called 00-server.conf, so you can extract it
>> from, for example, the Fedora package
>> https://www.rpmfind.net/linux/RPM/fedora/devel/rawhide/x86_64/n/NetworkManager-config-server-1.8.0-0.1.fc27.noarch.html)
>
> ifdown'ing corosync's interface happens a lot, intentionally or
> otherwise. I think it is reasonable to expect corosync to handle this
> properly. How hard would it be to make corosync resilient to this fault
> case?

Really hard. Knet (that is, whatever becomes corosync 3.x) should solve
this issue.
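
For reference, the 00-server.conf shipped in NetworkManager-config-server
looks roughly like this (treat it as a sketch; the exact comments and
contents can differ between NetworkManager versions):

    # Server-style defaults: don't auto-create connections for new
    # interfaces and don't react to carrier (link) changes.
    [main]
    no-auto-default=*
    ignore-carrier=*

On distributions that ship the package you can simply install it (e.g.
"yum install NetworkManager-config-server"). If you extract the file by
hand instead, dropping it into /etc/NetworkManager/conf.d/ and restarting
NetworkManager should be enough; the ignore-carrier setting is what should
keep NetworkManager from removing the interface's address configuration
(and with it corosync's binding) when the link drops.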




