[ClusterLabs] Three node cluster becomes completely fenced if one node leaves

Digimer lists at alteeve.ca
Fri Mar 31 02:39:02 EDT 2017


On 31/03/17 02:32 AM, Jan Friesse wrote:
>> The original message has the logs from nodes 1 and 3. Node 2, the one that
>> got fenced in this test, doesn't really show much. Here are the logs from it:
>>
>> Mar 24 16:35:10 b014 ntpd[2318]: Deleting interface #5 enp6s0f0, 192.168.100.14#123, interface stats: received=0, sent=0, dropped=0, active_time=3253 secs
>> Mar 24 16:35:10 b014 ntpd[2318]: Deleting interface #7 enp6s0f0, fe80::a236:9fff:fe8a:6500%6#123, interface stats: received=0, sent=0, dropped=0, active_time=3253 secs
>> Mar 24 16:35:13 b014 corosync[2166]: notice  [TOTEM ] A processor failed, forming new configuration.
>> Mar 24 16:35:13 b014 corosync[2166]:  [TOTEM ] A processor failed, forming new configuration.
>> Mar 24 16:35:13 b014 corosync[2166]: notice  [TOTEM ] The network interface is down.
> 
> This is the problem. Corosync handles ifdown really badly. If this was
> not intentional, it may have been caused by NetworkManager. In that case,
> please install the equivalent of the NetworkManager-config-server package
> (it's actually just one file called 00-server.conf, so you can extract it
> from, for example, the Fedora package
> https://www.rpmfind.net/linux/RPM/fedora/devel/rawhide/x86_64/n/NetworkManager-config-server-1.8.0-0.1.fc27.noarch.html)
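
For anyone following along, that file is tiny; from memory it is roughly the
following, dropped into NetworkManager's conf.d directory, so treat this as a
sketch rather than the exact packaged content:

[main]
# don't auto-create default connections on unconfigured devices
no-auto-default=*
# keep the IP configuration in place even when the link loses carrier
ignore-carrier=*

The ignore-carrier=* line is the relevant bit here, since it is what stops
NetworkManager from flushing the interface when the link drops.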

ifdown'ing corosync's interface happens a lot, intentionally or
otherwise. I think it is reasonable to expect corosync to handle this
properly. How hard would it be to make corosync resilient to this fault
case?
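
If anyone wants to reproduce what we're seeing, something along these lines
should do it (interface name taken from the logs above, adjust as needed):

# take the totem interface down by hand
ip link set enp6s0f0 down
# then ask corosync what it thinks of its ring
corosync-cfgtool -s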

-- 
Digimer
Papers and Projects: https://alteeve.com/w/
"I am, somehow, less interested in the weight and convolutions of
Einstein’s brain than in the near certainty that people of equal talent
have lived and died in cotton fields and sweatshops." - Stephen Jay Gould



