[ClusterLabs] Virtual ip resource restarted on node with down network device

Tue Sep 20 13:13:23 UTC 2016

Hi,

>> I've decided to create two answers for the two problems. The cluster
>> still fails to relocate the resource after unloading the modules even
>> with resource-agents 3.9.7
> From the point of view of the resource agent,
> you configured it to use a non-existing network.
> Which it considers to be a configuration error,
> which is treated by pacemaker as
> "don't try to restart anywhere
> but let someone else configure it properly, first".
> Still, I have yet to see what scenario you are trying to test here.
> To me, this still looks like "scenario evil admin".  If so, I'd not even
> try, at least not on the pacemaker configuration level.
It's not an "evil admin" scenario; that would not make sense. I am trying to find a way to force a failover condition, e.g. by simulating a network card defect or a network outage, without running to the server room every time.

>> CONFIDENTIALITY NOTICE:
> Oh please :-/
> This is a public mailing list.
Sorry, this is a standard disclaimer I usually remove. We are forced to add this to e-mails, but I think this is fairly common for commercial companies.

>> Also the netmask and the ip address are wrong. I have configured the
>> device to 192.168.120.10 with netmask 192.168.120.10. How does IpAddr2
>> get the wrong configuration? I have no idea.
> A netmask of "192.168.120.10" is nonsense.
> That is the address, not a mask.
Oops, my mistake when writing the e-mail. Obviously this is the address. The configured netmask for the device is 255.255.255.0, but after IPaddr2 brings it up again it is 255.255.255.255, which is not what I set in the network configuration.

> Also, according to some posts back,
> you have configured it in pacemaker with
> cidr_netmask=32, which is not particularly useful either.
Thanks for pointing this out. I copied the parameters from the manual/tutorial, but did not think about the values.
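
For reference, this would also explain the 255.255.255.255 I saw after IPaddr2 brought the address up: a prefix length of 32 is exactly that mask, while the 255.255.255.0 from my network configuration corresponds to a prefix length of 24. A small sketch to check the conversion (Python and its standard ipaddress module are just my choice for illustration, and the two helper names are made up; none of this comes from the resource agent):

```python
import ipaddress


def prefix_to_netmask(prefix_len: int) -> str:
    """Return the dotted-quad IPv4 netmask for a CIDR prefix length."""
    return str(ipaddress.IPv4Network(f"0.0.0.0/{prefix_len}").netmask)


def netmask_to_prefix(netmask: str) -> int:
    """Return the CIDR prefix length for a dotted-quad IPv4 netmask."""
    return ipaddress.IPv4Network(f"0.0.0.0/{netmask}").prefixlen


if __name__ == "__main__":
    # cidr_netmask=32 as copied from the tutorial
    print(prefix_to_netmask(32))
    # the netmask configured on the device itself
    print(netmask_to_prefix("255.255.255.0"))
```

So to match the device configuration, the value for cidr_netmask would presumably have to be 24 rather than 32.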

> Again: the IPaddr2 resource agent is supposed to control the assignment
> of an IP address, hence the name.
> It is not supposed to create or destroy network interfaces,
> or configure bonding, or bridges, or anything like that.
> In fact, it is not even supposed to bring up or down the interfaces,
> even though for "convenience" it seems to do "ip link set up".
This is what made me wonder in the beginning. When I bring down the device, this leads to a failure of the resource agent, which is exactly what I expected. I did not expect it to bring the device up again, and definitely not to ignore the default network configuration while doing so.

> Monitoring connectivity, or dealing with removed interface drivers,
> or unplugged devices, or whatnot, has to be dealt with elsewhere.
I am using a ping daemon for that. 

> What you did is: down the bond, remove all slave assignments, even
> remove the driver, and expect the resource agent to "heal" things that
> it does not know about. It can not.
I am not expecting the RA to heal anything. How could it? And why would I expect that? In fact, I am expecting the opposite, namely a consistent failure when the device is down. This expectation may also be wrong, because you can assign IP addresses to downed devices.

My initial expectation was that the resource cannot be started when the device is down and is then relocated. I think this is more or less the core functionality of the cluster. I can see a reason why it does not switch to another node when there is a configuration error in the cluster, because it is fair to assume that the configuration is identical (and equally wrong) on all nodes. But what happens if the network device is broken? Would the server start, fail to assign the IP address and then prevent the whole cluster from working? And what happens if the network card breaks while the cluster is running?
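
To make the distinction I am asking about concrete: as far as I understand the Pacemaker documentation, the return code of the resource agent decides whether a failure is retried, moved to another node, or blocked everywhere. The sketch below only lists three of the OCF return codes with the recovery class documented for them; the table and function names are my own, and which code IPaddr2 actually returns for a missing or broken device is exactly the part I am unsure about.

```python
# Subset of OCF return codes with the recovery class Pacemaker documents
# for them. The dictionary and function names are hypothetical helpers.
OCF_RECOVERY = {
    1: ("OCF_ERR_GENERIC", "soft"),
    5: ("OCF_ERR_INSTALLED", "hard"),
    6: ("OCF_ERR_CONFIGURED", "fatal"),
}

# What each recovery class means for the failed resource.
RECOVERY_ACTION = {
    "soft": "restart the resource or move it to another node",
    "hard": "move the resource elsewhere and do not retry on this node",
    "fatal": "stop the resource and do not start it on any node",
}


def describe(rc: int) -> str:
    """Describe how a failure with the given OCF return code is handled."""
    name, kind = OCF_RECOVERY[rc]
    return f"{name} ({kind}): {RECOVERY_ACTION[kind]}"


if __name__ == "__main__":
    for rc in sorted(OCF_RECOVERY):
        print(rc, describe(rc))
```

If a missing device is reported as a configuration error (code 6), that would match the "don't try to restart anywhere" behaviour described above, whereas a generic failure (code 1) would give the relocation I originally expected.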

Best wishes,
  Jens