[ClusterLabs] Virtual ip resource restarted on node with down network device

Lars Ellenberg lars.ellenberg at linbit.com
Tue Sep 20 08:38:34 EDT 2016

On Tue, Sep 20, 2016 at 11:44:58AM +0000, Auer, Jens wrote:
> Hi,
> I've decided to create two answers for the two problems. The cluster
> still fails to relocate the resource after unloading the modules even
> with resource-agents 3.9.7

From the point of view of the resource agent,
you configured it to use a non-existing network.
It considers that a configuration error,
which pacemaker treats as
"don't try to restart anywhere,
but let someone else configure it properly first".

I think OCF_ERR_CONFIGURED is the right choice here, though; otherwise
configuration errors might go unnoticed for quite some time.
A network interface is not supposed to "vanish".

You may disagree with that choice,
in which case you could edit the resource agent to treat it not as a
configuration error, but as "required component not installed",
i.e. "try to find some other node with the required components available",
before giving up completely.
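If you want the "missing component" semantics, the change amounts to
returning a different OCF exit status from the interface check. A minimal,
hypothetical sketch of the idea (not the actual IPaddr2 source;
check_interface is an invented stand-in for the agent's real check):

```shell
#!/bin/sh
# OCF-defined exit codes (see the OCF resource agent API):
OCF_ERR_INSTALLED=5    # "required component missing": cluster may try other nodes
OCF_ERR_CONFIGURED=6   # "bad configuration": cluster will not retry anywhere

# Simplified stand-in for the agent's interface check.
check_interface() {
    nic="$1"
    if ip link show "$nic" >/dev/null 2>&1; then
        return 0
    fi
    # Original behaviour would be: return "$OCF_ERR_CONFIGURED"
    # Relaxed behaviour: report a missing component instead, so the
    # cluster tries the resource elsewhere before giving up.
    return "$OCF_ERR_INSTALLED"
}

check_interface no_such_nic0 || echo "missing nic -> exit status $?"
```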

Still, I have yet to see what scenario you are trying to test here.
To me, this still looks like "scenario evil admin".  If so, I'd not even
try, at least not on the pacemaker configuration level.


Oh please :-/
This is a public mailing list.

> There seems to be some difference because the device is not RUNNING;

> Also the netmask and the ip address are wrong. I have configured the
> device to with netmask How does IpAddr2
> get the wrong configuration? I have no idea.

A netmask of "" is nonsense.
That is the address, not a mask.

Also, according to your earlier posts in this thread,
you have configured it in pacemaker with
cidr_netmask=32, which is not particularly useful either.

You should use the netmask of whatever subnet is actually supposed to be
reachable via that address and interface. Typical masks are e.g.
/24, /20, /16 (255.255.255.0, 255.255.240.0, 255.255.0.0, respectively).
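For example, for an address on a /24 subnet, the resource definition with
pcs might look like this (hypothetical address and names; substitute your
own, this is a sketch, not your actual configuration):

```shell
# cidr_netmask should match the subnet's prefix length, not /32:
pcs resource create vip ocf:heartbeat:IPaddr2 \
    ip=192.0.2.10 cidr_netmask=24 nic=bond0 \
    op monitor interval=10s
```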

Apparently the RA is "nice" enough (or maybe buggy enough)
to let that slip, and guess the netmask from the routing tables,
or fall back to whatever builtin defaults there are on the various
layers of tools involved.

Again: the IPaddr2 resource agent is supposed to control the assignment
of an IP address, hence the name.

It is not supposed to create or destroy network interfaces,
or configure bonding, or bridges, or anything like that.

In fact, it is not even supposed to bring up or down the interfaces,
even though for "convenience" it seems to do "ip link set up".

That is not a bug, but limited scope.

If you wanted to test the reaction of the cluster to a vanishing
IP address, the correct test would be
  "ip addr del <address>/<prefix> dev bond0"
(with your actual address and prefix length filled in).

And the expectation is that it will notice, and just re-add the address.
That is the scope of the IPaddr2 resource agent.

Monitoring connectivity, or dealing with removed interface drivers,
or unplugged devices, or whatnot, has to be dealt with elsewhere.
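As an illustration, connectivity monitoring is usually done with a separate
ping resource plus a location constraint, along these lines (hypothetical
names and gateway address; a sketch, not a drop-in configuration):

```shell
# Clone a ping resource that pings a reference host from every node:
pcs resource create ping ocf:pacemaker:ping \
    host_list=192.0.2.1 dampen=5s \
    op monitor interval=10s --clone
# Keep the vip resource away from nodes that lost connectivity:
pcs constraint location vip rule score=-INFINITY \
    pingd lt 1 or not_defined pingd
```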

What you did is: down the bond, remove all slave assignments, even
remove the driver, and then expect the resource agent to "heal" things
that it does not know about. It cannot.

: Lars Ellenberg
: LINBIT | Keeping the Digital World Running
: DRBD -- Heartbeat -- Corosync -- Pacemaker
: R&D, Integration, Ops, Consulting, Support

DRBD® and LINBIT® are registered trademarks of LINBIT
