[ClusterLabs] Virtual ip resource restarted on node with down network device

Fri Sep 16 16:01:48 UTC 2016

On 09/16/2016 10:43 AM, Auer, Jens wrote:
> Hi,
> 
> thanks for the help.
> 
>> I'm not sure what you mean by "the device the virtual ip is attached
>> to", but a separate question is why the resource agent reported that
>> restarting the IP was successful, even though that device was
>> unavailable. If the monitor failed when the device was made unavailable,
>> I would expect the restart to fail as well.
> 
> I created the virtual ip with parameter nic=bond0, and this is the device I am bringing down
> and was referring to in my question. I think the current behavior is a little inconsistent: I bring
> down the device, and pacemaker recognizes this and restarts the resource. The monitor should
> then fail again, but it does not detect any problem.

That is odd. Pacemaker is just acting on what the resource agent
reports, so the issue will be in the agent. Agents are usually fairly
simple shell scripts, so you could just look at what it does, and try
running those commands manually and see what the results are.
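
As a rough sketch of what such a manual check might look like: this
assumes the agent's monitor essentially checks whether the address is
configured on the device, which I have not verified against the agent
here. The helper names and the address in the usage line are made up for
illustration; bond0 is the nic from your configuration.

```shell
#!/bin/sh
# Hypothetical helpers for checking by hand what a monitor might look at.
# Both read `ip -o ...` output on stdin, so they can be tried on saved output too.

# Succeeds if the IPv4 address $1 (without prefix length) appears in
# `ip -o -f inet addr show` output, where field 4 is "address/prefix".
has_addr() {
    awk -v ip="$1" '{ split($4, a, "/"); if (a[1] == ip) found = 1 }
                    END { exit !found }'
}

# Succeeds if `ip -o link show` output lists UP among the interface flags,
# i.e. the comma-separated list between "<" and ">".
link_is_up() {
    awk '{
        if (match($0, /<[^>]*>/)) {
            n = split(substr($0, RSTART + 1, RLENGTH - 2), flags, ",")
            for (i = 1; i <= n; i++) if (flags[i] == "UP") up = 1
        }
    } END { exit !up }'
}
```

On the node, usage would be something like
`ip -o -f inet addr show dev bond0 | has_addr 192.0.2.10` and
`ip -o link show dev bond0 | link_is_up`. If the first succeeds while the
second fails, the address is configured on a device that is down, which a
check of the address alone would not notice.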

There are also some pcs commands (pcs resource debug-start, pcs resource
debug-monitor, etc.) that run the agent directly, using the configuration
from the cluster.
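
A minimal sketch of how those could be wrapped, assuming the resource id
is virtual_ip (a made-up name, substitute the id of your own resource):

```shell
#!/bin/sh
# Run the agent's actions directly, using the configuration stored in the
# cluster. --full asks pcs for more detailed output.
# "virtual_ip" in the usage below is a hypothetical resource id.

debug_monitor() {
    pcs resource debug-monitor "$1" --full
}

debug_start() {
    pcs resource debug-start "$1" --full
}
```

Running `debug_monitor virtual_ip` while bond0 is down should show what
the agent itself reports in that state, independent of what the cluster
then does with the result.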

And you can look in the system log and pacemaker detail log around the
time of the incident for any interesting messages.

> Cheers,
>   Jens
> 
> --
> Jens Auer | CGI | Software-Engineer
> CGI (Germany) GmbH & Co. KG
> Rheinstraße 95 | 64295 Darmstadt | Germany
> T: +49 6151 36860 154
> jens.auer at cgi.com
> Our mandatory disclosures pursuant to § 35a GmbHG / §§ 161, 125a HGB are available at de.cgi.com/pflichtangaben.
> 
> ________________________________________
> From: Ken Gaillot [kgaillot at redhat.com]
> Sent: Friday, 16 September 2016 17:27
> To: users at clusterlabs.org
> Subject: Re: [ClusterLabs] Virtual ip resource restarted on node with down network device
> 
> On 09/16/2016 10:08 AM, Auer, Jens wrote:
>> Hi,
>>
>> I have configured an Active/Passive cluster to host a virtual ip
>> address. To test failovers, I shut down the device the virtual ip is
>> attached to and expected that it moves to the other node. However, the
>> virtual ip is detected as FAILED, but is then restarted on the same
>> node. I was able to solve this by using a ping resource which we want to
>> do anyway, but I am wondering why the resource is restarted on the node
>> and no failure is detected anymore.
> 
> If a *node* fails, pacemaker will recover all its resources elsewhere,
> if possible.
> 
> If a *resource* fails but the node is OK, the response is configurable,
> via the "on-fail" operation option and "migration-threshold" resource
> option.
> 
> By default, on-fail=restart for monitor operations, and
> migration-threshold=INFINITY. This means that if a monitor fails,
> pacemaker will attempt to restart the resource on the same node.
> 
> To get an immediate failover of the resource, set migration-threshold=1
> on the resource.
> 
> I'm not sure what you mean by "the device the virtual ip is attached
> to", but a separate question is why the resource agent reported that
> restarting the IP was successful, even though that device was
> unavailable. If the monitor failed when the device was made unavailable,
> I would expect the restart to fail as well.
> 
>>
>> On my setup, this is very easy to reproduce:
>> 1. Start cluster with virtual ip
>> 2. On the node hosting the virtual ip, bring down the network device
>> with ifdown
>> => The resource is detected as failed
>> => The resource is restarted
>> => No failures are detected from now on
>>
>> Best wishes,
>>   Jens