[ClusterLabs] Virtual ip resource restarted on node with down network device

Ken Gaillot kgaillot at redhat.com
Tue Sep 20 14:53:16 UTC 2016


On 09/20/2016 06:39 AM, Auer, Jens wrote:
> Hi,
> 
> I've updated to resource-agents 3.9.7, which is the latest stable version, but I am still seeing the same issues.

Are you still getting the "Preventing mda-ip from re-starting anywhere"
message? I don't see that here. If that's gone, then it's one step
forward anyway.

> MDA1PFP-S01 11:31:40 2495 130 ~ # yum list resource-agents
> Loaded plugins: langpacks, product-id, search-disabled-repos, subscription-manager
> Installed Packages
> resource-agents.x86_64                                                                                    3.9.7-4.el7                                                                                    @/resource-agents-3.9.7-4.el7.x86_64
> 
> ifdown still shows the same behavior. Initially, I can see two IP addresses assigned to device bond0. After doing "ifdown bond0" on the command line, Pacemaker restarts the resource "successfully" but does not assign the node's own (default) IP address back to the device:
> 25: bond0: <NO-CARRIER,BROADCAST,MULTICAST,MASTER,UP> mtu 1500 qdisc noqueue state DOWN qlen 30000
>     link/ether 46:0a:be:70:36:11 brd ff:ff:ff:ff:ff:ff
>     inet 192.168.120.20/32 scope global bond0
>        valid_lft forever preferred_lft forever
> 
> The log says that IPaddr2 assigns 192.168.120.20 to bond0, but nothing else:
> Sep 20 11:34:25 MDA1PFP-S01 kernel: bond0: Removing slave eno49
> Sep 20 11:34:25 MDA1PFP-S01 kernel: bond0: Releasing active interface eno49
> Sep 20 11:34:25 MDA1PFP-S01 kernel: bond0: the permanent HWaddr of eno49 - 5c:b9:01:9c:e7:fc - is still in use by bond0 - set the HWaddr of eno49 to a different address to avoid conflicts
> Sep 20 11:34:25 MDA1PFP-S01 kernel: bond0: making interface eno50 the new active one
> Sep 20 11:34:25 MDA1PFP-S01 kernel: ixgbe 0000:04:00.0: removed PHC on eno49
> Sep 20 11:34:25 MDA1PFP-S01 NetworkManager[881]: <info>  (bond0): bond slave eno49 was released
> Sep 20 11:34:25 MDA1PFP-S01 NetworkManager[881]: <info>  (eno49): released from master bond0
> Sep 20 11:34:26 MDA1PFP-S01 kernel: bond0: Removing slave eno50
> Sep 20 11:34:26 MDA1PFP-S01 kernel: bond0: Releasing active interface eno50
> Sep 20 11:34:26 MDA1PFP-S01 kernel: ixgbe 0000:04:00.1: removed PHC on eno50
> Sep 20 11:34:26 MDA1PFP-S01 NetworkManager[881]: <info>  (bond0): bond slave eno50 was released
> Sep 20 11:34:26 MDA1PFP-S01 NetworkManager[881]: <info>  (eno50): released from master bond0
> Sep 20 11:34:26 MDA1PFP-S01 NetworkManager[881]: <info>  (eno50): link disconnected
> Sep 20 11:34:26 MDA1PFP-S01 NetworkManager[881]: <info>  (bond0): link disconnected
> Sep 20 11:34:26 MDA1PFP-S01 avahi-daemon[912]: Withdrawing address record for 192.168.120.10 on bond0.
> Sep 20 11:34:26 MDA1PFP-S01 avahi-daemon[912]: Leaving mDNS multicast group on interface bond0.IPv4 with address 192.168.120.10.
> Sep 20 11:34:26 MDA1PFP-S01 avahi-daemon[912]: Joining mDNS multicast group on interface bond0.IPv4 with address 192.168.120.20.
> Sep 20 11:34:26 MDA1PFP-S01 avahi-daemon[912]: Withdrawing address record for 192.168.120.20 on bond0.
> Sep 20 11:34:26 MDA1PFP-S01 avahi-daemon[912]: Leaving mDNS multicast group on interface bond0.IPv4 with address 192.168.120.20.
> Sep 20 11:34:26 MDA1PFP-S01 avahi-daemon[912]: Interface bond0.IPv4 no longer relevant for mDNS.
> Sep 20 11:34:26 MDA1PFP-S01 avahi-daemon[912]: Withdrawing address record for fe80::5eb9:1ff:fe9c:e7fc on bond0.
> Sep 20 11:34:29 MDA1PFP-S01 corosync[30167]: [TOTEM ] Retransmit List: 7e
> Sep 20 11:34:29 MDA1PFP-S01 corosync[30167]: [TOTEM ] Retransmit List: 7e
> Sep 20 11:34:29 MDA1PFP-S01 corosync[30167]: [TOTEM ] Marking ringid 1 interface 192.168.120.10 FAULTY
> Sep 20 11:34:29 MDA1PFP-S01 corosync[30167]: [TOTEM ] Retransmit List: 7e
> Sep 20 11:34:29 MDA1PFP-S01 IPaddr2(mda-ip)[32025]: INFO: IP status = no, IP_CIP=
> Sep 20 11:34:29 MDA1PFP-S01 crmd[30188]:  notice: Operation mda-ip_stop_0: ok (node=MDA1PFP-PCS01, call=9, rc=0, cib-update=17, confirmed=true)
> Sep 20 11:34:29 MDA1PFP-S01 IPaddr2(mda-ip)[32072]: INFO: Adding inet address 192.168.120.20/32 to device bond0
> Sep 20 11:34:29 MDA1PFP-S01 IPaddr2(mda-ip)[32072]: INFO: Bringing device bond0 up
> Sep 20 11:34:29 MDA1PFP-S01 kernel: IPv6: ADDRCONF(NETDEV_UP): bond0: link is not ready
> Sep 20 11:34:29 MDA1PFP-S01 avahi-daemon[912]: Joining mDNS multicast group on interface bond0.IPv4 with address 192.168.120.20.
> Sep 20 11:34:29 MDA1PFP-S01 avahi-daemon[912]: New relevant interface bond0.IPv4 for mDNS.
> Sep 20 11:34:29 MDA1PFP-S01 avahi-daemon[912]: Registering new address record for 192.168.120.20 on bond0.IPv4.
> Sep 20 11:34:29 MDA1PFP-S01 IPaddr2(mda-ip)[32072]: INFO: /usr/libexec/heartbeat/send_arp -i 200 -r 5 -p /var/run/resource-agents/send_arp-192.168.120.20 bond0 192.168.120.20 auto not_used not_used
> Sep 20 11:34:29 MDA1PFP-S01 crmd[30188]:  notice: Operation mda-ip_start_0: ok (node=MDA1PFP-PCS01, call=10, rc=0, cib-update=18, confirmed=true)

Pacemaker can only act on what it gets from the resource agent. Here,
the agent is reporting success, which from its point of view is probably
correct -- it can add the IP.

I do think ifdown is not quite the best failure simulation, since there
aren't many real-world situations that merely take an interface down. To
simulate network loss (without pulling the cable), I think using the
firewall to block all traffic to and from the interface might be better.
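
For example, something along these lines (just an illustration, untested
here; adapt it to whatever firewall tooling you actually use):

  # block all traffic through bond0 to simulate a network outage
  iptables -A INPUT  -i bond0 -j DROP
  iptables -A OUTPUT -o bond0 -j DROP

  # remove the rules again once the test is done
  iptables -D INPUT  -i bond0 -j DROP
  iptables -D OUTPUT -o bond0 -j DROP

That leaves the interface and its addresses in place, which is closer to
what a real network outage looks like from the resource agent's point of
view.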

Of course, it is useful to trace how the cluster reacts to ifdown too,
but maybe it's not as useful as a network failure test.

> The VIP is reachable locally, but not from other hosts:
> MDA1PFP-S01 11:36:12 2526 0 ~ # ping 192.168.120.20
> PING 192.168.120.20 (192.168.120.20) 56(84) bytes of data.
> 64 bytes from 192.168.120.20: icmp_seq=1 ttl=64 time=0.027 ms
> 64 bytes from 192.168.120.20: icmp_seq=2 ttl=64 time=0.016 ms
> 64 bytes from 192.168.120.20: icmp_seq=3 ttl=64 time=0.029 ms
> 
> MDA1PFP-S02 11:33:31 1273 0 ~ # ping 192.168.120.20
> PING 192.168.120.20 (192.168.120.20) 56(84) bytes of data.
> From 192.168.120.11 icmp_seq=10 Destination Host Unreachable
> From 192.168.120.11 icmp_seq=11 Destination Host Unreachable
> From 192.168.120.11 icmp_seq=12 Destination Host Unreachable
> 
> Best wishes,
>   Jens
> 
> 
> --
> Jens Auer | CGI | Software-Engineer
> CGI (Germany) GmbH & Co. KG
> Rheinstraße 95 | 64295 Darmstadt | Germany
> T: +49 6151 36860 154
> jens.auer at cgi.com
> Our mandatory disclosures pursuant to § 35a GmbHG / §§ 161, 125a HGB can be found at de.cgi.com/pflichtangaben.
> 
> CONFIDENTIALITY NOTICE: Proprietary/Confidential information belonging to CGI Group Inc. and its affiliates may be contained in this message. If you are not a recipient indicated or intended in this message (or responsible for delivery of this message to such person), or you think for any reason that this message may have been addressed to you in error, you may not use or copy or deliver this message to anyone else. In such case, you should destroy this message and are asked to notify the sender by reply e-mail.
> 
> ________________________________________
> From: Ken Gaillot [kgaillot at redhat.com]
> Sent: Monday, 19 September 2016 17:31
> To: users at clusterlabs.org
> Subject: Re: [ClusterLabs] Virtual ip resource restarted on node with down network device
> 
> On 09/19/2016 10:04 AM, Jan Pokorný wrote:
>> On 19/09/16 10:18 +0000, Auer, Jens wrote:
>>> Ok, after reading the log files again I found
>>>
>>> Sep 19 10:03:45 MDA1PFP-S01 crmd[7797]:  notice: Initiating action 3: stop mda-ip_stop_0 on MDA1PFP-PCS01 (local)
>>> Sep 19 10:03:45 MDA1PFP-S01 crmd[7797]:  notice: MDA1PFP-PCS01-mda-ip_monitor_1000:14 [ ocf-exit-reason:Unknown interface [bond0] No such device.\n ]
>>> Sep 19 10:03:45 MDA1PFP-S01 IPaddr2(mda-ip)[8745]: ERROR: Unknown interface [bond0] No such device.
>>> Sep 19 10:03:45 MDA1PFP-S01 IPaddr2(mda-ip)[8745]: WARNING: [findif] failed
>>> Sep 19 10:03:45 MDA1PFP-S01 lrmd[7794]:  notice: mda-ip_stop_0:8745:stderr [ ocf-exit-reason:Unknown interface [bond0] No such device. ]
>>> Sep 19 10:03:45 MDA1PFP-S01 crmd[7797]:  notice: Operation mda-ip_stop_0: ok (node=MDA1PFP-PCS01, call=16, rc=0, cib-update=49, confirmed=true)
>>> Sep 19 10:03:46 MDA1PFP-S01 crmd[7797]:  notice: Transition 3 (Complete=2, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-501.bz2): Complete
>>> Sep 19 10:03:46 MDA1PFP-S01 crmd[7797]:  notice: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]
>>> Sep 19 10:03:46 MDA1PFP-S01 crmd[7797]:  notice: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph ]
>>> Sep 19 10:03:46 MDA1PFP-S01 pengine[7796]:  notice: On loss of CCM Quorum: Ignore
>>> Sep 19 10:03:46 MDA1PFP-S01 pengine[7796]: warning: Processing failed op monitor for mda-ip on MDA1PFP-PCS01: not configured (6)
>>> Sep 19 10:03:46 MDA1PFP-S01 pengine[7796]:   error: Preventing mda-ip from re-starting anywhere: operation monitor failed 'not configured' (6)
>>>
>>> I think that explains why the resource is not started on the other
>>> node, but I am not sure this is a good decision. It seems to be a
>>> little harsh to prevent the resource from starting anywhere,
>>> especially considering that the other node will be able to start the
>>> resource.
> 
> The resource agent is supposed to return "not configured" only when the
> *pacemaker* configuration of the resource is inherently invalid, so
> there's no chance of it starting anywhere.
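> 
> For clarity, the intended split looks roughly like this (a sketch of
> the OCF convention, not the actual IPaddr2 code):
> 
>   # in a monitor/stop action, when the configured NIC is not present
>   if ! ip link show "$OCF_RESKEY_nic" >/dev/null 2>&1; then
>       # runtime/local problem: another node may still be able to run
>       # the resource, so report a generic failure instead
>       exit $OCF_ERR_GENERIC        # rc=1
>   fi
>   # $OCF_ERR_CONFIGURED (rc=6) is reserved for an invalid resource
>   # *configuration* (e.g. a required parameter missing), which no
>   # node could ever start successfully
> 
> Returning rc=6 for a missing interface is what drives the "Preventing
> mda-ip from re-starting anywhere" decision in the pengine log above.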
> 
> As Jan suggested, make sure you've applied any resource-agents updates.
> If that doesn't fix it, it sounds like a bug in the agent, or something
> really is wrong with your pacemaker resource config.
> 
>>
>> The problem to start with is that based on
>>
>>> Sep 19 10:03:45 MDA1PFP-S01 IPaddr2(mda-ip)[8745]: ERROR: Unknown interface [bond0] No such device.
>>> Sep 19 10:03:45 MDA1PFP-S01 IPaddr2(mda-ip)[8745]: WARNING: [findif] failed
>>
>> you may be using too ancient a version of resource-agents:
>>
>> https://github.com/ClusterLabs/resource-agents/pull/320
>>
>> so until you update, the troubleshooting would be quite moot.



