[ClusterLabs] Virtual ip resource restarted on node with down network device

Tue Sep 20 11:39:35 UTC 2016

Hi,

I've updated to resource-agents 3.9.7, which is the latest stable version, but I am still seeing the same issues.
MDA1PFP-S01 11:31:40 2495 130 ~ # yum list resource-agents
Loaded plugins: langpacks, product-id, search-disabled-repos, subscription-manager
Installed Packages
resource-agents.x86_64    3.9.7-4.el7    @/resource-agents-3.9.7-4.el7.x86_64

ifdown still shows the same behavior. Initially, I can see two IP addresses assigned to device bond0. After I run "ifdown bond0" on the command line, Pacemaker restarts the resource "successfully", but the default IP address is not assigned to the device again:
25: bond0: <NO-CARRIER,BROADCAST,MULTICAST,MASTER,UP> mtu 1500 qdisc noqueue state DOWN qlen 30000
    link/ether 46:0a:be:70:36:11 brd ff:ff:ff:ff:ff:ff
    inet 192.168.120.20/32 scope global bond0
       valid_lft forever preferred_lft forever
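
To make the missing address easy to spot, the "ip addr" output can be checked mechanically. This is only a minimal sketch: it assumes that 192.168.120.10 is the host's own address on bond0 (the avahi and corosync lines in the log suggest this) and that 192.168.120.20 is the VIP. The function names list_ipv4 and missing_ipv4 are made up for illustration.

```shell
# Print every IPv4 address found in `ip addr show` output read from stdin.
list_ipv4() {
    awk '$1 == "inet" { split($2, a, "/"); print a[1] }'
}

# Print each expected address (given as arguments) that does not appear
# in the `ip addr show` output read from stdin.
missing_ipv4() {
    present=$(list_ipv4)
    for addr in "$@"; do
        printf '%s\n' "$present" | grep -qxF -- "$addr" || printf '%s\n' "$addr"
    done
}
```

Piping "ip -4 addr show dev bond0" into "missing_ipv4 192.168.120.10 192.168.120.20" in the state shown above would print 192.168.120.10, because only the VIP is left on the device.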

The log says that IPaddr2 assigns 192.168.120.20 to bond0, but nothing else:
Sep 20 11:34:25 MDA1PFP-S01 kernel: bond0: Removing slave eno49
Sep 20 11:34:25 MDA1PFP-S01 kernel: bond0: Releasing active interface eno49
Sep 20 11:34:25 MDA1PFP-S01 kernel: bond0: the permanent HWaddr of eno49 - 5c:b9:01:9c:e7:fc - is still in use by bond0 - set the HWaddr of eno49 to a different address to avoid conflicts
Sep 20 11:34:25 MDA1PFP-S01 kernel: bond0: making interface eno50 the new active one
Sep 20 11:34:25 MDA1PFP-S01 kernel: ixgbe 0000:04:00.0: removed PHC on eno49
Sep 20 11:34:25 MDA1PFP-S01 NetworkManager[881]: <info>  (bond0): bond slave eno49 was released
Sep 20 11:34:25 MDA1PFP-S01 NetworkManager[881]: <info>  (eno49): released from master bond0
Sep 20 11:34:26 MDA1PFP-S01 kernel: bond0: Removing slave eno50
Sep 20 11:34:26 MDA1PFP-S01 kernel: bond0: Releasing active interface eno50
Sep 20 11:34:26 MDA1PFP-S01 kernel: ixgbe 0000:04:00.1: removed PHC on eno50
Sep 20 11:34:26 MDA1PFP-S01 NetworkManager[881]: <info>  (bond0): bond slave eno50 was released
Sep 20 11:34:26 MDA1PFP-S01 NetworkManager[881]: <info>  (eno50): released from master bond0
Sep 20 11:34:26 MDA1PFP-S01 NetworkManager[881]: <info>  (eno50): link disconnected
Sep 20 11:34:26 MDA1PFP-S01 NetworkManager[881]: <info>  (bond0): link disconnected
Sep 20 11:34:26 MDA1PFP-S01 avahi-daemon[912]: Withdrawing address record for 192.168.120.10 on bond0.
Sep 20 11:34:26 MDA1PFP-S01 avahi-daemon[912]: Leaving mDNS multicast group on interface bond0.IPv4 with address 192.168.120.10.
Sep 20 11:34:26 MDA1PFP-S01 avahi-daemon[912]: Joining mDNS multicast group on interface bond0.IPv4 with address 192.168.120.20.
Sep 20 11:34:26 MDA1PFP-S01 avahi-daemon[912]: Withdrawing address record for 192.168.120.20 on bond0.
Sep 20 11:34:26 MDA1PFP-S01 avahi-daemon[912]: Leaving mDNS multicast group on interface bond0.IPv4 with address 192.168.120.20.
Sep 20 11:34:26 MDA1PFP-S01 avahi-daemon[912]: Interface bond0.IPv4 no longer relevant for mDNS.
Sep 20 11:34:26 MDA1PFP-S01 avahi-daemon[912]: Withdrawing address record for fe80::5eb9:1ff:fe9c:e7fc on bond0.
Sep 20 11:34:29 MDA1PFP-S01 corosync[30167]: [TOTEM ] Retransmit List: 7e
Sep 20 11:34:29 MDA1PFP-S01 corosync[30167]: [TOTEM ] Retransmit List: 7e
Sep 20 11:34:29 MDA1PFP-S01 corosync[30167]: [TOTEM ] Marking ringid 1 interface 192.168.120.10 FAULTY
Sep 20 11:34:29 MDA1PFP-S01 corosync[30167]: [TOTEM ] Retransmit List: 7e
Sep 20 11:34:29 MDA1PFP-S01 IPaddr2(mda-ip)[32025]: INFO: IP status = no, IP_CIP=
Sep 20 11:34:29 MDA1PFP-S01 crmd[30188]:  notice: Operation mda-ip_stop_0: ok (node=MDA1PFP-PCS01, call=9, rc=0, cib-update=17, confirmed=true)
Sep 20 11:34:29 MDA1PFP-S01 IPaddr2(mda-ip)[32072]: INFO: Adding inet address 192.168.120.20/32 to device bond0
Sep 20 11:34:29 MDA1PFP-S01 IPaddr2(mda-ip)[32072]: INFO: Bringing device bond0 up
Sep 20 11:34:29 MDA1PFP-S01 kernel: IPv6: ADDRCONF(NETDEV_UP): bond0: link is not ready
Sep 20 11:34:29 MDA1PFP-S01 avahi-daemon[912]: Joining mDNS multicast group on interface bond0.IPv4 with address 192.168.120.20.
Sep 20 11:34:29 MDA1PFP-S01 avahi-daemon[912]: New relevant interface bond0.IPv4 for mDNS.
Sep 20 11:34:29 MDA1PFP-S01 avahi-daemon[912]: Registering new address record for 192.168.120.20 on bond0.IPv4.
Sep 20 11:34:29 MDA1PFP-S01 IPaddr2(mda-ip)[32072]: INFO: /usr/libexec/heartbeat/send_arp -i 200 -r 5 -p /var/run/resource-agents/send_arp-192.168.120.20 bond0 192.168.120.20 auto not_used not_used
Sep 20 11:34:29 MDA1PFP-S01 crmd[30188]:  notice: Operation mda-ip_start_0: ok (node=MDA1PFP-PCS01, call=10, rc=0, cib-update=18, confirmed=true)
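
The lines that show what Pacemaker itself concluded are the two crmd "Operation ..." entries. A small filter like the following pulls them out of a longer log. It is a sketch that assumes the crmd message format stays exactly as in the excerpt above; the function name crmd_ops is hypothetical.

```shell
# Read syslog lines on stdin and print "<time> <operation> <result> rc=<n>"
# for every crmd "Operation ..." line; all other lines are dropped.
crmd_ops() {
    sed -n -E 's/^[A-Za-z]+ +[0-9]+ ([0-9:]+) .* crmd\[[0-9]+\]: +notice: Operation ([^:]+): (.*) \(node=.*[ ,]rc=([0-9]+),.*$/\1 \2 \3 rc=\4/p'
}
```

Applied to the excerpt above, this leaves only the stop and the start of mda-ip, both with rc=0. From Pacemaker's point of view the restart succeeded, even though the link on bond0 is down.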

The VIP is reachable locally, but not from other hosts:
MDA1PFP-S01 11:36:12 2526 0 ~ # ping 192.168.120.20
PING 192.168.120.20 (192.168.120.20) 56(84) bytes of data.
64 bytes from 192.168.120.20: icmp_seq=1 ttl=64 time=0.027 ms
64 bytes from 192.168.120.20: icmp_seq=2 ttl=64 time=0.016 ms
64 bytes from 192.168.120.20: icmp_seq=3 ttl=64 time=0.029 ms

MDA1PFP-S02 11:33:31 1273 0 ~ # ping 192.168.120.20
PING 192.168.120.20 (192.168.120.20) 56(84) bytes of data.
From 192.168.120.11 icmp_seq=10 Destination Host Unreachable
From 192.168.120.11 icmp_seq=11 Destination Host Unreachable
From 192.168.120.11 icmp_seq=12 Destination Host Unreachable

Best wishes,
  Jens

--
Jens Auer | CGI | Software-Engineer
CGI (Germany) GmbH & Co. KG
Rheinstraße 95 | 64295 Darmstadt | Germany
T: +49 6151 36860 154
jens.auer at cgi.com

________________________________________
From: Ken Gaillot [kgaillot at redhat.com]
Sent: Monday, 19 September 2016 17:31
To: users at clusterlabs.org
Subject: Re: [ClusterLabs] Virtual ip resource restarted on node with down network device

On 09/19/2016 10:04 AM, Jan Pokorný wrote:
> On 19/09/16 10:18 +0000, Auer, Jens wrote:
>> Ok, after reading the log files again I found
>>
>> Sep 19 10:03:45 MDA1PFP-S01 crmd[7797]:  notice: Initiating action 3: stop mda-ip_stop_0 on MDA1PFP-PCS01 (local)
>> Sep 19 10:03:45 MDA1PFP-S01 crmd[7797]:  notice: MDA1PFP-PCS01-mda-ip_monitor_1000:14 [ ocf-exit-reason:Unknown interface [bond0] No such device.\n ]
>> Sep 19 10:03:45 MDA1PFP-S01 IPaddr2(mda-ip)[8745]: ERROR: Unknown interface [bond0] No such device.
>> Sep 19 10:03:45 MDA1PFP-S01 IPaddr2(mda-ip)[8745]: WARNING: [findif] failed
>> Sep 19 10:03:45 MDA1PFP-S01 lrmd[7794]:  notice: mda-ip_stop_0:8745:stderr [ ocf-exit-reason:Unknown interface [bond0] No such device. ]
>> Sep 19 10:03:45 MDA1PFP-S01 crmd[7797]:  notice: Operation mda-ip_stop_0: ok (node=MDA1PFP-PCS01, call=16, rc=0, cib-update=49, confirmed=true)
>> Sep 19 10:03:46 MDA1PFP-S01 crmd[7797]:  notice: Transition 3 (Complete=2, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-501.bz2): Complete
>> Sep 19 10:03:46 MDA1PFP-S01 crmd[7797]:  notice: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]
>> Sep 19 10:03:46 MDA1PFP-S01 crmd[7797]:  notice: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph ]
>> Sep 19 10:03:46 MDA1PFP-S01 pengine[7796]:  notice: On loss of CCM Quorum: Ignore
>> Sep 19 10:03:46 MDA1PFP-S01 pengine[7796]: warning: Processing failed op monitor for mda-ip on MDA1PFP-PCS01: not configured (6)
>> Sep 19 10:03:46 MDA1PFP-S01 pengine[7796]:   error: Preventing mda-ip from re-starting anywhere: operation monitor failed 'not configured' (6)
>>
>> I think that explains why the resource is not started on the other
>> node, but I am not sure this is a good decision. It seems to be a
>> little harsh to prevent the resource from starting anywhere,
>> especially considering that the other node will be able to start the
>> resource.

The resource agent is supposed to return "not configured" only when the
*pacemaker* configuration of the resource is inherently invalid, so
there's no chance of it starting anywhere.
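
For reference, the "(6)" in the pengine messages quoted above is the numeric OCF exit code, and "not configured" corresponds to OCF_ERR_CONFIGURED. The helper below maps the standard OCF codes 0 to 7 to their symbolic names; the function name ocf_rc_name is made up for this sketch and is not part of any shipped tool.

```shell
# Translate a numeric OCF exit code into its symbolic name.
ocf_rc_name() {
    case "$1" in
        0) echo OCF_SUCCESS ;;
        1) echo OCF_ERR_GENERIC ;;
        2) echo OCF_ERR_ARGS ;;
        3) echo OCF_ERR_UNIMPLEMENTED ;;
        4) echo OCF_ERR_PERM ;;
        5) echo OCF_ERR_INSTALLED ;;
        6) echo OCF_ERR_CONFIGURED ;;   # the code reported for mda-ip
        7) echo OCF_NOT_RUNNING ;;
        *) echo "unknown ($1)" ;;
    esac
}
```

A device that is missing on one node is a node-local condition, so code 6 coming back from the monitor looks out of place here.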

As Jan suggested, make sure you've applied any resource-agents updates.
If that doesn't fix it, it sounds like a bug in the agent, or something
really is wrong with your pacemaker resource config.

>
> The problem to start with is that based on
>
>> Sep 19 10:03:45 MDA1PFP-S01 IPaddr2(mda-ip)[8745]: ERROR: Unknown interface [bond0] No such device.
>> Sep 19 10:03:45 MDA1PFP-S01 IPaddr2(mda-ip)[8745]: WARNING: [findif] failed
>
> you may be using too old a version of resource-agents:
>
> https://github.com/ClusterLabs/resource-agents/pull/320
>
> so until you update, the troubleshooting would be quite moot.
