[ClusterLabs] Virtual ip resource restarted on node with down network device

Tue Sep 20 08:16:15 EDT 2016

Hi,

One thing to add: everything works as expected when I physically unplug the network cables to force a failover.

Best wishes,
  Jens

--
Jens Auer | CGI | Software-Engineer
CGI (Germany) GmbH & Co. KG
Rheinstraße 95 | 64295 Darmstadt | Germany
T: +49 6151 36860 154
jens.auer at cgi.com
Our mandatory disclosures pursuant to § 35a GmbHG / §§ 161, 125a HGB can be found at de.cgi.com/pflichtangaben.

CONFIDENTIALITY NOTICE: Proprietary/Confidential information belonging to CGI Group Inc. and its affiliates may be contained in this message. If you are not a recipient indicated or intended in this message (or responsible for delivery of this message to such person), or you think for any reason that this message may have been addressed to you in error, you may not use or copy or deliver this message to anyone else. In such case, you should destroy this message and are asked to notify the sender by reply e-mail.

________________________________________
From: Auer, Jens [jens.auer at cgi.com]
Sent: Tuesday, 20 September 2016 13:44
To: Cluster Labs - All topics related to open-source clustering welcomed
Subject: Re: [ClusterLabs] Virtual ip resource restarted on node with down network device

Hi,

I've decided to write two separate replies for the two problems. The cluster still fails to relocate the resource after unloading the modules, even with resource-agents 3.9.7:
MDA1PFP-S01 11:42:50 2533 0 ~ # yum list resource-agents
Loaded plugins: langpacks, product-id, search-disabled-repos, subscription-manager
Installed Packages
resource-agents.x86_64    3.9.7-4.el7    @/resource-agents-3.9.7-4.el7.x86_64

Sep 20 11:42:52 MDA1PFP-S01 crmd[13908]: warning: Action 9 (mda-ip_start_0) on MDA1PFP-PCS01 failed (target: 0 vs. rc: 6): Error
Sep 20 11:42:52 MDA1PFP-S01 crmd[13908]: warning: Action 9 (mda-ip_start_0) on MDA1PFP-PCS01 failed (target: 0 vs. rc: 6): Error
Sep 20 11:42:52 MDA1PFP-S01 crmd[13908]:  notice: Transition 5 (Complete=3, Pending=0, Fired=0, Skipped=0, Incomplete=1, Source=/var/lib/pacemaker/pengine/pe-input-552.bz2): Complete
Sep 20 11:42:52 MDA1PFP-S01 pengine[13907]:  notice: On loss of CCM Quorum: Ignore
Sep 20 11:42:52 MDA1PFP-S01 pengine[13907]: warning: Processing failed op start for mda-ip on MDA1PFP-PCS01: not configured (6)
Sep 20 11:42:52 MDA1PFP-S01 pengine[13907]:   error: Preventing mda-ip from re-starting anywhere: operation start failed 'not configured' (6)
Sep 20 11:42:52 MDA1PFP-S01 pengine[13907]: warning: Processing failed op start for mda-ip on MDA1PFP-PCS01: not configured (6)
Sep 20 11:42:52 MDA1PFP-S01 pengine[13907]:   error: Preventing mda-ip from re-starting anywhere: operation start failed 'not configured' (6)
Sep 20 11:42:52 MDA1PFP-S01 pengine[13907]:  notice: Stop    mda-ip     (MDA1PFP-PCS01)
Sep 20 11:42:52 MDA1PFP-S01 pengine[13907]:  notice: Calculated Transition 6: /var/lib/pacemaker/pengine/pe-input-553.bz2
Sep 20 11:42:52 MDA1PFP-S01 crmd[13908]:  notice: Initiating action 2: stop mda-ip_stop_0 on MDA1PFP-PCS01 (local)
Sep 20 11:42:52 MDA1PFP-S01 IPaddr2(mda-ip)[15336]: INFO: IP status = no, IP_CIP=
Sep 20 11:42:52 MDA1PFP-S01 lrmd[13905]:  notice: mda-ip_stop_0:15336:stderr [ Device "bond0" does not exist. ]
Sep 20 11:42:52 MDA1PFP-S01 crmd[13908]:  notice: Operation mda-ip_stop_0: ok (node=MDA1PFP-PCS01, call=18, rc=0, cib-update=48, confirmed=true)
Sep 20 11:42:53 MDA1PFP-S01 corosync[13887]: [TOTEM ] Retransmit List: 93
Sep 20 11:42:53 MDA1PFP-S01 corosync[13887]: [TOTEM ] Retransmit List: 93 96 98
Sep 20 11:42:53 MDA1PFP-S01 corosync[13887]: [TOTEM ] Retransmit List: 93 98 9a 9c
Sep 20 11:42:53 MDA1PFP-S01 corosync[13887]: [TOTEM ] Marking ringid 1 interface 192.168.120.10 FAULTY
Sep 20 11:42:53 MDA1PFP-S01 corosync[13887]: [TOTEM ] Retransmit List: 98 9c 9f a1
Sep 20 11:42:53 MDA1PFP-S01 crmd[13908]:  notice: Transition 6 (Complete=2, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-553.bz2): Complete
Sep 20 11:42:53 MDA1PFP-S01 crmd[13908]:  notice: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]
Sep 20 11:42:53 MDA1PFP-S01 crmd[13908]:  notice: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph ]
Sep 20 11:42:53 MDA1PFP-S01 pengine[13907]:  notice: On loss of CCM Quorum: Ignore
Sep 20 11:42:53 MDA1PFP-S01 pengine[13907]: warning: Processing failed op start for mda-ip on MDA1PFP-PCS01: not configured (6)
Sep 20 11:42:53 MDA1PFP-S01 pengine[13907]:   error: Preventing mda-ip from re-starting anywhere: operation start failed 'not configured' (6)
Sep 20 11:42:53 MDA1PFP-S01 pengine[13907]: warning: Forcing mda-ip away from MDA1PFP-PCS01 after 1000000 failures (max=1000000)
Sep 20 11:42:53 MDA1PFP-S01 pengine[13907]:  notice: Calculated Transition 7: /var/lib/pacemaker/pengine/pe-input-554.bz2
Sep 20 11:42:53 MDA1PFP-S01 crmd[13908]:  notice: Transition 7 (Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-554.bz2): Complete
Sep 20 11:42:53 MDA1PFP-S01 crmd[13908]:  notice: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]
Sep 20 11:43:02 MDA1PFP-S01 crmd[13908]:  notice: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph ]
Sep 20 11:43:02 MDA1PFP-S01 pengine[13907]:  notice: On loss of CCM Quorum: Ignore
Sep 20 11:43:02 MDA1PFP-S01 pengine[13907]: warning: Processing failed op start for mda-ip on MDA1PFP-PCS01: not configured (6)
Sep 20 11:43:02 MDA1PFP-S01 pengine[13907]:   error: Preventing mda-ip from re-starting anywhere: operation start failed 'not configured' (6)
Sep 20 11:43:02 MDA1PFP-S01 pengine[13907]: warning: Forcing mda-ip away from MDA1PFP-PCS01 after 1000000 failures (max=1000000)
Sep 20 11:43:02 MDA1PFP-S01 pengine[13907]:  notice: Calculated Transition 8: /var/lib/pacemaker/pengine/pe-input-555.bz2
Sep 20 11:43:02 MDA1PFP-S01 crmd[13908]:  notice: Transition 8 (Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-555.bz2): Complete
Sep 20 11:43:02 MDA1PFP-S01 crmd[13908]:  notice: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]
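
As a reading aid for the log above: the "rc: 6" reported for mda-ip_start_0 is the OCF exit code OCF_ERR_CONFIGURED ("not configured"). Pacemaker treats that code as a fatal error, which matches the "Preventing mda-ip from re-starting anywhere" messages. The following is only a minimal sketch for decoding such codes from a log line; the function name ocf_rc_name is made up for illustration and is not part of Pacemaker or resource-agents.

```shell
#!/bin/sh
# Map an OCF resource agent exit code to its symbolic name.
# ocf_rc_name is a hypothetical helper, not a Pacemaker tool.
ocf_rc_name() {
    case "$1" in
        0) echo OCF_SUCCESS ;;
        1) echo OCF_ERR_GENERIC ;;
        2) echo OCF_ERR_ARGS ;;
        3) echo OCF_ERR_UNIMPLEMENTED ;;
        4) echo OCF_ERR_PERM ;;
        5) echo OCF_ERR_INSTALLED ;;
        6) echo OCF_ERR_CONFIGURED ;;
        7) echo OCF_NOT_RUNNING ;;
        8) echo OCF_RUNNING_MASTER ;;
        9) echo OCF_FAILED_MASTER ;;
        *) echo UNKNOWN ;;
    esac
}

# One of the crmd warnings from the log above, copied verbatim.
line='Sep 20 11:42:52 MDA1PFP-S01 crmd[13908]: warning: Action 9 (mda-ip_start_0) on MDA1PFP-PCS01 failed (target: 0 vs. rc: 6): Error'

# Pull the number that follows "rc: " and precedes the closing parenthesis.
rc=$(printf '%s\n' "$line" | sed -n 's/.*rc: \([0-9][0-9]*\)).*/\1/p')

echo "rc=$rc -> $(ocf_rc_name "$rc")"
```

For the line shown, this prints "rc=6 -> OCF_ERR_CONFIGURED".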

Cheers,
  Jens

________________________________________
From: Auer, Jens [jens.auer at cgi.com]
Sent: Monday, 19 September 2016 16:36
To: Cluster Labs - All topics related to open-source clustering welcomed
Subject: Re: [ClusterLabs] Virtual ip resource restarted on node with down network device

Hi,

>> After the restart ifconfig still shows the device bond0 to be not RUNNING:
>> MDA1PFP-S01 09:07:54 2127 0 ~ # ifconfig
>> bond0: flags=5123<UP,BROADCAST,MASTER,MULTICAST>  mtu 1500
>>         inet 192.168.120.20  netmask 255.255.255.255  broadcast 0.0.0.0
>>         ether a6:17:2c:2a:72:fc  txqueuelen 30000  (Ethernet)
>>         RX packets 2034  bytes 286728 (280.0 KiB)
>>         RX errors 0  dropped 29  overruns 0  frame 0
>>         TX packets 2284  bytes 355975 (347.6 KiB)
>>         TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

There does seem to be some difference, because the device is not RUNNING. For comparison, this is the output on another machine:
mdaf-pf-pep-spare 14:17:53 999 0 ~ # ifconfig
bond0: flags=5187<UP,BROADCAST,RUNNING,MASTER,MULTICAST>  mtu 1500
        inet 192.168.120.10  netmask 255.255.255.0  broadcast 192.168.120.255
        inet6 fe80::5eb9:1ff:fe9c:e7fc  prefixlen 64  scopeid 0x20<link>
        ether 5c:b9:01:9c:e7:fc  txqueuelen 30000  (Ethernet)
        RX packets 15455692  bytes 22377220306 (20.8 GiB)
        RX errors 0  dropped 2392  overruns 0  frame 0
        TX packets 14706747  bytes 21361519159 (19.8 GiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
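
The numeric flags values in the two ifconfig outputs (5123 and 5187) can be decoded to confirm that RUNNING is the only bit that differs. This is a small sketch, assuming the Linux IFF_* bit values from <linux/if.h>; the function name decode_flags is invented for illustration, and only the five flags that appear in the outputs are listed.

```shell
#!/bin/sh
# Decode the numeric interface flags printed by ifconfig.
# decode_flags is a hypothetical helper; the bit values are the Linux
# IFF_* constants (IFF_UP=0x1, IFF_BROADCAST=0x2, IFF_RUNNING=0x40,
# IFF_MASTER=0x400, IFF_MULTICAST=0x1000), written here in decimal.
decode_flags() {
    flags=$1
    out=""
    for pair in 1:UP 2:BROADCAST 64:RUNNING 1024:MASTER 4096:MULTICAST; do
        bit=${pair%%:*}     # part before the colon: the bit value
        name=${pair#*:}     # part after the colon: the flag name
        if [ $(( flags & bit )) -ne 0 ]; then
            out="${out:+$out,}$name"
        fi
    done
    echo "$out"
}

decode_flags 5123    # flags value from MDA1PFP-S01
decode_flags 5187    # flags value from mdaf-pf-pep-spare

# XOR of the two values shows which bits differ between them.
echo "differing bits: $(( 5123 ^ 5187 ))"
```

The XOR comes out as 64, which is the RUNNING bit. On Linux this flag generally reflects the operational (carrier) state of the link rather than the administrative UP state.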

Also, the netmask and the IP address are wrong. I have configured the device with address 192.168.120.10 and netmask 255.255.255.0. How does IPaddr2 get the wrong configuration? I have no idea.

>Anyway, you should rather be using "ip" command from iproute suite
>than various if* tools that come short in some cases:
>http://inai.de/2008/02/19
>This would also be consistent with what IPaddr2 uses under the hood.

We are using Red Hat 7, which uses either NetworkManager or the network scripts. We use the latter, so ifup/ifdown should be the correct way to bring the network card up and down. I also tried ip link set dev bond0 up/down, and it brings up the device with the correct IP address and netmask.
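
When comparing the two tools, note that ifconfig prints the netmask in dotted-quad form while the ip command prints a CIDR prefix length (for example /24 or /32). The sketch below converts one into the other so the outputs can be compared directly. The function name mask_to_prefix is made up for illustration, and the conversion does not check that the mask bits are contiguous.

```shell
#!/bin/sh
# Convert a dotted-quad netmask (as shown by ifconfig) into a CIDR
# prefix length (as shown by ip). mask_to_prefix is a hypothetical
# helper; it validates each octet but not the contiguity of the mask.
mask_to_prefix() {
    prefix=0
    old_ifs=$IFS
    IFS=.
    set -- $1          # split the argument into its four octets
    IFS=$old_ifs
    for octet in "$@"; do
        case "$octet" in
            255) prefix=$(( prefix + 8 )) ;;
            254) prefix=$(( prefix + 7 )) ;;
            252) prefix=$(( prefix + 6 )) ;;
            248) prefix=$(( prefix + 5 )) ;;
            240) prefix=$(( prefix + 4 )) ;;
            224) prefix=$(( prefix + 3 )) ;;
            192) prefix=$(( prefix + 2 )) ;;
            128) prefix=$(( prefix + 1 )) ;;
            0) ;;
            *) echo "invalid netmask octet: $octet" >&2; return 1 ;;
        esac
    done
    echo "$prefix"
}

mask_to_prefix 255.255.255.255   # netmask shown for 192.168.120.20
mask_to_prefix 255.255.255.0     # netmask shown for 192.168.120.10

# On the node itself (not run here), the ip view of the same data is:
#   ip -o -4 addr show dev bond0
#   ip link show dev bond0
```

The two calls print 32 and 24, so the address with netmask 255.255.255.255 would appear as 192.168.120.20/32 in the output of ip.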

Best wishes,
  Jens

________________________________________
From: Jan Pokorný [jpokorny at redhat.com]
Sent: Monday, 19 September 2016 14:57
To: Cluster Labs - All topics related to open-source clustering welcomed
Subject: Re: [ClusterLabs] Virtual ip resource restarted on node with down network device

On 19/09/16 09:15 +0000, Auer, Jens wrote:
> After the restart ifconfig still shows the device bond0 to be not RUNNING:
> MDA1PFP-S01 09:07:54 2127 0 ~ # ifconfig
> bond0: flags=5123<UP,BROADCAST,MASTER,MULTICAST>  mtu 1500
>         inet 192.168.120.20  netmask 255.255.255.255  broadcast 0.0.0.0
>         ether a6:17:2c:2a:72:fc  txqueuelen 30000  (Ethernet)
>         RX packets 2034  bytes 286728 (280.0 KiB)
>         RX errors 0  dropped 29  overruns 0  frame 0
>         TX packets 2284  bytes 355975 (347.6 KiB)
>         TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

This seems to suggest that the bond0 interface is up and has an address
assigned (well, the netmask is strange).  So there would be nothing
contradictory to what I said regarding IPaddr2.

Anyway, you should rather be using "ip" command from iproute suite
than various if* tools that come short in some cases:
http://inai.de/2008/02/19
This would also be consistent with what IPaddr2 uses under the hood.

--
Jan (Poki)

_______________________________________________
Users mailing list: Users at clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
