[Pacemaker] question about interface failover

Florian Crouzat gentoo at floriancrouzat.net
Tue May 21 08:04:29 EDT 2013


On 18/05/2013 20:23, christopher barry wrote:
> On Fri, 2013-05-17 at 10:41 +0200, Florian Crouzat wrote:
>> On 16/05/2013 21:45, christopher barry wrote:
>>> Greetings,
>>>
>>> I've set up a new 2-node MySQL cluster using
>>> * drbd 8.3.13
>>> * corosync 1.4.2
>>> * pacemaker 1.1.7
>>> on Debian Wheezy nodes.
>>>
>>> Failover seems to be working fine for everything except the IPs manually
>>> configured on the interfaces.
>>
>> This sentence makes no sense to me.
>> The cluster will not fail over something that is not managed by the
>> cluster (a 'manually' configured IP...).
>>
>> What exactly are you trying to achieve?
>> Also, could you pastebin the output of "crm_mon -Arf1"? I find it
>> easier to read.
>>
>>
>>>
>>> see config here:
>>> http://pastebin.aquilenet.fr/?9eb51f6fb7d65fda#/YvSiYFocOzogAmPU9g+g09RcJvhHbgrY1JuN7D+gA4=
>>>
>>> If I bring down an interface, when the cluster restarts it, it only
>>> starts it with the vip - the original ip and route have been removed.
>>
>> Makes sense if you added the 'original' IP manually...
>> You should have the non-VIP addresses in your distribution's network
>> configuration (e.g. /etc/sysconfig/network/ifcfg-*).
>> But then again, please specify what you are trying to achieve.
>>
>>>
>>> Not sure what to do to make sure the permanent IP and the routes get
>>> restored. I'm not all that versed in the cluster command line yet, and
>>> I'm using LCMC for most of my usage.
>>
>>
>
> (@howard2.rjmetrics.com)-(14:00 / Sat May 18)
> [-][~]# crm_mon -Arf1
> ============
> Last updated: Sat May 18 14:00:27 2013
> Last change: Thu May 16 17:33:07 2013 via crm_attribute on
> howard3.rjmetrics.com
> Stack: openais
> Current DC: howard3.rjmetrics.com - partition with quorum
> Version: 1.1.7-ee0730e13d124c3d58f00016c3376a1de5323cff
> 2 Nodes configured, 2 expected votes
> 6 Resources configured.
> ============
>
> Online: [ howard3.rjmetrics.com howard2.rjmetrics.com ]
>
> Full list of resources:
>
>   Master/Slave Set: ms_drbd_mysql [p_drbd_mysql]
>       Masters: [ howard2.rjmetrics.com ]
>       Slaves: [ howard3.rjmetrics.com ]
>   Resource Group: g_mysql
>       p_fs_mysql	(ocf::heartbeat:Filesystem):	Started
> howard2.rjmetrics.com
>       ClusterPrivateIP	(ocf::heartbeat:IPaddr2):	Started
> howard2.rjmetrics.com
>       ClusterPublicIP	(ocf::heartbeat:IPaddr2):	Started
> howard2.rjmetrics.com
>       p_mysql	(ocf::heartbeat:mysql):	Started howard2.rjmetrics.com
>
> Node Attributes:
> * Node howard3.rjmetrics.com:
>      + master-p_drbd_mysql:0           	: 1000
> * Node howard2.rjmetrics.com:
>      + master-p_drbd_mysql:1           	: 10000
>
> Migration summary:
> * Node howard3.rjmetrics.com:
>     p_drbd_mysql:1: migration-threshold=1000000 fail-count=1
> * Node howard2.rjmetrics.com:
>     ClusterPublicIP: migration-threshold=1000000 fail-count=1
>
> Failed actions:
>      p_drbd_mysql:1_promote_0 (node=howard3.rjmetrics.com, call=29,
> rc=-2, status=Timed Out): unknown exec error
>      ClusterPublicIP_monitor_30000 (node=howard2.rjmetrics.com, call=122,
> rc=7, status=complete): not running
>
>
> howard2 and howard3 are the two clustered servers.
>
> During testing, when I ifdown either eth0 or eth1, the cluster starts
> the VIP back up, but the other non-VIP IPs and routes do not get
> started. I'm running Debian, so these are configured
> in /etc/network/interfaces. Saying 'manually' configured was misleading
> on my part, sorry about that.

Mhh, I cannot reproduce right now, but I was pretty sure that IPaddr2
used "ip addr add X.X.X.X/YY dev ZZ", so I was expecting that ifdowning
device ZZ would prevent pacemaker from re-upping the VIP, since the
underlying device doesn't exist anymore.
This is also confirmed by the fact that the non-VIP address doesn't come
back up: IPaddr2 doesn't run ifup, it adds an alias to an existing device.
See "sudo crm ra meta IPaddr2" and search for "nic=".

Anyway, "ifdown" is not a valid use case to test your cluster, this 
doesn't represent any possible valid production scenario.
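
If you want to simulate a network failure without deconfiguring the
device, dropping its traffic is closer to a real outage. A sketch (the
interface name is a placeholder):

         # drop all traffic on eth1 without touching its configuration
         iptables -A INPUT -i eth1 -j DROP
         iptables -A OUTPUT -o eth1 -j DROP
         # and to restore it afterwards
         iptables -D INPUT -i eth1 -j DROP
         iptables -D OUTPUT -o eth1 -j DROP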

>
> eth0 is the public interface, and eth1 is the private interface. eth2
> and eth3 are bonded as bond0, use jumbo frames, and are crossover cabled
> between the nodes.
>
> The test I was doing was to pull cables from eth0 and eth1, which hung
> the cluster. My assumption is that I need to add more configuration
> elements to manage the other IPs and also setup some ping hosts that
> when unreachable will initiate failover. What would help me, I think, is
> an example config or pointers on how to add these elements.

Well, without digging much into your configuration: yes, you need ping
nodes so that the best-connected node "wins", and you also need fencing,
which is mandatory on any cluster.

Here's a sample configuration for ping nodes and a location constraint so
that the best-connected node hosts the resource "foo":


primitive ping-gw-sw1-sw2 ocf:pacemaker:ping \
         params host_list="192.168.10.1 192.168.2.11 192.168.2.12" \
         dampen="35s" attempts="2" timeout="2" multiplier="100" \
         op monitor interval="15s"

clone ping-nq-sw-swsec-clone ping-gw-sw1-sw2 \
         meta target-role="Started"

location IPHA-on-connected-node foo \
         rule $id="IPHA-on-connected-node-rule" pingd: defined pingd
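
A stricter variant, which most ping-node howtos also show, is to forbid
the resource from running on a node that has lost all connectivity (a
sketch, reusing the placeholder resource "foo"):

location IPHA-needs-connectivity foo \
         rule -inf: not_defined pingd or pingd lte 0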

See 
http://www.hastexo.com/resources/hints-and-kinks/network-connectivity-check-pacemaker
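
As for fencing, a minimal sketch with the external/ipmi stonith plugin
could look like the following (every device parameter below is a
placeholder; check "crm ra meta stonith:external/ipmi" for the real ones,
and pick the plugin that matches your hardware):

primitive fence-howard2 stonith:external/ipmi \
         params hostname="howard2.rjmetrics.com" ipaddr="192.0.2.10" \
         userid="admin" passwd="secret" interface="lan" \
         op monitor interval="60s"
location l-fence-howard2 fence-howard2 -inf: howard2.rjmetrics.com

The location constraint keeps a node from being responsible for fencing
itself.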

>
> On another note, the test made the DRBD link disconnect, with both disks
> now marked as StandAlone in the LCMC GUI. Right-clicking the disks or
> the connection does not allow any action other than viewing logs, which
> say:
>
> May 16 17:33:08 howard3 kernel: [781360.146362] block drbd0: Split-Brain
> detected but unresolved, dropping connection!
> May 16 17:33:08 howard3 kernel: [781360.146451] block drbd0: helper
> command: /sbin/drbdadm split-brain minor-0
> May 16 17:33:08 howard3 kernel: [781360.149042] block drbd0: helper
> command: /sbin/drbdadm split-brain minor-0 exit code 0 (0x0)
> May 16 17:33:08 howard3 kernel: [781360.149051] block drbd0:
> conn( WFReportParams -> Disconnecting )
> May 16 17:33:08 howard3 kernel: [781360.149060] block drbd0: error
> receiving ReportState, l: 4!
> May 16 17:33:08 howard3 kernel: [781360.149154] block drbd0: asender
> terminated
> May 16 17:33:08 howard3 kernel: [781360.149159] block drbd0: Terminating
> drbd0_asender
> May 16 17:33:08 howard3 kernel: [781360.149609] block drbd0: Connection
> closed
> May 16 17:33:08 howard3 kernel: [781360.149619] block drbd0:
> conn( Disconnecting -> StandAlone )
> May 16 17:33:08 howard3 kernel: [781360.149811] block drbd0: receiver
> terminated
> May 16 17:33:08 howard3 kernel: [781360.149815] block drbd0: Terminating
> drbd0_receiver
>
> I'm really not sure how to proceed. Please let me know any additional
> information you may need.

I know nothing about shared storage, so I can't help much on the DRBD side.
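
That said, the kernel messages above show a classic DRBD split-brain, and
the DRBD 8.3 user guide documents a manual recovery along these lines
(<resource> is a placeholder; you must decide which node's data to
discard, and anything not replicated from that node will be lost):

         # on the node whose data you are discarding:
         drbdadm secondary <resource>
         drbdadm -- --discard-my-data connect <resource>
         # on the surviving node (if it is also StandAlone):
         drbdadm connect <resource>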

>
> Thanks for your time Florian, it's much appreciated.
>

You're welcome.


-- 
Cheers,
Florian Crouzat



