[Pacemaker] Help with Pacemaker 2-node Router Setup

Eric Renfro erenfro at gmail.com
Sun Dec 27 01:33:05 EST 2009


Eric Renfro wrote:
> Michael Schwartzkopff wrote:
>> Am Samstag, 26. Dezember 2009 11:55:57 schrieb Eric Renfro:
>>  
>>> Michael Schwartzkopff wrote:
>>>    
>>>> Am Samstag, 26. Dezember 2009 11:27:54 schrieb Eric Renfro:
>>>>      
>>>>> Michael Schwartzkopff wrote:
>>>>>        
>>>>>> Am Samstag, 26. Dezember 2009 10:52:38 schrieb Eric Renfro:
>>>>>>          
>>>>>>> Michael Schwartzkopff wrote:
>>>>>>>            
>>>>>>>> Am Samstag, 26. Dezember 2009 08:12:49 schrieb Eric Renfro:
>>>>>>>>              
>>>>>>>>> Hello,
>>>>>>>>>
>>>>>>>>> I'm trying to set up 2 nodes that'll run pacemaker with openais as
>>>>>>>>> the communication layer. Ideally what I want is for router1 to be
>>>>>>>>> the master node and to take back over from router2 once it comes
>>>>>>>>> back up fully functional again. In my setup, the routers are both
>>>>>>>>> internet-facing servers that toggle the external internet IP to
>>>>>>>>> whichever controls it at the time, and also handle the internal IP
>>>>>>>>> for the gateway that internal systems route via.
>>>>>>>>>
>>>>>>>>> My problem so far is with Route in my setup, and later with getting
>>>>>>>>> shorewall to start/stop according to whichever node is active.
>>>>>>>>>
>>>>>>>>> Route, in my case in the setup I will show below, is failing to
>>>>>>>>> start initially because I presume the internet IP address is not
>>>>>>>>> fully initialized at the time it's trying to enable the route. If I
>>>>>>>>> do a crm resource cleanup failover-gw, it brings it up just fine. If
>>>>>>>>> I try to move the router_cluster resource to router2 from router1
>>>>>>>>> after it's fully up, it fails because of failover-gw on router2.
>>>>>>>>>                 
>>>>>>>> Very unlikely. If the IPaddr2 script finishes, the IP address is up.
>>>>>>>> Please search for other reasons and grep "lrm.*failover-gw" in the
>>>>>>>> logs.
>>>>>>>>
>>>>>>>>              
>>>>>>>>> Here's my setup at present. For the moment, until I figure out how
>>>>>>>>> to do it, shorewall is started manually; I want to automate this
>>>>>>>>> once the setup is working, though, so perhaps you guys could help
>>>>>>>>> me with that as well.
>>>>>>>>>
>>>>>>>>> primitive failover-int-ip ocf:heartbeat:IPaddr2 \
>>>>>>>>>         params ip="192.168.0.1" \
>>>>>>>>>         op monitor interval="2s"
>>>>>>>>> primitive failover-ext-ip ocf:heartbeat:IPaddr2 \
>>>>>>>>>         params ip="24.227.124.158" cidr_netmask="30" broadcast="24.227.124.159" nic="net0" \
>>>>>>>>>         op monitor interval="2s" \
>>>>>>>>>         meta target-role="Started"
>>>>>>>>> primitive failover-gw ocf:heartbeat:Route \
>>>>>>>>>         params destination="0.0.0.0/0" gateway="24.227.124.157" device="net0" \
>>>>>>>>>         meta target-role="Started" \
>>>>>>>>>         op monitor interval="2s"
>>>>>>>>> group router_cluster failover-int-ip failover-ext-ip failover-gw
>>>>>>>>> location router-master router_cluster \
>>>>>>>>>         rule $id="router-master-rule" $role="master" 100: #uname eq router1
>>>>>>>>>
>>>>>>>>> I would appreciate as much help as possible. I am fairly new to
>>>>>>>>> pacemaker, but so far all but the Route part of this works well.
>>>>>>>>>                 
>>>>>>>> Please give us a chance to help you by providing the relevant logs!
>>>>>>>>               
>>>>>>> Sure..
>>>>>>> Here's a big clip of the log, grepped for just failover-gw. Hopefully
>>>>>>> this helps; if not, I can pinpoint more around what's happening. The
>>>>>>> logs fill up pretty quickly as the cluster is coming alive.
>>>>>>>
>>>>>>> messages:Dec 26 02:00:21 router1 pengine: [4724]: info: unpack_rsc_op:
>>>>>>> failover-gw_monitor_0 on router2 returned 5 (not installed) instead of
>>>>>>> the expected value: 7 (not running)
>>>>>>>             
>>>>>> (...)
>>>>>>
>>>>>> The rest of the log is not needed. Just the first line tells you that
>>>>>> something is not installed correctly. Please read the lines just above
>>>>>> this line; normally they tell you what is missing.
>>>>>>
>>>>>> You could also read through the routing resource agent in
>>>>>> /usr/lib/ocf/resource.d/heartbeat/Route
>>>>>>
>>>>>> Greetings,
>>>>>>           
>>>>> Hmmm..
>>>>> I'm not seeing anything about it. Here's a clip of the lines above, and
>>>>> one line below, the one saying (not installed).
>>>>>
>>>>> Dec 26 05:00:21 router1 pengine: [4724]: info: determine_online_status:
>>>>> Node router1 is online
>>>>> Dec 26 05:00:21 router1 pengine: [4724]: info: unpack_rsc_op:
>>>>> failover-gw_monitor_0 on router1 returned 0 (ok) instead of the
>>>>> expected value: 7 (not running)
>>>>> Dec 26 05:00:21 router1 pengine: [4724]: WARN: unpack_rsc_op: Operation
>>>>> failover-gw_monitor_0 found resource failover-gw active on router1
>>>>> Dec 26 05:00:21 router1 pengine: [4724]: info: determine_online_status:
>>>>> Node router2 is online
>>>>> Dec 26 05:00:21 router1 pengine: [4724]: info: unpack_rsc_op:
>>>>> failover-gw_monitor_0 on router2 returned 5 (not installed) instead of
>>>>> the expected value: 7 (not running)
>>>>> Dec 26 05:00:21 router1 pengine: [4724]: ERROR: unpack_rsc_op: Hard
>>>>> error - failover-gw_monitor_0 failed with rc=5: Preventing failover-gw
>>>>> from re-starting on router2
>>>>>         
>>>> Hi,
>>>>
>>>> there must be other log entries. In the Route RA I have in front of me,
>>>> the agent writes the reason into ocf_log() before erroring out. What
>>>> versions of pacemaker and cluster-glue do you have? What distribution
>>>> are you running on?
>>>>
>>>> Greetings,
>>>>       
>>> I've checked all my logs. Syslog logs everything to my messages logfile,
>>> so it should be there if anywhere.
>>>
>>> I'm running OpenSUSE 11.2, which comes with heartbeat 2.99.3, pacemaker
>>> 1.0.1, and openais 0.80.3; that's what's running in this setup.
>>>     
>>
>> Hm. This is already quite an old version of pacemaker, but it should run
>> anyway. Could you please check the resource manually on router1?
>>
>> export OCF_ROOT=/usr/lib/ocf
>> export OCF_RESKEY_destination="0.0.0.0/0"
>> export OCF_RESKEY_gateway="24.227.124.157"
>>
>> /usr/lib/ocf/resource.d/heartbeat/Route monitor; echo $?
>> should result in 0 (started) or 7 (not started)
>>
>> /usr/lib/ocf/resource.d/heartbeat/Route start; echo $?
>> should add the default route and result in 0
>>
>> /usr/lib/ocf/resource.d/heartbeat/Route monitor; echo $?
>> should result in 0 (started)
>>
>> /usr/lib/ocf/resource.d/heartbeat/Route stop; echo $?
>> should delete the default route and result in 0
>>
>> /usr/lib/ocf/resource.d/heartbeat/Route monitor; echo $?
>> should result in 7 (not started)
>>
>> If this does not work as expected, are there any error messages?
>> Please see if you can debug the Route script.
>>
>> Greetings,
>>
>>   
> I did all these tests, and all results came back normal. The first monitor
> returned 7 (not started); after starting, start returned 0 and the next
> monitor returned 0; stop returned 0, and the monitor after stopping
> returned 7.
>
> It seems the error in my case happens further up, initially, which causes
> it to not start afterwards. Here's the current setup:
>
> primitive intIP ocf:heartbeat:IPaddr2 \
>         params ip="192.168.0.1" cidr_netmask="16" broadcast="192.168.255.255" nic="lan0"
> primitive extIP ocf:heartbeat:IPaddr2 \
>         params ip="24.227.124.158" cidr_netmask="30" broadcast="24.227.124.159" nic="net0"
> primitive resRoute ocf:heartbeat:Route \
>         params destination="0.0.0.0/0" gateway="24.227.124.157"
> primitive firewall lsb:shorewall
> group router_cluster extIP intIP resRoute firewall
> location router-master router_cluster \
>         rule $id="router-master-rule" $role="master" 100: #uname eq router1
>
> I have added blank lines in the logs to separate out the specific event
> segments that show it. One in particular, near the top, is what's causing
> the entire resRoute resource to fail completely:
>
> Dec 27 00:24:40 router2 crmd: [25786]: info: process_lrm_event: LRM 
> operation resRoute_monitor_0 (call=4, rc=5, cib-update=31, 
> confirmed=true) complete not installed
>
> This is OpenSUSE 11.1 with the ha-cluster repository, using pacemaker
> 1.0.5, cluster-glue 1.0, heartbeat 3.0.0, openais 0.80.5, and ha-resources
> 1.0 (which is the heartbeat 3.99.x stuff, I believe). So these are fairly
> current versions now.
>
> I'd been building my setup with SUSE Studio and hand-picking the packages
> needed.
>
> Any thoughts?
>
> -- 
> Eric Renfro
>

Aha!
The problem is somewhere in the Route script itself. Doing the same tests
you gave as an example earlier, on the very first monitor attempt on Route,
while the net0 interface is empty and offline, I get the error that was
shown in the previous log snippet:

Route[26705]: ERROR: Gateway address 24.227.124.157 is unreachable.

So the problem is that Route fails with an incorrect error code when it
simply can't create the route because the interface is currently offline.
It should report 7, because the resource is just not started.

After looking again at:
http://hg.linux-ha.org/agents/log/56b9100f9c49/heartbeat/Route
and then finding out that ocf_is_probe was non-existent on my system, I looked at:
http://hg.linux-ha.org/agents/file/56b9100f9c49/heartbeat/.ocf-shellfuncs.in

I was able to patch together a fix that worked. The ocf_is_probe from the
shellfuncs in that example didn't work with the -a $OCF_RESKEY_CRM_meta_interval
test because of "too many arguments", but omitting that part entirely resolved
the issue. The route now comes up successfully right from the start.
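
For reference, the shape of the fix was roughly the sketch below. This is a
simplified, from-memory approximation rather than the exact patch: the
"ip route get" test stands in for whatever reachability check the agent
actually performs at that point, and __OCF_ACTION is assumed to hold the
current action name as in the shellfuncs linked above.

# Simplified ocf_is_probe: treat the initial monitor call as a probe.
# (The stock version also tests "$OCF_RESKEY_CRM_meta_interval" with -a,
# which failed here with "too many arguments", so that part is omitted.)
ocf_is_probe() {
    [ "$__OCF_ACTION" = "monitor" ]
}

# In the Route agent's gateway check: during a probe, an unreachable
# gateway just means the resource is not running here (rc 7), rather
# than a hard "not installed" error (rc 5).
if ! ip route get "$OCF_RESKEY_gateway" >/dev/null 2>&1; then
    if ocf_is_probe; then
        exit $OCF_NOT_RUNNING
    fi
    ocf_log err "Gateway address $OCF_RESKEY_gateway is unreachable."
    exit $OCF_ERR_INSTALLED
fi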


Now that that issue is resolved for the time being....

How would I arrange it so that, once resRoute takes down the route when
passing control back to the master server, the standby node activates an
alternative route to get itself back online through the 192.168.0.1
gateway? I don't even know where to begin to get this logic in place. All
I know is that it has something to do with colocation, but how exactly I'm
uncertain. Any advice and examples would be appreciated.
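
To illustrate what I'm imagining, maybe something vaguely along these lines,
though the fallbackRoute resource, the scores, and the constraint names here
are just my guess and I'm not at all sure the logic is right:

primitive fallbackRoute ocf:heartbeat:Route \
        params destination="0.0.0.0/0" gateway="192.168.0.1" \
        op monitor interval="10s"
colocation fallback-not-with-router -inf: fallbackRoute router_cluster
order router-then-fallback inf: router_cluster fallbackRoute

The intent being that the fallback default route via 192.168.0.1 is only
ever active on whichever node does not hold router_cluster. Is that the
right direction, or is there a better way to express it?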

--
Eric Renfro




