[Pacemaker] Recovery after simple master-master failover

Jake Smith jsmith at argotec.com
Tue Feb 21 18:09:30 EST 2012


----- Original Message -----
> From: "David Gubler" <dg at doodle.com>
> To: pacemaker at oss.clusterlabs.org
> Sent: Tuesday, February 21, 2012 8:04:34 AM
> Subject: [Pacemaker] Recovery after simple master-master failover
> 
> Hi list,
> 
> We have two entry servers (running Apache on Debian Squeeze/Pacemaker
> 1.0.9 with Heartbeat), both of which are active at the same time.
> Users may use either of these two servers at any time.
> 
> Now, if one of them fails, users should all be redirected to the
> other server, as transparently as possible, using two virtual IP
> addresses.
> 
> I absolutely don't want Pacemaker interfering with Apache - all I
> want it to do is monitor Apache and move the IP address if it goes
> down.
> 
> 
> Thus, I set up this configuration (simplified, IPv6 removed):
> 
> node $id="101b0c74-2fd5-46a5-bb65-702cb3188c11" entry1
> node $id="6ec6b85c-c44c-406d-97aa-1a8da56dc041" entry2
> primitive apache ocf:heartbeat:apache \
> 	params statusurl="http://localhost/server-status" \
> 	op monitor interval="30s" \
> 	meta is-managed="false"
> primitive siteIp4A ocf:heartbeat:IPaddr \
> 	params ip="188.92.145.78" cidr_netmask="255.255.255.192" nic="eth0" \
> 	op monitor interval="15s"
> primitive siteIp4B ocf:heartbeat:IPaddr \
> 	params ip="188.92.145.79" cidr_netmask="255.255.255.192" nic="eth0" \
> 	op monitor interval="15s"
> clone apacheClone apache
> colocation coloDistribute -100: siteIp4A siteIp4B
> colocation coloSiteA inf: siteIp4A apacheClone
> colocation coloSiteB inf: siteIp4B apacheClone
> property $id="cib-bootstrap-options" \
> 	dc-version="1.0.9-74392a28b7f31d7ddc86689598bd23114f58978b" \
> 	cluster-infrastructure="Heartbeat" \
> 	stonith-enabled="false" \
> 	last-lrm-refresh="1329758239"
> 
> 
> Yes, I know the usual disclaimer about stonith, but we don't care,
> because the worst thing that could happen is that both nodes take
> both IP addresses, which is a risk we can totally live with. Even if
> that situation happens, Pacemaker recovers from it as soon as the
> two nodes see each other again.
> 
> So far, so good - failover appears to work (e.g. if I simulate a
> monitor failure using iptables to cut off the monitor), but:
> 
> 1. After the failed Apache comes back up, Pacemaker doesn't notice
> this unless I do a manual resource cleanup. I think this is because
> the monitor is stopped on failure. I have played with
> monitor on-fail="ignore" and "restart"
> and
> failure-timeout=60s
> on the "apache" primitive, but no luck - the cluster doesn't notice
> that Apache is back up.
> I need this to happen automatically, because monitor failures can
> happen from time to time, and I do not want to use
> migration-threshold because I really want a quick failover.
> Yes, I know, I could run a cronjob doing a cleanup every minute, but
> that cannot be the way to go, right? Especially since that might
> have other side effects (IPs stopped during cleanup or the like?)
> 

Still probably not the nicest/cleanest solution, but you could set up a cronjob that runs 'crm resource reprobe node_name'. That will recheck the node for resources the cluster didn't start itself, and it avoids the side effects of the cleanup actions.
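
As a sketch, assuming the crm binary lives in /usr/sbin and the local node is entry1 (both are assumptions - adjust for your systems):

# /etc/cron.d/pacemaker-reprobe (hypothetical file)
# Re-probe once a minute so the cluster notices when the
# unmanaged Apache has come back up on its own.
* * * * * root /usr/sbin/crm resource reprobe entry1 >/dev/null 2>&1

A reprobe only rechecks resource state, so unlike a full 'crm resource cleanup' it shouldn't touch the running IP resources.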

> 2. When I reconfigure things or restart Heartbeat (and Pacemaker
> with it), the apache primitive can get into the "orphaned" state,
> which means that Pacemaker will stop it. While this may be
> reasonable for the IP primitives, it looks like a bug for a resource
> with is-managed="false" (I mean, which part of "do not start or stop
> this resource" does Pacemaker not understand?). Unfortunately, I
> couldn't find any way to disable this behaviour except for the
> global "stop-orphan-actions" option, which is probably not what I
> want. Am I missing something here?

What about an 'on-fail' in the op monitor section - probably with on-fail="ignore"?
More on that one here:
http://www.clusterlabs.org/doc/en-US/Pacemaker/1.0/html/Pacemaker_Explained/s-resource-operations.html
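
For example, a sketch of how that could look on your apache primitive (whether "ignore" is the right policy here is an assumption - the doc above lists the other values):

primitive apache ocf:heartbeat:apache \
	params statusurl="http://localhost/server-status" \
	op monitor interval="30s" on-fail="ignore" \
	meta is-managed="false"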

HTH

Jake

> 
> I have spent hours trying to figure out how this is supposed to work,
> but no dice :(
> 
> Any help would be greatly appreciated. Thanks!
> 
> Best regards,
> 
> David
> 
> --
> David Gubler
> Senior Software & Operations Engineer
> MeetMe: http://doodle.com/david
> E-Mail: dg at doodle.com
> 
> _______________________________________________
> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> 
> Project Home: http://www.clusterlabs.org
> Getting started:
> http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
> 