[Pacemaker] Issues with HA cluster for mysqld

Jake Smith jsmith at argotec.com
Thu Aug 23 16:19:05 EDT 2012


----- Original Message -----
> From: "David Parker" <dparker at utica.edu>
> To: pacemaker at oss.clusterlabs.org
> Sent: Thursday, August 23, 2012 12:56:33 PM
> Subject: Re: [Pacemaker] Issues with HA cluster for mysqld
> 
> 
> On 08/23/2012 10:17 AM, David Parker wrote:
> > On 08/23/2012 09:01 AM, Jake Smith wrote:
> >> ----- Original Message -----
> >>> From: "David Parker"<dparker at utica.edu>
> >>> To: pacemaker at oss.clusterlabs.org
> >>> Sent: Wednesday, August 22, 2012 2:49:32 PM
> >>> Subject: [Pacemaker] Issues with HA cluster for mysqld
> >>>
> >>> Hello,
> >>>
> >>> I'm trying to set up a 2-node, active-passive HA cluster for
> >>> MySQL
> >>> using
> >>> heartbeat and Pacemaker.  The operating system is Debian Linux
> >>> 6.0.5
> >>> 64-bit, and I am using the heartbeat packages installed via
> >>> apt-get.
> >>> The servers involved are the SQL nodes of a running MySQL
> >>> cluster, so
> >>> the only service I need HA for is the MySQL daemon (mysqld).
> >>>
> >>> What I would like to do is have a single virtual IP address which
> >>> clients use to query MySQL, and have the IP and mysqld fail over
> >>> to
> >>> the
> >>> passive node in the event of a failure on the active node.  I
> >>> have
> >>> read
> >>> through a lot of the heartbeat and Pacemaker documentation, and
> >>> here
> >>> are
> >>> the resources I have configured for the cluster:
> >>>
> >>> * A custom LSB script for mysqld (compliant with Pacemaker's
> >>> requirements as outlined in the documentation)
> >>> * An iLO2-based STONITH device using riloe (both servers are HP
> >>> Proliant
> >>> DL380 G5)
> >>> * A virtual IP address for mysqld using IPaddr2
> >>>
> >>> I believe I have configured everything correctly, but I'm not
> >>> positive.
> >>> Anyway, when I start heartbeat and pacemaker
> >>> (/etc/init.d/heartbeat
> >>> start), everything seems to be ok.  However, the virtual IP never
> >>> comes
> >>> up, and the output of "crm_resource -LV" indicates that something
> >>> is
> >>> wrong:
> >>>
> >>> root at ha1:~# crm_resource -LV
> >>> crm_resource[28988]: 2012/08/22_14:41:23 WARN: unpack_rsc_op:
> >>> Processing
> >>> failed op stonith_start_0 on ha1: unknown error (1)
> >>>    stonith        (stonith:external/riloe) Started
> >>>    MysqlIP        (ocf::heartbeat:IPaddr2) Stopped
> >>>    mysqld (lsb:mysqld) Started
> >> It looks like you only have one STONITH resource defined... you
> >> need
> >> one per server (or to clone the one but that usually applies in
> >> blades not standalone servers).  And then you would add location
> >> constraints not allowing ha1's stonith to run on ha1 and ha2's
> >> stonith not run on ha2 (can't shoot yourself).  That way each
> >> server
> >> has the ability to stonith the other. Nothing *should* run if your
> >> stonith fails and you have stonith enabled.
> >>
> >> HTH
> >>
> >> Jake
> >
> > Thanks!  Can you clarify how I would go about putting those
> > constraints in place?  I've been following Andrew's "Configuration
> > Explained" document, and I think I have a grasp on most of these
> > things, but it's not clear to me how I can constrain a STONITH
> > device
> > to only one node.  Also, following the example in the
> > documentation, I
> > added these location constraints to the other resources:
> >
> > <constraints>
> > <rsc_location id="loc-1" rsc="MysqlIP" node="ha1" score="200"/>
> > <rsc_location id="loc-2" rsc="MysqlIP" node="ha2" score="0"/>
> > <rsc_location id="loc-3" rsc="mysqld" node="ha1" score="200"/>
> > <rsc_location id="loc-4" rsc="mysqld" node="ha2" score="0"/>
> > </constraints>
> >
> > I'm trying to make ha1 the preferred node for both mysqld and the
> > virtual IP.  Do these look correct for that?
> >
> >>> When I attempt to stop heartbeat and Pacemaker
> >>> (/etc/init.d/heartbeat
> >>> stop) it says "Stopping High-Availability services:" and then
> >>> hangs
> >>> for
> >>> about 5 minutes before finally stopping the services.
> >>>
> >>> So, I'm left with a couple of questions.  Is there something
> >>> wrong
> >>> with
> >>> my configuration?  Is there a reason why the HA services can't
> >>> shut
> >>> down
> >>> in a timely manner?  Is there something else I need to do to get
> >>> the
> >>> virtual IP working?  Thanks in advance for any help!
> >
> > Would the misconfigured STONITH resources be causing the long
> > shutdown
> > delays?
> >
> 
> Okay, I think I've almost got this.  I updated my Pacemaker config
> and
> made a few changes.  I put the MysqlIP and mysqld primitives into a
> resource group called "mysqld-resources", ordered them such that
> mysqld
> will always wait for MysqlIP to be ready first, and added constraints
> to
> make ha1 the preferred host for the mysqld-resources group and ha2
> the
> failover host.  I also created STONITH devices for both ha1 and ha2,
> and
> added constraints to fix the STONITH location issues.  My new
> constraints section looks like this:
> 
> <constraints>
> <rsc_location id="loc-1" rsc="stonith-ha1" node="ha2"
> score="INFINITY"/>
> <rsc_location id="loc-2" rsc="stonith-ha2" node="ha1"
> score="INFINITY"/>

You don't need the two constraints above as long as you have the two negative stonith locations below.  I prefer the negative constraints because, if you ever expand beyond two nodes, each node's stonith resource can then run on any node except itself.

> <rsc_location id="loc-3" rsc="stonith-ha1" node="ha1"
> score="-INFINITY"/>
> <rsc_location id="loc-4" rsc="stonith-ha2" node="ha2"
> score="-INFINITY"/>
> <rsc_location id="loc-5" rsc="mysql-resources" node="ha1"
> score="200"/>

You don't need the 0-score constraint below either - the 200 above takes care of it.  I'm pretty sure having no location constraint at all is equivalent to a 0-score location constraint.  (See the crm shell sketch after your constraints block.)

> <rsc_location id="loc-6" rsc="mysql-resources" node="ha2" score="0"/>
> </constraints>
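
For reference, here's roughly what the trimmed-down version looks like in crm shell syntax.  The resource names and ids are taken from your config (note your text says "mysqld-resources" but the constraints say "mysql-resources" - make sure the group name matches whatever you actually defined), and I've left the riloe params out, so keep whatever your working stonith resource already uses.  This is an untested sketch, so check it against 'crm configure help' before loading it:

  # one stonith resource per node, each banned from running on its own node
  primitive stonith-ha1 stonith:external/riloe
  primitive stonith-ha2 stonith:external/riloe
  location loc-3 stonith-ha1 -inf: ha1
  location loc-4 stonith-ha2 -inf: ha2
  # the group keeps the IP and mysqld together (IP starts first),
  # and a single 200-point preference makes ha1 the preferred node
  group mysql-resources MysqlIP mysqld
  location loc-5 mysql-resources 200: ha1

That's the whole constraints picture - no positive stonith locations and no 0-score constraint needed.
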
> 
> Everything seems to work.  I had the virtual IP and mysqld running on
> ha1, and not on ha2.  I shut down ha1 using "poweroff -n" and both
> the
> virtual IP and mysqld came up on ha2 almost instantly.  When I
> powered
> ha1 on again, ha2 shut down the virtual IP and mysqld.  The
> virtual
> IP moved over instantly; a continuous ping of the IP produced one
> "Time
> to live exceeded" message and one packet was lost, but that's to be
> expected.  However, mysqld took almost 30 seconds to start up on ha1
> after being stopped on ha2, and I'm not exactly sure why.
> 
> Here's the relevant log output from ha2:
> 
> Aug 23 11:42:48 ha2 crmd: [1166]: info: te_rsc_command: Initiating
> action 16: stop mysqld_stop_0 on ha2 (local)
> Aug 23 11:42:48 ha2 crmd: [1166]: info: do_lrm_rsc_op: Performing
> key=16:1:0:ec1989a8-ff84-4fc5-9f48-88e9b285797c op=mysqld_stop_0 )
> Aug 23 11:42:48 ha2 lrmd: [1163]: info: rsc:mysqld:10: stop
> Aug 23 11:42:50 ha2 lrmd: [1163]: info: RA output:
> (mysqld:stop:stdout)
> Stopping MySQL daemon: mysqld_safe.
> Aug 23 11:42:50 ha2 crmd: [1166]: info: process_lrm_event: LRM
> operation
> mysqld_stop_0 (call=10, rc=0, cib-update=57, confirmed=true) ok
> Aug 23 11:42:50 ha2 crmd: [1166]: info: match_graph_event: Action
> mysqld_stop_0 (16) confirmed on ha2 (rc=0)
> 
> And here's the relevant log output from ha1:
> 
> Aug 23 11:42:47 ha1 crmd: [1243]: info: do_lrm_rsc_op: Performing
> key=8:1:7:ec1989a8-ff84-4fc5-9f48-88e9b285797c op=mysqld_monitor_0 )
> Aug 23 11:42:47 ha1 lrmd: [1240]: info: rsc:mysqld:5: probe
> Aug 23 11:42:47 ha1 crmd: [1243]: info: process_lrm_event: LRM
> operation
> mysqld_monitor_0 (call=5, rc=7, cib-update=10, confirmed=true) not
> running
> Aug 23 11:43:36 ha1 crmd: [1243]: info: do_lrm_rsc_op: Performing
> key=11:3:0:ec1989a8-ff84-4fc5-9f48-88e9b285797c op=mysqld_start_0 )
> Aug 23 11:43:36 ha1 lrmd: [1240]: info: rsc:mysqld:11: start
> Aug 23 11:43:36 ha1 lrmd: [1240]: info: RA output:
> (mysqld:start:stdout)
> Starting MySQL daemon: mysqld_safe.#012(See
> /usr/local/mysql/data/mysql.messages for messages).
> Aug 23 11:43:36 ha1 crmd: [1243]: info: process_lrm_event: LRM
> operation
> mysqld_start_0 (call=11, rc=0, cib-update=18, confirmed=true) ok
> 
> So, ha2 stopped mysqld at 11:42:50, but ha1 didn't start mysqld until
> 11:43:36, a full 46 seconds after it was stopped on ha2.  Any ideas
> why
> the delay for mysqld was so long, when the MysqlIP resource moved
> almost
> instantly?

Couple thoughts.

Are you sure both servers have the same time (in sync)?
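
A quick way to check, assuming you have ntpd (or at least the ntp tools) installed on both boxes - the exact commands are just an example:

  # run on each node and compare
  date
  ntpq -p

Even a few seconds of skew will make cross-node timestamp comparisons in those logs misleading.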

On ha2, did you verify that mysqld was actually done stopping at the 11:42:50 mark?
I don't use MySQL, so I can't say from experience.
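
One way to tell (path taken from your ha1 start log, assuming ha2 uses the same layout) is to look at the MySQL log timestamps on ha2 right after the stop:

  tail /usr/local/mysql/data/mysql.messages

If mysqld entries continue past 11:42:50, the LSB script is reporting the stop as done before mysqld has actually exited.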

Just curious, but do you really want it to fail back if it's actively running on ha2?
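
If not, a default resource stickiness higher than your 200 preference for ha1 will keep the group wherever it's currently running instead of failing back - something like this (the value is just an example):

  crm configure rsc_defaults resource-stickiness=1000

With that set, the group only moves when a node actually fails, not when ha1 comes back online.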

Could you include the output of 'crm configure show' next time?  I can read that much more quickly than the XML Pacemaker config :-)

Jake



