[Pacemaker] Issues with HA cluster for mysqld

Dejan Muhamedagic dejanmm at fastmail.fm
Fri Aug 24 03:17:00 EDT 2012


Hi,

On Thu, Aug 23, 2012 at 04:47:11PM -0400, David Parker wrote:
> 
> On 08/23/2012 04:19 PM, Jake Smith wrote:
> >>Okay, I think I've almost got this.  I updated my Pacemaker config
> >>and made a few changes.  I put the MysqlIP and mysqld primitives into
> >>a resource group called "mysql-resources", ordered them such that
> >>mysqld will always wait for MysqlIP to be ready first, and added
> >>constraints to make ha1 the preferred host for the mysql-resources
> >>group and ha2 the failover host.  I also created STONITH devices for
> >>both ha1 and ha2, and added constraints to fix the STONITH location
> >>issues.  My new constraints section looks like this:
> >>
> >><constraints>
> >><rsc_location id="loc-1" rsc="stonith-ha1" node="ha2" score="INFINITY"/>
> >><rsc_location id="loc-2" rsc="stonith-ha2" node="ha1" score="INFINITY"/>
> >Don't need the 2 above as long as you have the 2 negative locations below for the stonith resources.  I prefer the negative ones because, if you ever expand to more than 2 nodes, the stonith resource for any node can run on any node but itself.
> 
> Good call.  I'll take those out of the config.
> 
> >><rsc_location id="loc-3" rsc="stonith-ha1" node="ha1"
> >>score="-INFINITY"/>
> >><rsc_location id="loc-4" rsc="stonith-ha2" node="ha2"
> >>score="-INFINITY"/>
> >><rsc_location id="loc-5" rsc="mysql-resources" node="ha1"
> >>score="200"/>
> >Don't need the 0 score below either - the 200 above will take care of it.  Pretty sure having no location constraint for a node is the same as a 0-score location.
> 
> That was based on the example found in the documentation.  If I
> don't have the 0 score entry, will the service still fail over?
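
It will - more on that below where loc-6 comes up again.  If you want
to convince yourself without pulling a plug, the crm shell can move the
group by hand, something like:

  crm resource migrate mysql-resources ha2
  crm resource unmigrate mysql-resources
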
> 
> >><rsc_location id="loc-6" rsc="mysql-resources" node="ha2" score="0"/>
> >></constraints>
> >>
> >>Everything seems to work.  I had the virtual IP and mysqld running
> >>on ha1, and not on ha2.  I shut down ha1 using "poweroff -n" and
> >>both the virtual IP and mysqld came up on ha2 almost instantly.
> >>When I powered ha1 on again, ha2 shut down the virtual IP and
> >>mysqld.  The virtual IP moved over instantly; a continuous ping of
> >>the IP produced one "Time to live exceeded" message and one packet
> >>was lost, but that's to be expected.  However, mysqld took almost 30
> >>seconds to start up on ha1 after being stopped on ha2, and I'm not
> >>exactly sure why.
> >>
> >>Here's the relevant log output from ha2:
> >>
> >>Aug 23 11:42:48 ha2 crmd: [1166]: info: te_rsc_command: Initiating action 16: stop mysqld_stop_0 on ha2 (local)
> >>Aug 23 11:42:48 ha2 crmd: [1166]: info: do_lrm_rsc_op: Performing key=16:1:0:ec1989a8-ff84-4fc5-9f48-88e9b285797c op=mysqld_stop_0 )
> >>Aug 23 11:42:48 ha2 lrmd: [1163]: info: rsc:mysqld:10: stop
> >>Aug 23 11:42:50 ha2 lrmd: [1163]: info: RA output: (mysqld:stop:stdout) Stopping MySQL daemon: mysqld_safe.
> >>Aug 23 11:42:50 ha2 crmd: [1166]: info: process_lrm_event: LRM operation mysqld_stop_0 (call=10, rc=0, cib-update=57, confirmed=true) ok
> >>Aug 23 11:42:50 ha2 crmd: [1166]: info: match_graph_event: Action mysqld_stop_0 (16) confirmed on ha2 (rc=0)
> >>
> >>And here's the relevant log output from ha1:
> >>
> >>Aug 23 11:42:47 ha1 crmd: [1243]: info: do_lrm_rsc_op: Performing key=8:1:7:ec1989a8-ff84-4fc5-9f48-88e9b285797c op=mysqld_monitor_0 )
> >>Aug 23 11:42:47 ha1 lrmd: [1240]: info: rsc:mysqld:5: probe
> >>Aug 23 11:42:47 ha1 crmd: [1243]: info: process_lrm_event: LRM operation mysqld_monitor_0 (call=5, rc=7, cib-update=10, confirmed=true) not running
> >>Aug 23 11:43:36 ha1 crmd: [1243]: info: do_lrm_rsc_op: Performing key=11:3:0:ec1989a8-ff84-4fc5-9f48-88e9b285797c op=mysqld_start_0 )
> >>Aug 23 11:43:36 ha1 lrmd: [1240]: info: rsc:mysqld:11: start
> >>Aug 23 11:43:36 ha1 lrmd: [1240]: info: RA output: (mysqld:start:stdout) Starting MySQL daemon: mysqld_safe.#012(See /usr/local/mysql/data/mysql.messages for messages).
> >>Aug 23 11:43:36 ha1 crmd: [1243]: info: process_lrm_event: LRM operation mysqld_start_0 (call=11, rc=0, cib-update=18, confirmed=true) ok
> >>
> >>So, ha2 stopped mysqld at 11:42:50, but ha1 didn't start mysqld
> >>until 11:43:36, a full 46 seconds after it was stopped on ha2.  Any
> >>ideas why the delay for mysqld was so long, when the MysqlIP
> >>resource moved almost instantly?
> >Couple thoughts.
> >
> >Are you sure both servers have the same time (in sync)?
> 
> Yep.  They're both using NTP.
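
For the record, "using NTP" and "in sync" aren't always the same thing;
a quick check is to run this on both nodes:

  ntpq -p

and compare the offset column (it should be within a few milliseconds).
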
> 
> >On HA2, did you verify mysqld was actually done stopping at the
> >11:42:50 mark?  I don't use mysql, so I can't say from experience.
> 
> Yes, I kept checking (with "ps -ef | grep mysqld") every few
> seconds, and it stopped running around that time.  As soon as it
> stopped running on ha2, I started checking on ha1, and it was quite
> a while before mysqld started.  I knew it was at least 30 seconds,
> and I believe it was actually 46 seconds, as the logs indicate.
> 
> >Just curious, but do you really want it to fail back if it's actively running on ha2?
> 
> Interesting point.  I had just assumed that it was good practice to
> have a preferred node for a service, but I guess it doesn't matter.
> If I don't care which node the services run on, do I just remove the
> location constraints for the "mysql-resources" group altogether?
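
Yes, if you really don't care, just delete loc-5 and loc-6 and the
cluster will place the group on its own.  If you'd rather keep a mild
preference but avoid the failback, give resources stickiness higher
than the location score, e.g. (a sketch, assuming your crm shell
supports rsc_defaults):

  crm configure rsc_defaults resource-stickiness="300"

With stickiness 300 against a preference of 200, the group stays on ha2
after a failover instead of moving back.
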
> 
> >Could you include the output of 'crm configure show' next time?  I read that much better/quicker than the XML pacemaker config :-)
> >
> >Jake
> 
> Thanks so much for all of your help, Jake!  I'm new to all of this,
> and I really appreciate it.
> 
> Here's the requested output:
> 
> root at ha1:~# crm configure show
> node $id="1b48f410-44d1-4e89-8b52-ff23b32db1bc" ha1
> node $id="9790fe6e-67b2-4817-abf4-966b5aa6948c" ha2
> primitive MysqlIP ocf:heartbeat:IPaddr2 \
>         params ip="192.168.25.9" cidr_netmask="32" \
>         op monitor interval="10s"
> primitive mysqld lsb:mysqld
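
Note that the mysqld primitive has no monitor operation, so the cluster
will never notice if mysql dies on its own.  Something like this should
fix that (the interval is just a guess):

  primitive mysqld lsb:mysqld \
          op monitor interval="30s"
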
> primitive stonith-ha1 stonith:external/riloe \
>         params hostlist="ha1" ilo_hostname="10.0.1.111" ilo_user="Administrator" \
>         ilo_password="XXXXXXXX" ilo_can_reset="1" ilo_protocol="2.0" \
>         ilo_powerdown_method="button"
> primitive stonith-ha2 stonith:external/riloe \
>         params hostlist="ha2" ilo_hostname="10.0.1.112" ilo_user="Administrator" \
>         ilo_password="XXXXXXXX" ilo_can_reset="1" ilo_protocol="2.0" \
>         ilo_powerdown_method="button"
> group mysql-resources MysqlIP mysqld
> location loc-1 stonith-ha1 inf: ha2
> location loc-2 stonith-ha2 inf: ha1

loc-1 and loc-2 are superfluous.
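
The -inf rules below already keep each stonith resource off its own
node, and everywhere else is allowed by default.  That's the nice
property Jake mentioned: if you ever add a third node (say ha3,
hypothetical), the pattern stays one line per node:

  location loc-7 stonith-ha3 -inf: ha3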

> location loc-3 stonith-ha1 -inf: ha1
> location loc-4 stonith-ha2 -inf: ha2

> location loc-5 mysql-resources 200: ha1
> location loc-6 mysql-resources 0: ha2

loc-6 is a no-op.
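
With symmetric-cluster="true" (the default), every node already has an
implicit score of 0 for every resource, so loc-6 adds nothing and
failover to ha2 works without it.  You can simply drop it:

  crm configure delete loc-6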

Thanks,

Dejan

> property $id="cib-bootstrap-options" \
>         dc-version="1.0.9-74392a28b7f31d7ddc86689598bd23114f58978b" \
>         cluster-infrastructure="Heartbeat" \
>         stonith-enabled="true"
> 
> Also, I verified that STONITH is working.  I unplugged the network
> cable on ha1 when the virtual IP and mysqld were running.  ha2
> promptly took over the services and used STONITH to shut down ha1
> via iLO.  So, that part works flawlessly.  There was once again
> a delay between the mysqld shutdown on ha2 and startup on ha1 after
> I brought ha1 back online, though.  Not as bad as before, about 25
> seconds this time.  It seems that the delay only occurs when the
> non-preferred node relinquishes control of the resources back to
> their preferred node following a failover.  If I stop preferring one
> node for the services, this might not be an issue any longer.
> 
>     - Dave
> 
> -- 
> 
> Dave Parker
> Systems Administrator
> Utica College
> Integrated Information Technology Services
> (315) 792-3229
> Registered Linux User #408177
> 
> 
> _______________________________________________
> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org



