[Pacemaker] Issues with HA cluster for mysqld

David Parker dparker at utica.edu
Thu Aug 23 16:47:11 EDT 2012


On 08/23/2012 04:19 PM, Jake Smith wrote:
>> Okay, I think I've almost got this.  I updated my Pacemaker config and
>> made a few changes.  I put the MysqlIP and mysqld primitives into a
>> resource group called "mysql-resources", ordered them such that mysqld
>> will always wait for MysqlIP to be ready first, and added constraints
>> to make ha1 the preferred host for the mysql-resources group and ha2
>> the failover host.  I also created STONITH devices for both ha1 and
>> ha2, and added constraints to fix the STONITH location issues.  My new
>> constraints section looks like this:
>>
>> <constraints>
>> <rsc_location id="loc-1" rsc="stonith-ha1" node="ha2" score="INFINITY"/>
>> <rsc_location id="loc-2" rsc="stonith-ha2" node="ha1" score="INFINITY"/>
> You don't need the two above as long as you have the two negative STONITH location constraints below.  I prefer the negative form because if you ever expand beyond 2 nodes, the STONITH resource for any node can run on any node but itself.

Good call.  I'll take those out of the config.
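
If I'm following you, the trimmed-down STONITH constraints would just be
the two negative locations, i.e. something like this in crm shell syntax
(an untested sketch based on my current config):

    location loc-3 stonith-ha1 -inf: ha1
    location loc-4 stonith-ha2 -inf: ha2

That way, if we ever add a third node, each STONITH resource can still
run anywhere except on the node it is meant to fence.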

>> <rsc_location id="loc-3" rsc="stonith-ha1" node="ha1"
>> score="-INFINITY"/>
>> <rsc_location id="loc-4" rsc="stonith-ha2" node="ha2"
>> score="-INFINITY"/>
>> <rsc_location id="loc-5" rsc="mysql-resources" node="ha1"
>> score="200"/>
> You don't need the 0 score below either - the 200 above will take care of it.  Pretty sure having no location constraint at all is the same as a 0-score location constraint.

That was based on the example found in the documentation.  If I don't 
have the 0 score entry, will the service still fail over?
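
Just to make sure I'm reading you right, the constraints for the group
would then be only the single positive entry, something like:

    <rsc_location id="loc-5" rsc="mysql-resources" node="ha1" score="200"/>

with failover to ha2 relying on what you said above, i.e. no constraint
behaving the same as a 0 score.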

>> <rsc_location id="loc-6" rsc="mysql-resources" node="ha2" score="0"/>
>> </constraints>
>>
>> Everything seems to work.  I had the virtual IP and mysqld running on
>> ha1, and not on ha2.  I shut down ha1 using "poweroff -n" and both the
>> virtual IP and mysqld came up on ha2 almost instantly.  When I powered
>> ha1 on again, ha2 shut down the virtual IP and mysqld.  The virtual IP
>> moved over instantly; a continuous ping of the IP produced one "Time to
>> live exceeded" message and one packet was lost, but that's to be
>> expected.  However, mysqld took almost 30 seconds to start up on ha1
>> after being stopped on ha2, and I'm not exactly sure why.
>>
>> Here's the relevant log output from ha2:
>>
>> Aug 23 11:42:48 ha2 crmd: [1166]: info: te_rsc_command: Initiating action 16: stop mysqld_stop_0 on ha2 (local)
>> Aug 23 11:42:48 ha2 crmd: [1166]: info: do_lrm_rsc_op: Performing key=16:1:0:ec1989a8-ff84-4fc5-9f48-88e9b285797c op=mysqld_stop_0 )
>> Aug 23 11:42:48 ha2 lrmd: [1163]: info: rsc:mysqld:10: stop
>> Aug 23 11:42:50 ha2 lrmd: [1163]: info: RA output: (mysqld:stop:stdout) Stopping MySQL daemon: mysqld_safe.
>> Aug 23 11:42:50 ha2 crmd: [1166]: info: process_lrm_event: LRM operation mysqld_stop_0 (call=10, rc=0, cib-update=57, confirmed=true) ok
>> Aug 23 11:42:50 ha2 crmd: [1166]: info: match_graph_event: Action mysqld_stop_0 (16) confirmed on ha2 (rc=0)
>>
>> And here's the relevant log output from ha1:
>>
>> Aug 23 11:42:47 ha1 crmd: [1243]: info: do_lrm_rsc_op: Performing key=8:1:7:ec1989a8-ff84-4fc5-9f48-88e9b285797c op=mysqld_monitor_0 )
>> Aug 23 11:42:47 ha1 lrmd: [1240]: info: rsc:mysqld:5: probe
>> Aug 23 11:42:47 ha1 crmd: [1243]: info: process_lrm_event: LRM operation mysqld_monitor_0 (call=5, rc=7, cib-update=10, confirmed=true) not running
>> Aug 23 11:43:36 ha1 crmd: [1243]: info: do_lrm_rsc_op: Performing key=11:3:0:ec1989a8-ff84-4fc5-9f48-88e9b285797c op=mysqld_start_0 )
>> Aug 23 11:43:36 ha1 lrmd: [1240]: info: rsc:mysqld:11: start
>> Aug 23 11:43:36 ha1 lrmd: [1240]: info: RA output: (mysqld:start:stdout) Starting MySQL daemon: mysqld_safe.#012(See /usr/local/mysql/data/mysql.messages for messages).
>> Aug 23 11:43:36 ha1 crmd: [1243]: info: process_lrm_event: LRM operation mysqld_start_0 (call=11, rc=0, cib-update=18, confirmed=true) ok
>>
>> So, ha2 stopped mysqld at 11:42:50, but ha1 didn't start mysqld until
>> 11:43:36, a full 46 seconds after it was stopped on ha2.  Any ideas why
>> the delay for mysqld was so long, when the MysqlIP resource moved
>> almost instantly?
> Couple thoughts.
>
> Are you sure both servers have the same time (in sync)?

Yep.  They're both using NTP.

> On HA2, did you verify mysqld was actually done stopping at the 11:42:50 mark?
> I don't use mysql so I can't say from experience.

Yes, I kept checking (with "ps -ef | grep mysqld") every few seconds,
and it stopped running around that time.  As soon as it stopped running
on ha2, I started checking on ha1, and it was quite a while before mysqld
started.  I knew it was at least 30 seconds, and I believe it was
actually 46 seconds, as the logs indicate.
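
Next time I'll probably watch the transitions with crm_mon instead of
polling ps.  For example, a one-shot status snapshot on each node during
the failover:

    crm_mon -1

shows which node each resource is active on from the cluster's point of
view, rather than from the process table.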

> Just curious, but do you really want it to fail back if it's actively running on ha2?

Interesting point.  I had just assumed that it was good practice to have 
a preferred node for a service, but I guess it doesn't matter.  If I 
don't care which node the services run on, do I just remove the location 
constraints for the "mysql-resources" group altogether?
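
If so, I'm guessing that would just be something like this (assuming the
crm shell lets me delete constraints by their ids):

    crm configure delete loc-5
    crm configure delete loc-6

which should leave the group free to keep running on whichever node it
is already on.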

> Could you include the output of '$crm configure show' next time?  I read that much better/quicker than the xml pacemaker config :-)
>
> Jake

Thanks so much for all of your help, Jake!  I'm new to all of this, and 
I really appreciate it.

Here's the requested output:

root at ha1:~# crm configure show
node $id="1b48f410-44d1-4e89-8b52-ff23b32db1bc" ha1
node $id="9790fe6e-67b2-4817-abf4-966b5aa6948c" ha2
primitive MysqlIP ocf:heartbeat:IPaddr2 \
         params ip="192.168.25.9" cidr_netmask="32" \
         op monitor interval="10s"
primitive mysqld lsb:mysqld
primitive stonith-ha1 stonith:external/riloe \
         params hostlist="ha1" ilo_hostname="10.0.1.111" \
         ilo_user="Administrator" ilo_password="XXXXXXXX" \
         ilo_can_reset="1" ilo_protocol="2.0" ilo_powerdown_method="button"
primitive stonith-ha2 stonith:external/riloe \
         params hostlist="ha2" ilo_hostname="10.0.1.112" \
         ilo_user="Administrator" ilo_password="XXXXXXXX" \
         ilo_can_reset="1" ilo_protocol="2.0" ilo_powerdown_method="button"
group mysql-resources MysqlIP mysqld
location loc-1 stonith-ha1 inf: ha2
location loc-2 stonith-ha2 inf: ha1
location loc-3 stonith-ha1 -inf: ha1
location loc-4 stonith-ha2 -inf: ha2
location loc-5 mysql-resources 200: ha1
location loc-6 mysql-resources 0: ha2
property $id="cib-bootstrap-options" \
         dc-version="1.0.9-74392a28b7f31d7ddc86689598bd23114f58978b" \
         cluster-infrastructure="Heartbeat" \
         stonith-enabled="true"

Also, I verified that STONITH is working.  I unplugged the network cable 
on ha1 when the virtual IP and mysqld were running.  ha2 promptly took 
over the services and used STONITH to shut down ha1 via iLO.  So, that 
part works flawlessly.  There was once again a delay between the
mysqld shutdown on ha2 and startup on ha1 after I brought ha1 back 
online, though.  Not as bad as before, about 25 seconds this time.  It 
seems that the delay only occurs when the non-preferred node 
relinquishes control of the resources back to their preferred node 
following a failover.  If I stop preferring one node for the services, 
this might not be an issue any longer.
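
Alternatively, from what I've read, a resource-stickiness default might
accomplish the same thing while keeping a mild preference for ha1.
Something like this (untested; the 1000 is just a guess at a value that
outweighs the 200 location score):

    crm configure rsc_defaults resource-stickiness=1000

should make the group stay on ha2 once it has failed over there, instead
of failing back and eating that second stop/start delay.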

     - Dave


-- 

Dave Parker
Systems Administrator
Utica College
Integrated Information Technology Services
(315) 792-3229
Registered Linux User #408177




