[ClusterLabs] Questions about pacemaker/ mysql resource agent behaviour when network fail

Simon Bomm simon.bomm at gmail.com
Fri Oct 5 08:00:05 EDT 2018

Hi all,

Using pacemaker 1.1.18-11 and the mysql resource agent,
I run into some unwanted behaviour. Unwanted from my point of view, of
course; maybe it is expected to behave this way, which is why I am asking.

# My test case is the following:

Everything is OK on my cluster; crm_mon output is as below (no failed
actions):

 Master/Slave Set: ms_mysql-master [ms_mysql]
     Masters: [ db-master ]
     Slaves: [ db-slave ]

1. I insert a row in a table on the master; no issue, the data is replicated.
2. I shut down the network interface on the master (a VM); pacemaker
correctly promotes the other node. The old master is seen as offline, and
db-slave is now the master:

 Master/Slave Set: ms_mysql-master [ms_mysql]
     Masters: [ db-slave ]

3. I bring the network interface back up; pacemaker sees the node online
and sets the old master up as the new slave:

 Master/Slave Set: ms_mysql-master [ms_mysql]
     Masters: [ db-slave ]
     Slaves: [ db-master ]

4. From this point, my external monitoring bash script shows that the SQL
and IO replication threads are not running, but I can't see any error in
the pcs status / crm_mon output. The consequence is that I keep inserting
on the newly promoted master, but the data is never replicated to my
former master.
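The check my external script performs can be sketched as follows (a
minimal, self-contained sketch: the here-doc stands in for real
"SHOW SLAVE STATUS\G" output from mysql -e, trimmed to the two relevant
fields, and reproduces the broken state I see):

```shell
#!/bin/sh
# Sketch of a replication health check: parse SHOW SLAVE STATUS output
# and verify both replication threads are running. In production the
# status would come from something like:
#   mysql -h localhost -u root -pmysqlrootpw -e "SHOW SLAVE STATUS\G"
# Here a canned sample (matching the broken state described above) is
# used so the script is self-contained.
status=$(cat <<'EOF'
             Slave_IO_State:
                Master_Host:
           Slave_IO_Running: No
          Slave_SQL_Running: No
EOF
)

# Extract the thread states; the field names are the standard ones
# printed by SHOW SLAVE STATUS\G.
io=$(printf '%s\n' "$status"  | awk -F': ' '/Slave_IO_Running/  {print $2}')
sql=$(printf '%s\n' "$status" | awk -F': ' '/Slave_SQL_Running/ {print $2}')

if [ "$io" = "Yes" ] && [ "$sql" = "Yes" ]; then
    echo "replication OK"
else
    echo "replication BROKEN (IO=$io SQL=$sql)"
fi
```

With the sample above it reports replication as broken, which is exactly
what my script sees while crm_mon stays green.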

# Questions :

- Is this some kind of safety behaviour to avoid data corruption when a
node comes back online?
- When I try to start replication manually, the same way the resource agent does, it returns this error:

mysql -h localhost -u user-repl -pmysqlreplpw -e "START SLAVE"
ERROR 1200 (HY000) at line 1: Misconfigured slave: MASTER_HOST was not set;
Fix in config file or with CHANGE MASTER TO
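For reference, the manual fix for that error would be re-pointing the old
master at the new one, along these lines (a hedged sketch: the host and
credentials are taken from my resource parameters below, while the log
file and position are placeholders that must come from SHOW MASTER STATUS
on db-slave):

```sql
-- Hypothetical repair on the old master (db-master), assuming db-slave
-- is the current master. MASTER_LOG_FILE / MASTER_LOG_POS must be taken
-- from SHOW MASTER STATUS on db-slave; the values here are placeholders.
STOP SLAVE;
CHANGE MASTER TO
    MASTER_HOST     = 'db-slave',
    MASTER_USER     = 'user-repl',
    MASTER_PASSWORD = 'mysqlreplpw',
    MASTER_LOG_FILE = 'mysql-bin.000001',
    MASTER_LOG_POS  = 4;
START SLAVE;
```

My understanding is that the resource agent is supposed to issue this
CHANGE MASTER TO itself via its notify operation on promotion, so needing
to do it by hand is part of what puzzles me.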

- I would expect the cluster to stop the slave and show a failed action;
am I wrong here?

# Other details (not sure it matters a lot)

No stonith enabled, no fencing or auto-failback. Symmetric cluster.

Details of my pacemaker resource configuration are below:

 Master: ms_mysql-master
  Meta Attrs: master-node-max=1 clone_max=2 globally-unique=false
clone-node-max=1 notify=true
  Resource: ms_mysql (class=ocf provider=heartbeat type=mysql)
   Attributes: binary=/usr/bin/mysqld_safe config=/etc/my.cnf.d/server.cnf
datadir=/var/lib/mysql evict_outdated_slaves=false max_slave_lag=15
pid=/var/lib/mysql/mysql.pid replication_passwd=mysqlreplpw
replication_user=user-repl socket=/var/lib/mysql/mysql.sock
test_passwd=mysqlrootpw test_user=root
   Operations: demote interval=0s timeout=120 (ms_mysql-demote-interval-0s)
               monitor interval=20 timeout=30 (ms_mysql-monitor-interval-20)
               monitor interval=10 role=Master timeout=30
               monitor interval=30 role=Slave timeout=30
               notify interval=0s timeout=90 (ms_mysql-notify-interval-0s)
               promote interval=0s timeout=120
               start interval=0s timeout=120 (ms_mysql-start-interval-0s)
               stop interval=0s timeout=120 (ms_mysql-stop-interval-0s)

Is there anything I'm missing here? I did not find a clearly similar use
case when googling around network outages and pacemaker.
