[ClusterLabs] Questions about pacemaker/ mysql resource agent behaviour when network fail

Andrei Borzenkov arvidjaar at gmail.com
Sat Oct 6 00:13:10 EDT 2018


05.10.2018 15:00, Simon Bomm wrote:
> Hi all,
> 
> Using pacemaker 1.1.18-11 and mysql resource agent (
> https://github.com/ClusterLabs/resource-agents/blob/RHEL6/heartbeat/mysql),
> I run into behaviour I did not expect. That is only my point of view, of
> course; it may be working as designed, which is why I am asking.
> 
> # My test case is the following :
> 
> Everything is OK on my cluster, crm_mon output is as below (no failed
> actions)
> 
>  Master/Slave Set: ms_mysql-master [ms_mysql]
>      Masters: [ db-master ]
>      Slaves: [ db-slave ]
> 
> 1. I insert into a table on the master; no issue, the data is replicated.
> 2. I shut down the network interface on the master (a VM),

What exactly does it mean? How do you shut down net?

> pacemaker correctly fails over to the
> other node. The master is seen as offline, and db-slave is now the master:
> 
>  Master/Slave Set: ms_mysql-master [ms_mysql]
>      Masters: [ db-slave ]
> 
> 3. I bring my network interface back up; pacemaker sees the node online and
> sets the old master as the new slave:
> 
>  Master/Slave Set: ms_mysql-master [ms_mysql]
>      Masters: [ db-slave ]
>      Slaves: [ db-master ]
> 
> 4. From this point, my external monitoring bash script shows that the SQL and
> IO threads are not running, but I can't see any error in the pcs
> status/crm_mon outputs.

Pacemaker just shows what the resource agents report. If the resource agent
claims the resource is started, there is nothing pacemaker can do. You need
to debug what the resource agent does.
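
One way to cross-check what the agent reports is to look at the
replication threads directly. The following is only a sketch under stated
assumptions, not what the mysql resource agent itself does:
replication_threads_ok is a hypothetical helper that reads the text of
SHOW SLAVE STATUS\G on standard input and succeeds only when both
Slave_IO_Running and Slave_SQL_Running are "Yes".

```shell
#!/bin/sh
# Hypothetical helper (not part of the mysql resource agent).
# Reads the output of `SHOW SLAVE STATUS\G` on standard input.
# Exit status 0: both replication threads report "Yes".
# Exit status 1: anything else, including empty input, which is what a
# server with no replication configured returns.
replication_threads_ok() {
    awk -F': *' '
        # The trailing colon keeps Slave_SQL_Running_State from matching.
        /^[[:space:]]*Slave_IO_Running:/  { io = $2 }
        /^[[:space:]]*Slave_SQL_Running:/ { sql = $2 }
        END { exit !(io == "Yes" && sql == "Yes") }
    '
}
```

Piping the mysql client into it, for example
mysql -e 'SHOW SLAVE STATUS\G' | replication_threads_ok (with whatever
credentials apply), lets you compare the result with what crm_mon shows at
the same moment.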

> The consequence is that I keep inserting on my newly
> promoted master, but the data is never consumed by my former master.
> 
> # Questions :
> 
> - Is this some kind of safety behaviour to avoid data corruption when a
> node comes back online?
> - When I try to start replication manually, as the OCF agent does, it returns this error:
> 
> mysql -h localhost -u user-repl -pmysqlreplpw -e "START SLAVE"
> ERROR 1200 (HY000) at line 1: Misconfigured slave: MASTER_HOST was not set;
> Fix in config file or with CHANGE MASTER TO
> 
> - I would expect the cluster to stop the slave and show a failed action; am
> I wrong here?
> 

I am not familiar with this specific application and its structure. From
a quick look, the monitor action mostly checks for a running process. Is the
MySQL process running?
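
A quick way to answer that by hand is to test the pid file from your
configuration (pid=/var/lib/mysql/mysql.pid). This is a minimal sketch of
that kind of check, assuming the pid file holds a single numeric PID;
mysql_pid_alive is a hypothetical helper name and is not taken from the
resource agent.

```shell
#!/bin/sh
# Hypothetical helper: report whether the process named in a pid file exists.
# Exit status 0: a process with that PID exists and can be signalled.
# Exit status 1: pid file missing, unreadable, empty, non-numeric, or the
# process is gone. Run it as root or as the mysql user; otherwise kill -0
# can fail with a permission error even though the process is alive.
mysql_pid_alive() {
    pidfile=$1
    [ -r "$pidfile" ] || return 1
    pid=$(cat "$pidfile")
    case $pid in
        ''|*[!0-9]*) return 1 ;;   # reject empty or non-numeric content
    esac
    kill -0 "$pid" 2>/dev/null     # signal 0 only tests for existence
}
```

For example, mysql_pid_alive /var/lib/mysql/mysql.pid && echo running.
Note that a running mysqld says nothing about whether replication works,
which would be consistent with the cluster showing no failed action.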

> # Other details (not sure it matters a lot)
> 
> No stonith enabled, no fencing or auto-failback.

How are you going to resolve split-brain without stonith? "Stopping net"
sounds exactly like split brain, in which case further investigation is
rather pointless.

Anyway, to give a non-hypothetical answer, the full configuration and logs
from both systems are needed.

> Symmetric cluster
> configured.
> 
> The details of my pacemaker resource configuration are:
> 
>  Master: ms_mysql-master
>   Meta Attrs: master-node-max=1 clone_max=2 globally-unique=false
> clone-node-max=1 notify=true
>   Resource: ms_mysql (class=ocf provider=heartbeat type=mysql)
>    Attributes: binary=/usr/bin/mysqld_safe config=/etc/my.cnf.d/server.cnf
> datadir=/var/lib/mysql evict_outdated_slaves=false max_slave_lag=15
> pid=/var/lib/mysql/mysql.pid replication_passwd=mysqlreplpw
> replication_user=user-repl socket=/var/lib/mysql/mysql.sock
> test_passwd=mysqlrootpw test_user=root
>    Operations: demote interval=0s timeout=120 (ms_mysql-demote-interval-0s)
>                monitor interval=20 timeout=30 (ms_mysql-monitor-interval-20)
>                monitor interval=10 role=Master timeout=30
> (ms_mysql-monitor-interval-10)
>                monitor interval=30 role=Slave timeout=30
> (ms_mysql-monitor-interval-30)
>                notify interval=0s timeout=90 (ms_mysql-notify-interval-0s)
>                promote interval=0s timeout=120
> (ms_mysql-promote-interval-0s)
>                start interval=0s timeout=120 (ms_mysql-start-interval-0s)
>                stop interval=0s timeout=120 (ms_mysql-stop-interval-0s)
> 
> Is there anything I'm missing here? I did not find a clearly similar use case
> when googling around network outages and pacemaker.
> 
> Thanks
> 
> 
> 



