[ClusterLabs] MySQL slave did not start replication after failure, and the read-only IP also remained active on the badly outdated slave

Attila Megyeri amegyeri at minerva-soft.com
Mon Aug 22 05:18:36 EDT 2016


Dear community,

A few days ago we had an issue in our MySQL M/S replication cluster.
We have a setup with one R/W master and one RO slave. The RO VIP is supposed to run on the slave as long as it is not too far behind the master; if any error occurs, the RO VIP is moved to the master.

Something happened to the slave MySQL instance (some disk issue, still investigating), but the problem is that the slave VIP remained on the slave node even though the slave threads were not running and the server was badly outdated.

During the issue the following log entries appeared (just an extract, as the full log would be too long):


Aug 20 02:04:07 ctdb1 corosync[1056]:   [MAIN  ] Corosync main process was not scheduled for 14088.5488 ms (threshold is 4000.0000 ms). Consider token timeout increase.
Aug 20 02:04:07 ctdb1 corosync[1056]:   [TOTEM ] A processor failed, forming new configuration.
Aug 20 02:04:34 ctdb1 corosync[1056]:   [MAIN  ] Corosync main process was not scheduled for 27065.2559 ms (threshold is 4000.0000 ms). Consider token timeout increase.
Aug 20 02:04:34 ctdb1 corosync[1056]:   [TOTEM ] A new membership (xxx:6720) was formed. Members left: 168362243 168362281 168362282 168362301 168362302 168362311 168362312 1
Aug 20 02:04:34 ctdb1 corosync[1056]:   [TOTEM ] A new membership (xxx:6724) was formed. Members
..
Aug 20 02:13:28 ctdb1 corosync[1056]:   [MAIN  ] Completed service synchronization, ready to provide service.
..
Aug 20 02:13:29 ctdb1 attrd[1584]:   notice: attrd_trigger_update: Sending flush op to all hosts for: readable (1)
...
Aug 20 02:13:32 ctdb1 mysql(db-mysql)[10492]: INFO: post-demote notification for ctdb1
Aug 20 02:13:32 ctdb1 IPaddr2(db-ip-master)[10490]: INFO: IP status = ok, IP_CIP=
Aug 20 02:13:32 ctdb1 crmd[1586]:   notice: process_lrm_event: LRM operation db-ip-master_stop_0 (call=371, rc=0, cib-update=179, confirmed=true) ok
Aug 20 02:13:32 ctdb1 IPaddr2(db-ip-slave)[10620]: INFO: Adding inet address xxx/24 with broadcast address xxxx to device eth0
Aug 20 02:13:32 ctdb1 IPaddr2(db-ip-slave)[10620]: INFO: Bringing device eth0 up
Aug 20 02:13:32 ctdb1 IPaddr2(db-ip-slave)[10620]: INFO: /usr/lib/heartbeat/send_arp -i 200 -r 5 -p /usr/var/run/resource-agents/send_arp-xxx eth0 xxx auto not_used not_used
Aug 20 02:13:32 ctdb1 crmd[1586]:   notice: process_lrm_event: LRM operation db-ip-slave_start_0 (call=377, rc=0, cib-update=180, confirmed=true) ok
Aug 20 02:13:32 ctdb1 crmd[1586]:   notice: process_lrm_event: LRM operation db-ip-slave_monitor_20000 (call=380, rc=0, cib-update=181, confirmed=false) ok
Aug 20 02:13:32 ctdb1 crmd[1586]:   notice: process_lrm_event: LRM operation db-mysql_notify_0 (call=374, rc=0, cib-update=0, confirmed=true) ok
Aug 20 02:13:32 ctdb1 attrd[1584]:   notice: attrd_trigger_update: Sending flush op to all hosts for: master-db-mysql (1)
Aug 20 02:13:32 ctdb1 attrd[1584]:   notice: attrd_perform_update: Sent update 1622: master-db-mysql=1
Aug 20 02:13:32 ctdb1 crmd[1586]:   notice: process_lrm_event: LRM operation db-mysql_demote_0 (call=384, rc=0, cib-update=182, confirmed=true) ok
Aug 20 02:13:33 ctdb1 mysql(db-mysql)[11160]: INFO: Ignoring post-demote notification for my own demotion.
Aug 20 02:13:33 ctdb1 crmd[1586]:   notice: process_lrm_event: LRM operation db-mysql_notify_0 (call=387, rc=0, cib-update=0, confirmed=true) ok
Aug 20 02:13:33 ctdb1 mysql(db-mysql)[11185]: ERROR: check_slave invoked on an instance that is not a replication slave.
Aug 20 02:13:33 ctdb1 crmd[1586]:   notice: process_lrm_event: LRM operation db-mysql_monitor_7000 (call=390, rc=0, cib-update=183, confirmed=false) ok
Aug 20 02:13:33 ctdb1 ntpd[1560]: Listen normally on 16 eth0 xxxx. UDP 123
Aug 20 02:13:33 ctdb1 ntpd[1560]: Deleting interface #12 eth0, xxx#123, interface stats: received=0, sent=0, dropped=0, active_time=2637334 secs
Aug 20 02:13:33 ctdb1 ntpd[1560]: peers refreshed
Aug 20 02:13:33 ctdb1 ntpd[1560]: new interface(s) found: waking up resolver
Aug 20 02:13:40 ctdb1 mysql(db-mysql)[11224]: ERROR: check_slave invoked on an instance that is not a replication slave.
Aug 20 02:13:47 ctdb1 mysql(db-mysql)[11263]: ERROR: check_slave invoked on an instance that is not a replication slave.

From this point on, the last two lines repeated every 7 seconds (the mysql monitor interval).
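
For reference, the manual equivalent of what the agent is checking can be run on the slave roughly like this (credentials are placeholders, and this is only the hand-run counterpart, not the exact command the RA executes):

mysql -u root -p -e 'SHOW SLAVE STATUS\G' | \
    egrep 'Slave_IO_Running|Slave_SQL_Running|Seconds_Behind_Master|Last_Error'

In our case this presumably came back empty, which as far as I can tell is what makes check_slave report "not a replication slave".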


The expected behavior would have been for the slave (RO) VIP to move to the master, since the secondary DB was outdated.
Unfortunately I cannot recall what crm_mon was showing while the issue was present, but I am sure that the RA did not handle the situation properly.

Placing the slave node into standby and then back online resolved the issue immediately (the slave started to sync and caught up with the master within a few minutes).
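
In other words, roughly the following, with ctdb1 only as an example for the slave node:

crm node standby ctdb1
# wait for the resources on the node to stop, then:
crm node online ctdb1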


Here is the relevant part of the configuration:


primitive db-ip-master ocf:heartbeat:IPaddr2 \
                params lvs_support="true" ip="XXX" cidr_netmask="24" broadcast="XXX" \
                op start interval="0" timeout="20s" on-fail="restart" \
                op monitor interval="20s" timeout="20s" \
                op stop interval="0" timeout="20s" on-fail="block"

primitive db-ip-slave ocf:heartbeat:IPaddr2 \
                params lvs_support="true" ip="XXX" cidr_netmask="24" broadcast="XXX" \
                op start interval="0" timeout="20s" on-fail="restart" \
                op monitor interval="20s" timeout="20s" \
                op stop interval="0" timeout="20s" on-fail="block" \
                meta target-role="Started"

primitive db-mysql ocf:heartbeat:mysql \
                params binary="/usr/bin/mysqld_safe" config="/etc/mysql/my.cnf" datadir="/var/lib/mysql" user="mysql" pid="/var/run/mysqld/mysqld.pid" socket="/var/run/mysqld/mysqld.sock" test_passwd="XXX" test_table="XXX" test_user="XXX" replication_user="XXX" replication_passwd="XXX" additional_parameters="--skip-slave-start" \
                op start interval="0" timeout="240s" on-fail="restart" \
                op stop interval="0" timeout="120s" on-fail="block" \
                op monitor interval="7" timeout="30s" on-fail="restart" OCF_CHECK_LEVEL="1" \
                op promote interval="0" timeout="120" on-fail="restart" \
                op demote interval="0" timeout="120" on-fail="block"

ms mysql db-mysql \
                meta notify="true" master-max="1" clone-max="2" target-role="Started" is-managed="true"

location db-ip-m-1 db-ip-master 0: ctdb1
location db-ip-m-2 db-ip-master 0: ctdb2
location db-ip-s-1 db-ip-slave 0: ctdb1
location db-ip-s-2 db-ip-slave 0: ctdb2
location db-ip-s-readable db-ip-slave \
                rule $id="rule-no-reader-slave" -inf: readable lt 1
location db-mysql-loc-1 mysql 100: ctdb1
location db-mysql-loc-2 mysql 100: ctdb2

colocation db-ip-slave-master -50: db-ip-slave db-ip-master
colocation db-ip-with-master inf: db-ip-master mysql:Master
colocation db-slave-on-db inf: db-ip-slave mysql
order master-after-db inf: mysql db-ip-master
order slave-after-db inf: mysql db-ip-slave

property $id="cib-bootstrap-options" \
                dc-version="1.1.10-42f2063" \
                cluster-infrastructure="corosync" \
                symmetric-cluster="false" \
                cluster-recheck-interval="2m" \
                no-quorum-policy="stop" \
                stop-orphan-resources="false" \
                start-failure-is-fatal="false" \
                maintenance-mode="false"
property $id="mysql_replication" \
                db-mysql_REPL_INFO="ctdb2|mysql-bin.002928|107"
rsc_defaults $id="rsc-options" \
                resource-stickiness="0"
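
For clarity, my understanding of the rule-no-reader-slave rule above: the mysql RA maintains a transient "readable" node attribute, and as soon as it drops below 1 on a node, db-ip-slave gets a -INF score there and should move away. The attribute can be queried (or set by hand for testing) roughly like this, with the node name only as an example:

crm_attribute --query --node ctdb1 --name readable --lifetime reboot
crm_attribute --node ctdb1 --name readable --update 0 --lifetime reboot   # simulate "not readable"

What I cannot explain is why "readable" apparently stayed at 1 even though check_slave kept failing.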



Do you have any hints as to what could have gone wrong, and how we could avoid such issues in the future?

Versions:
Ubuntu Trusty Tahr
Pacemaker 1.1.10
Corosync 2.3.3
Resource agents 3.9.3


Thanks a lot in advance,

Attila
