[ClusterLabs] Mysql slave did not start replication after failure, and read-only IP also remained active on the much outdated slave

Mon Aug 22 09:41:37 UTC 2016

On Mon, Aug 22, 2016 at 12:18 PM, Attila Megyeri
<amegyeri at minerva-soft.com> wrote:
> Dear community,
>
>
>
> A few days ago we had an issue in our Mysql M/S replication cluster.
>
> We have a one R/W Master, and a one RO Slave setup. RO VIP is supposed to be
> running on the slave if it is not too much behind the master, and if any
> error occurs, RO VIP is moved to the master.
>
>
>
> Something happened with the slave Mysql (some disk issue, still
> investigating), but the problem is, that the slave VIP remained on the slave
> device, even though the slave process was not running, and the server was
> much outdated.
>
>
>
> During the issue the following log entries appeared (just an extract as it
> would be too long):
>
>
>
>
>
> Aug 20 02:04:07 ctdb1 corosync[1056]:   [MAIN  ] Corosync main process was
> not scheduled for 14088.5488 ms (threshold is 4000.0000 ms). Consider token
> timeout increase.
>
> Aug 20 02:04:07 ctdb1 corosync[1056]:   [TOTEM ] A processor failed, forming
> new configuration.
>
> Aug 20 02:04:34 ctdb1 corosync[1056]:   [MAIN  ] Corosync main process was
> not scheduled for 27065.2559 ms (threshold is 4000.0000 ms). Consider token
> timeout increase.
>
> Aug 20 02:04:34 ctdb1 corosync[1056]:   [TOTEM ] A new membership (xxx:6720)
> was formed. Members left: 168362243 168362281 168362282 168362301 168362302
> 168362311 168362312 1
>
> Aug 20 02:04:34 ctdb1 corosync[1056]:   [TOTEM ] A new membership (xxx:6724)
> was formed. Members
>
> ..
>
> Aug 20 02:13:28 ctdb1 corosync[1056]:   [MAIN  ] Completed service
> synchronization, ready to provide service.
>
> ..
>
> Aug 20 02:13:29 ctdb1 attrd[1584]:   notice: attrd_trigger_update: Sending
> flush op to all hosts for: readable (1)
>
> …
>
> Aug 20 02:13:32 ctdb1 mysql(db-mysql)[10492]: INFO: post-demote notification
> for ctdb1
>
> Aug 20 02:13:32 ctdb1 IPaddr2(db-ip-master)[10490]: INFO: IP status = ok,
> IP_CIP=
>
> Aug 20 02:13:32 ctdb1 crmd[1586]:   notice: process_lrm_event: LRM operation
> db-ip-master_stop_0 (call=371, rc=0, cib-update=179, confirmed=true) ok
>
> Aug 20 02:13:32 ctdb1 IPaddr2(db-ip-slave)[10620]: INFO: Adding inet address
> xxx/24 with broadcast address xxxx to device eth0
>
> Aug 20 02:13:32 ctdb1 IPaddr2(db-ip-slave)[10620]: INFO: Bringing device
> eth0 up
>
> Aug 20 02:13:32 ctdb1 IPaddr2(db-ip-slave)[10620]: INFO:
> /usr/lib/heartbeat/send_arp -i 200 -r 5 -p
> /usr/var/run/resource-agents/send_arp-xxx eth0 xxx auto not_used not_used
>
> Aug 20 02:13:32 ctdb1 crmd[1586]:   notice: process_lrm_event: LRM operation
> db-ip-slave_start_0 (call=377, rc=0, cib-update=180, confirmed=true) ok
>
> Aug 20 02:13:32 ctdb1 crmd[1586]:   notice: process_lrm_event: LRM operation
> db-ip-slave_monitor_20000 (call=380, rc=0, cib-update=181, confirmed=false)
> ok
>
> Aug 20 02:13:32 ctdb1 crmd[1586]:   notice: process_lrm_event: LRM operation
> db-mysql_notify_0 (call=374, rc=0, cib-update=0, confirmed=true) ok
>
> Aug 20 02:13:32 ctdb1 attrd[1584]:   notice: attrd_trigger_update: Sending
> flush op to all hosts for: master-db-mysql (1)
>
> Aug 20 02:13:32 ctdb1 attrd[1584]:   notice: attrd_perform_update: Sent
> update 1622: master-db-mysql=1
>
> Aug 20 02:13:32 ctdb1 crmd[1586]:   notice: process_lrm_event: LRM operation
> db-mysql_demote_0 (call=384, rc=0, cib-update=182, confirmed=true) ok
>
> Aug 20 02:13:33 ctdb1 mysql(db-mysql)[11160]: INFO: Ignoring post-demote
> notification for my own demotion.
>
> Aug 20 02:13:33 ctdb1 crmd[1586]:   notice: process_lrm_event: LRM operation
> db-mysql_notify_0 (call=387, rc=0, cib-update=0, confirmed=true) ok
>
> Aug 20 02:13:33 ctdb1 mysql(db-mysql)[11185]: ERROR: check_slave invoked on
> an instance that is not a replication slave.
>
> Aug 20 02:13:33 ctdb1 crmd[1586]:   notice: process_lrm_event: LRM operation
> db-mysql_monitor_7000 (call=390, rc=0, cib-update=183, confirmed=false) ok
>
> Aug 20 02:13:33 ctdb1 ntpd[1560]: Listen normally on 16 eth0 xxxx. UDP 123
>
> Aug 20 02:13:33 ctdb1 ntpd[1560]: Deleting interface #12 eth0, xxx#123,
> interface stats: received=0, sent=0, dropped=0, active_time=2637334 secs
>
> Aug 20 02:13:33 ctdb1 ntpd[1560]: peers refreshed
>
> Aug 20 02:13:33 ctdb1 ntpd[1560]: new interface(s) found: waking up resolver
>
> Aug 20 02:13:40 ctdb1 mysql(db-mysql)[11224]: ERROR: check_slave invoked on
> an instance that is not a replication slave.
>
> Aug 20 02:13:47 ctdb1 mysql(db-mysql)[11263]: ERROR: check_slave invoked on
> an instance that is not a replication slave.
>
>
>
> And from this time, the last two lines repeat every 7 seconds (mysql
> monitoring interval)
>
>
>
>
>
> The expected behavior was that the slave (RO) VIP should have been moved to
> the master, as the secondary db was outdated.
>
> Unfortunately I cannot recall what crm_mon was showing when the issue was
> present, but I am sure that the RA did not handle the situation properly.
>
>
>
> Placing the slave node into standby and the online resolved the issue
> immediately (Slave started to sync, and in  a few minutes it catched up the
> master).
>
>
>
>
>
> Here is the relevant config from the configuration:
>
>
>
>
>
> primitive db-ip-master ocf:heartbeat:IPaddr2 \
>
>                 params lvs_support="true" ip="XXX" cidr_netmask="24"
> broadcast="XXX" \
>
>                 op start interval="0" timeout="20s" on-fail="restart" \
>
>                 op monitor interval="20s" timeout="20s" \
>
>                 op stop interval="0" timeout="20s" on-fail="block"
>
>
>
> primitive db-ip-slave ocf:heartbeat:IPaddr2 \
>
>                 params lvs_support="true" ip="XXX" cidr_netmask="24"
> broadcast="XXX" \
>
>                 op start interval="0" timeout="20s" on-fail="restart" \
>
>                 op monitor interval="20s" timeout="20s" \
>
>                 op stop interval="0" timeout="20s" on-fail="block" \
>
>                 meta target-role="Started"
>
>
>
> primitive db-mysql ocf:heartbeat:mysql \
>
>                 params binary="/usr/bin/mysqld_safe"
> config="/etc/mysql/my.cnf" datadir="/var/lib/mysql" user="mysql"
> pid="/var/run/mysqld/mysqld.pid" socket="/var/run/mysqld/mysqld.sock"
> test_passwd="XXX" test_table="XXX" test_user="XXX" replication_user="XXX"
> replication_passwd="XXX" additional_parameters="--skip-slave-start" \
>
>                 op start interval="0" timeout="240s" on-fail="restart" \
>
>                 op stop interval="0" timeout="120s" on-fail="block" \
>
>                 op monitor interval="7" timeout="30s" on-fail="restart"
> OCF_CHECK_LEVEL="1" \
>
>                 op promote interval="0" timeout="120" on-fail="restart" \
>
>                 op demote interval="0" timeout="120" on-fail="block"
>
>
>
> ms mysql db-mysql \
>
>                 meta notify="true" master-max="1" clone-max="2"
> target-role="Started" is-managed="true"
>
>
>
> location db-ip-m-1 db-ip-master 0: ctdb1
>
> location db-ip-m-2 db-ip-master 0: ctdb2
>
> location db-ip-s-1 db-ip-slave 0: ctdb1
>
> location db-ip-s-2 db-ip-slave 0: ctdb2
>
> location db-ip-s-readable db-ip-slave \
>
>                 rule $id="rule-no-reader-slave" -inf: readable lt 1
>

How long did you wait? Conditions are reevaluated every
cluster-recheck-interval which is 15 minutes by default.

> location db-mysql-loc-1 mysql 100: ctdb1
>
> location db-mysql-loc-2 mysql 100: ctdb2
>
>
>
> colocation db-ip-slave-master -50: db-ip-slave db-ip-master
>
> colocation db-ip-with-master inf: db-ip-master mysql:Master
>
> colocation db-slave-on-db inf: db-ip-slave mysql
>
> order master-after-db inf: mysql db-ip-master
>
> order slave-after-db inf: mysql db-ip-slave
>
>
>
> property $id="cib-bootstrap-options" \
>
>                 dc-version="1.1.10-42f2063" \
>
>                 cluster-infrastructure="corosync" \
>
>                 symmetric-cluster="false" \
>
>                 cluster-recheck-interval="2m" \
>
>                 no-quorum-policy="stop" \
>
>                 stop-orphan-resources="false" \
>
>                 start-failure-is-fatal="false" \
>
>                 maintenance-mode="false"
>
> property $id="mysql_replication" \
>
>                 db-mysql_REPL_INFO="ctdb2|mysql-bin.002928|107"
>
> rsc_defaults $id="rsc-options" \
>
>                 resource-stickiness="0"
>
>
>
>
>
>
>
> Do you have any hints what could have gone wrong, and how we could avoid
> such issues in the future?
>
>
>
> Versions:
>
> Ubuntu Trusty Tahr
>
> Pacemaker 1.1.10
>
> Corosync 2.3.3
>
> Resource agents 3.9.3
>
>
>
>
>
> Thanks a lot in advance,
>
>
>
> Attila
>
>
>
>
> _______________________________________________
> Users mailing list: Users at clusterlabs.org
> http://clusterlabs.org/mailman/listinfo/users
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
>