[ClusterLabs] Questions about pacemaker/ mysql resource agent behaviour when network fail

Andrei Borzenkov arvidjaar at gmail.com
Wed Oct 10 17:35:29 UTC 2018


On 10.10.2018 13:18, Simon Bomm wrote:
> On Sat, 6 Oct 2018 at 06:13, Andrei Borzenkov <arvidjaar at gmail.com>
> wrote:
> 
>> On 05.10.2018 15:00, Simon Bomm wrote:
>>> Hi all,
>>>
>>> Using pacemaker 1.1.18-11 and mysql resource agent (
>>>
>> https://github.com/ClusterLabs/resource-agents/blob/RHEL6/heartbeat/mysql
>> ),
>>> I run into unwanted behaviour. From my point of view, of course; maybe it's
>>> expected to work this way, which is why I ask.
>>>
>>> # My test case is the following :
>>>
>>> Everything is OK on my cluster, crm_mon output is as below (no failed
>>> actions)
>>>
>>>  Master/Slave Set: ms_mysql-master [ms_mysql]
>>>      Masters: [ db-master ]
>>>      Slaves: [ db-slave ]
>>>
>>> 1. I insert into a table on the master; data is replicated with no issue.
>>> 2. I shut down the network interface on the master (a VM),
>>
>>
> First, thanks for taking the time to answer me.
> 
> 
>> What exactly does that mean? How do you shut down the network?
>>
>>
> I disconnect the network card from the VMware vSphere console.
> 
> 
>>> pacemaker correctly starts the resource on the
>>> other node. The master is seen as offline, and db-slave is now the master:
>>>
>>>  Master/Slave Set: ms_mysql-master [ms_mysql]
>>>      Masters: [ db-slave ]
>>>
>>> 3. I bring my network interface back up; pacemaker sees the node online and
>>> sets the old master as the new slave:
>>>
>>>  Master/Slave Set: ms_mysql-master [ms_mysql]
>>>      Masters: [ db-slave ]
>>>      Slaves: [ db-master ]
>>>
>>> 4. From this point, my external monitoring bash script shows that the SQL
>>> and IO threads are not running, but I can't see any error in the pcs
>>> status/crm_mon output.
>>
>> Pacemaker just shows what resource agents claim. If the resource agent
>> claims the resource is started, there is nothing pacemaker can do. You need
>> to debug what the resource agent does.
>>
>>
> I've debugged it quite a lot, and that's what drove me to isolate the error
> below:
> 
>> mysql -h localhost -u user-repl -pmysqlreplpw -e "START SLAVE"
>> ERROR 1200 (HY000) at line 1: Misconfigured slave: MASTER_HOST was not set;
>> Fix in config file or with CHANGE MASTER TO
> 
> 
> 
>>> The consequence is that I continue inserting on my newly promoted master,
>>> but the data is never replicated to my former master.
>>>
>>> # Questions :
>>>
>>> - Is this some kind of safety behaviour to avoid data corruption when a
>>> node is back online?
>>> - When I manually run the same command the OCF agent uses, it returns
>>> this error:
>>>
>>> mysql -h localhost -u user-repl -pmysqlreplpw -e "START SLAVE"
>>> ERROR 1200 (HY000) at line 1: Misconfigured slave: MASTER_HOST was not set;
>>> Fix in config file or with CHANGE MASTER TO
>>>
>>> - I would expect the cluster to stop the slave and show a failed action;
>>> am I wrong here?
>>>
>>
>> I am not familiar with this specific application and its structure. From
>> quick browsing, the monitor action mostly checks for a running process. Is
>> the MySQL process running?
>>
> 
> Yes, it is. As you mentioned previously, the config has pacemaker start the
> mysql resource, so no problem there.
> 
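
A quick way to double-check what the monitor sees (a sketch of the sort of
process check the agent performs, not its exact code; the pid file path is
taken from the configuration below):

    pid=$(cat /var/lib/mysql/mysql.pid) && kill -0 "$pid" \
        && echo "mysqld (pid $pid) is running"
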
>>
>>> # Other details (not sure it matters a lot)
>>>
>>> No stonith enabled, no fencing or auto-failback.
>>
>> How are you going to resolve split-brain without stonith? "Stopping net"
>> sounds exactly like split brain, in which case further investigation is
>> rather pointless.
>>
>>
> You make a good point. As I'm not very familiar with stonithd, I initially
> disabled it to avoid unwanted behaviour, but I'll definitely follow your
> advice and dig around.
> 
> 
>> Anyway, to give a non-hypothetical answer, the full configuration and logs
>> from both systems are needed.
>>
>>
> Sure, please find the full configuration below:
> 
> Cluster Name: app_cluster
> Corosync Nodes:
>  app-central-master app-central-slave app-db-master app-db-slave app-quorum
> Pacemaker Nodes:
>  app-central-master app-central-slave app-db-master app-db-slave app-quorum
> 
> Resources:
>  Master: ms_mysql-master
>   Meta Attrs: master-node-max=1 clone_max=2 globally-unique=false
> clone-node-max=1 notify=true
>   Resource: ms_mysql (class=ocf provider=heartbeat type=mysql-app)
>    Attributes: binary=/usr/bin/mysqld_safe config=/etc/my.cnf.d/server.cnf
> datadir=/var/lib/mysql evict_outdated_slaves=false max_slave_lag=15
> pid=/var/lib/mysql/mysql.pid replication_passwd=mysqlreplpw
> replication_user=app-repl socket=/var/lib/mysql/mysql.sock
> test_passwd=mysqlrootpw test_user=root
>    Operations: demote interval=0s timeout=120 (ms_mysql-demote-interval-0s)
>                monitor interval=20 timeout=30 (ms_mysql-monitor-interval-20)
>                monitor interval=10 role=Master timeout=30
> (ms_mysql-monitor-interval-10)
>                monitor interval=30 role=Slave timeout=30
> (ms_mysql-monitor-interval-30)
>                notify interval=0s timeout=90 (ms_mysql-notify-interval-0s)
>                promote interval=0s timeout=120
> (ms_mysql-promote-interval-0s)
>                start interval=0s timeout=120 (ms_mysql-start-interval-0s)
>                stop interval=0s timeout=120 (ms_mysql-stop-interval-0s)
>  Resource: vip_mysql (class=ocf provider=heartbeat type=IPaddr2-app)
>   Attributes: broadcast=10.30.255.255 cidr_netmask=16 flush_routes=true
> ip=10.30.3.229 nic=ens160
>   Operations: monitor interval=10s timeout=20s
> (vip_mysql-monitor-interval-10s)
>               start interval=0s timeout=20s (vip_mysql-start-interval-0s)
>               stop interval=0s timeout=20s (vip_mysql-stop-interval-0s)
>  Group: app
>   Resource: misc_app (class=ocf provider=heartbeat type=misc-app)
>    Attributes: crondir=/etc/app-failover/resources/cron/,/etc/cron.d/
>    Meta Attrs: target-role=started
>    Operations: monitor interval=5s timeout=20s
> (misc_app-monitor-interval-5s)
>                start interval=0s timeout=20s (misc_app-start-interval-0s)
>                stop interval=0s timeout=20s (misc_app-stop-interval-0s)
>   Resource: cbd_central_broker (class=ocf provider=heartbeat
> type=cbd-central-broker)
>    Meta Attrs: target-role=started
>    Operations: monitor interval=5s timeout=20s
> (cbd_central_broker-monitor-interval-5s)
>                start interval=0s timeout=90s
> (cbd_central_broker-start-interval-0s)
>                stop interval=0s timeout=90s
> (cbd_central_broker-stop-interval-0s)
>   Resource: centcore (class=ocf provider=heartbeat type=centcore)
>    Meta Attrs: target-role=started
>    Operations: monitor interval=5s timeout=20s
> (centcore-monitor-interval-5s)
>                start interval=0s timeout=90s (centcore-start-interval-0s)
>                stop interval=0s timeout=90s (centcore-stop-interval-0s)
>   Resource: apptrapd (class=ocf provider=heartbeat type=apptrapd)
>    Meta Attrs: target-role=started
>    Operations: monitor interval=5s timeout=20s
> (apptrapd-monitor-interval-5s)
>                start interval=0s timeout=90s (apptrapd-start-interval-0s)
>                stop interval=0s timeout=90s (apptrapd-stop-interval-0s)
>   Resource: app_central_sync (class=ocf provider=heartbeat
> type=app-central-sync)
>    Meta Attrs: target-role=started
>    Operations: monitor interval=5s timeout=20s
> (app_central_sync-monitor-interval-5s)
>                start interval=0s timeout=90s
> (app_central_sync-start-interval-0s)
>                stop interval=0s timeout=90s
> (app_central_sync-stop-interval-0s)
>   Resource: snmptrapd (class=ocf provider=heartbeat type=snmptrapd)
>    Meta Attrs: target-role=started
>    Operations: monitor interval=5s timeout=20s
> (snmptrapd-monitor-interval-5s)
>                start interval=0s timeout=90s (snmptrapd-start-interval-0s)
>                stop interval=0s timeout=90s (snmptrapd-stop-interval-0s)
>   Resource: http (class=ocf provider=heartbeat type=apacheapp)
>    Meta Attrs: target-role=started
>    Operations: monitor interval=5s timeout=20s (http-monitor-interval-5s)
>                start interval=0s timeout=40s (http-start-interval-0s)
>                stop interval=0s timeout=60s (http-stop-interval-0s)
>   Resource: vip_app (class=ocf provider=heartbeat type=IPaddr2-app)
>    Attributes: broadcast=10.30.255.255 cidr_netmask=16 flush_routes=true
> ip=10.30.3.230 nic=ens160
>    Meta Attrs: target-role=started
>    Operations: monitor interval=10s timeout=20s
> (vip_app-monitor-interval-10s)
>                start interval=0s timeout=20s (vip_app-start-interval-0s)
>                stop interval=0s timeout=20s (vip_app-stop-interval-0s)
>   Resource: centengine (class=ocf provider=heartbeat type=centengine)
>    Meta Attrs: multiple-active=stop_start target-role=started
>    Operations: monitor interval=5s timeout=20s
> (centengine-monitor-interval-5s)
>                start interval=0s timeout=90s (centengine-start-interval-0s)
>                stop interval=0s timeout=90s (centengine-stop-interval-0s)
> 
> Stonith Devices:
> Fencing Levels:
> 
> Location Constraints:
>   Resource: app
>     Disabled on: app-db-master (score:-INFINITY)
> (id:location-app-app-db-master--INFINITY)
>     Disabled on: app-db-slave (score:-INFINITY)
> (id:location-app-app-db-slave--INFINITY)
>   Resource: ms_mysql
>     Disabled on: app-central-master (score:-INFINITY)
> (id:location-ms_mysql-app-central-master--INFINITY)
>     Disabled on: app-central-slave (score:-INFINITY)
> (id:location-ms_mysql-app-central-slave--INFINITY)
>   Resource: vip_mysql
>     Disabled on: app-central-master (score:-INFINITY)
> (id:location-vip_mysql-app-central-master--INFINITY)
>     Disabled on: app-central-slave (score:-INFINITY)
> (id:location-vip_mysql-app-central-slave--INFINITY)
> Ordering Constraints:
> Colocation Constraints:
>   vip_mysql with ms_mysql-master (score:INFINITY) (rsc-role:Started)
> (with-rsc-role:Master)
>   ms_mysql-master with vip_mysql (score:INFINITY) (rsc-role:Master)
> (with-rsc-role:Started)
> Ticket Constraints:
> 
> Alerts:
>  No alerts defined
> 
> Resources Defaults:
>  resource-stickiness: INFINITY
> Operations Defaults:
>  No defaults set
> 
> Cluster Properties:
>  cluster-infrastructure: corosync
>  cluster-name: app_cluster
>  dc-version: 1.1.18-11.el7_5.3-2b07d5c5a9
>  have-watchdog: false
>  last-lrm-refresh: 1538740285
>  ms_mysql_REPL_INFO: app-db-master|mysql-bin.000012|327
>  stonith-enabled: false
>  symmetric-cluster: true
> Node Attributes:
>  app-quorum: standby=on
> 
> Quorum:
>   Options:
>   Device:
>     votes: 1
>     Model: net
>       algorithm: ffsplit
>       host: app-quorum
> 
> 
> Logs are below
> 
> Logs from the SLAVE when I disconnect the interface (the node is isolated),
> plus the associated crm_mon output; this looks good to me and I can follow
> the behaviour:
> 
> Oct 10 09:20:07 app-db-slave corosync[1055]: [TOTEM ] A processor failed,
> forming new configuration.
> Oct 10 09:20:11 app-db-slave corosync[1055]: [TOTEM ] A new membership (
> 10.30.3.245:196) was formed. Members left: 3
> Oct 10 09:20:11 app-db-slave corosync[1055]: [TOTEM ] Failed to receive the
> leave message. failed: 3
> Oct 10 09:20:11 app-db-slave corosync[1055]: [QUORUM] Members[4]: 1 2 4 5
> Oct 10 09:20:11 app-db-slave corosync[1055]: [MAIN  ] Completed service
> synchronization, ready to provide service.
> Oct 10 09:20:11 app-db-slave cib[1168]:  notice: Node app-db-master state
> is now lost
> Oct 10 09:20:11 app-db-slave attrd[1172]:  notice: Node app-db-master state
> is now lost
> Oct 10 09:20:11 app-db-slave attrd[1172]:  notice: Removing all
> app-db-master attributes for peer loss
> Oct 10 09:20:11 app-db-slave stonith-ng[1170]:  notice: Node app-db-master
> state is now lost
> Oct 10 09:20:11 app-db-slave pacemakerd[1084]:  notice: Node app-db-master
> state is now lost
> Oct 10 09:20:11 app-db-slave crmd[1175]:  notice: Node app-db-master state
> is now lost
> Oct 10 09:20:11 app-db-slave cib[1168]:  notice: Purged 1 peer with id=3
> and/or uname=app-db-master from the membership cache
> Oct 10 09:20:11 app-db-slave stonith-ng[1170]:  notice: Purged 1 peer with
> id=3 and/or uname=app-db-master from the membership cache
> Oct 10 09:20:11 app-db-slave attrd[1172]:  notice: Purged 1 peer with id=3
> and/or uname=app-db-master from the membership cache
> Oct 10 09:20:11 app-db-slave crmd[1175]:  notice: Result of notify
> operation for ms_mysql on app-db-slave: 0 (ok)
> Oct 10 09:20:12 app-db-slave mysql-app(ms_mysql)[21165]: INFO: app-db-slave
> promote is starting
> Oct 10 09:20:12 app-db-slave IPaddr2-app(vip_mysql)[21134]: INFO: Adding
> inet address 10.30.3.229/16 with broadcast address 10.30.255.255 to device
> ens160
> Oct 10 09:20:12 app-db-slave IPaddr2-app(vip_mysql)[21134]: INFO: Bringing
> device ens160 up
> Oct 10 09:20:12 app-db-slave IPaddr2-app(vip_mysql)[21134]: INFO:
> /usr/libexec/heartbeat/send_arp -i 200 -c 5 -I ens160 -s 10.30.3.229
> 10.30.255.255
> Oct 10 09:20:12 app-db-slave crmd[1175]:  notice: Result of start operation
> for vip_mysql on app-db-slave: 0 (ok)
> Oct 10 09:20:12 app-db-slave lrmd[1171]:  notice:
> ms_mysql_promote_0:21165:stderr [ Error performing operation: No such
> device or address ]
> Oct 10 09:20:12 app-db-slave crmd[1175]:  notice: Result of promote
> operation for ms_mysql on app-db-slave: 0 (ok)
> Oct 10 09:20:12 app-db-slave mysql-app(ms_mysql)[21285]: INFO: app-db-slave
> This will be the new master, ignoring post-promote notification.
> Oct 10 09:20:12 app-db-slave crmd[1175]:  notice: Result of notify
> operation for ms_mysql on app-db-slave: 0 (ok)
> 
> 
> Node app-quorum: standby
> Online: [ app-central-master app-central-slave app-db-slave ]
> OFFLINE: [ app-db-master ]
> 
> Active resources:
> 
>  Master/Slave Set: ms_mysql-master [ms_mysql]
>      Masters: [ app-db-slave ]
> vip_mysql       (ocf::heartbeat:IPaddr2-app):      Started app-db-slave
> 

At this point you have the database in master state on two nodes. You do not
see it, but that does not change the fact that the other, invisible node
still believes it *is* the master.
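
A quick way to confirm this (a hedged sketch; credentials come from the
configuration above, and the isolated node has to be checked via its VM
console since it is unreachable over the network) is to look at the
read_only flag on both nodes -- the agent keeps a proper slave read-only
via set_read_only, so two nodes answering read_only=OFF means two masters:

    # run locally on each node
    mysql -u root -pmysqlrootpw -e "SHOW VARIABLES LIKE 'read_only'"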

> And logs from the master during its isolation :
> 
> Oct 10 09:23:10 app-db-master corosync[1029]: [MAIN  ] Totem is unable to
> form a cluster because of an operating system or network fault (reason:
> totem is continuously in gather state). The most common cause of this
> message is that the local firewall is configured improperly.
> Oct 10 09:23:11 app-db-master corosync[1029]: [MAIN  ] Totem is unable to
> form a cluster because of an operating system or network fault (reason:
> totem is continuously in gather state). The most common cause of this
> message is that the local firewall is configured improperly.
> Oct 10 09:23:13 app-db-master corosync[1029]: [MAIN  ] Totem is unable to
> form a cluster because of an operating system or network fault (reason:
> totem is continuously in gather state). The most common cause of this
> message is that the local firewall is configured improperly.
> Oct 10 09:23:14 app-db-master corosync[1029]: [MAIN  ] Totem is unable to
> form a cluster because of an operating system or network fault (reason:
> totem is continuously in gather state). The most common cause of this
> message is that the local firewall is configured improperly.
> Oct 10 09:23:16 app-db-master corosync[1029]: [MAIN  ] Totem is unable to
> form a cluster because of an operating system or network fault (reason:
> totem is continuously in gather state). The most common cause of this
> message is that the local firewall is configured improperly.
> Oct 10 09:23:17 app-db-master corosync[1029]: [MAIN  ] Totem is unable to
> form a cluster because of an operating system or network fault (reason:
> totem is continuously in gather state). The most common cause of this
> message is that the local firewall is configured improperly.
> Oct 10 09:23:19 app-db-master corosync[1029]: [MAIN  ] Totem is unable to
> form a cluster because of an operating system or network fault (reason:
> totem is continuously in gather state). The most common cause of this
> message is that the local firewall is configured improperly.
> Oct 10 09:23:20 app-db-master corosync[1029]: [MAIN  ] Totem is unable to
> form a cluster because of an operating system or network fault (reason:
> totem is continuously in gather state). The most common cause of this
> message is that the local firewall is configured improperly.
> Oct 10 09:23:22 app-db-master corosync[1029]: [MAIN  ] Totem is unable to
> form a cluster because of an operating system or network fault (reason:
> totem is continuously in gather state). The most common cause of this
> message is that the local firewall is configured improperly.
> Oct 10 09:23:23 app-db-master corosync[1029]: [MAIN  ] Totem is unable to
> form a cluster because of an operating system or network fault (reason:
> totem is continuously in gather state). The most common cause of this
> message is that the local firewall is configured improperly.
> Oct 10 09:23:25 app-db-master corosync[1029]: [MAIN  ] Totem is unable to
> form a cluster because of an operating system or network fault (reason:
> totem is continuously in gather state). The most common cause of this
> message is that the local firewall is configured improperly.
> Oct 10 09:23:26 app-db-master corosync[1029]: [MAIN  ] Totem is unable to
> form a cluster because of an operating system or network fault (reason:
> totem is continuously in gather state). The most common cause of this
> message is that the local firewall is configured improperly.
> Oct 10 09:23:28 app-db-master corosync[1029]: [MAIN  ] Totem is unable to
> form a cluster because of an operating system or network fault (reason:
> totem is continuously in gather state). The most common cause of this
> message is that the local firewall is configured improperly.
> Oct 10 09:23:29 app-db-master corosync[1029]: [MAIN  ] Totem is unable to
> form a cluster because of an operating system or network fault (reason:
> totem is continuously in gather state). The most common cause of this
> message is that the local firewall is configured improperly.
> Oct 10 09:23:31 app-db-master kernel: vmxnet3 0000:03:00.0 ens160: NIC Link
> is Up 10000 Mbps
> Oct 10 09:23:31 app-db-master NetworkManager[692]: <info>
> [1539156211.1436] device (ens160): carrier: link connected
> Oct 10 09:23:31 app-db-master NetworkManager[692]: <info>
> [1539156211.1444] device (ens160): state change: unavailable ->
> disconnected (reason 'carrier-changed', sys-iface-state: 'managed')
> Oct 10 09:23:31 app-db-master NetworkManager[692]: <info>
> [1539156211.1456] policy: auto-activating connection 'ens160'
> Oct 10 09:23:31 app-db-master NetworkManager[692]: <info>
> [1539156211.1470] device (ens160): Activation: starting connection 'ens160'
> (9fe36e64-13ca-40cb-a174-5b4e16b826f4)
> Oct 10 09:23:31 app-db-master NetworkManager[692]: <info>
> [1539156211.1473] device (ens160): state change: disconnected -> prepare
> (reason 'none', sys-iface-state: 'managed')
> Oct 10 09:23:31 app-db-master NetworkManager[692]: <info>
> [1539156211.1474] manager: NetworkManager state is now CONNECTING
> Oct 10 09:23:31 app-db-master NetworkManager[692]: <info>
> [1539156211.1479] device (ens160): state change: prepare -> config (reason
> 'none', sys-iface-state: 'managed')
> Oct 10 09:23:31 app-db-master NetworkManager[692]: <info>
> [1539156211.1485] device (ens160): state change: config -> ip-config
> (reason 'none', sys-iface-state: 'managed')
> Oct 10 09:23:31 app-db-master NetworkManager[692]: <info>
> [1539156211.2214] device (ens160): state change: ip-config -> ip-check
> (reason 'none', sys-iface-state: 'managed')
> Oct 10 09:23:31 app-db-master NetworkManager[692]: <info>
> [1539156211.2235] device (ens160): state change: ip-check -> secondaries
> (reason 'none', sys-iface-state: 'managed')
> Oct 10 09:23:31 app-db-master NetworkManager[692]: <info>
> [1539156211.2238] device (ens160): state change: secondaries -> activated
> (reason 'none', sys-iface-state: 'managed')
> Oct 10 09:23:31 app-db-master NetworkManager[692]: <info>
> [1539156211.2240] manager: NetworkManager state is now CONNECTED_LOCAL
> Oct 10 09:23:31 app-db-master NetworkManager[692]: <info>
> [1539156211.2554] manager: NetworkManager state is now CONNECTED_SITE
> Oct 10 09:23:31 app-db-master NetworkManager[692]: <info>
> [1539156211.2555] policy: set 'ens160' (ens160) as default for IPv4 routing
> and DNS
> Oct 10 09:23:31 app-db-master systemd: Starting Network Manager Script
> Dispatcher Service...
> Oct 10 09:23:31 app-db-master NetworkManager[692]: <info>
> [1539156211.2556] device (ens160): Activation: successful, device activated.
> Oct 10 09:23:31 app-db-master NetworkManager[692]: <info>
> [1539156211.2564] manager: NetworkManager state is now CONNECTED_GLOBAL
> Oct 10 09:23:31 app-db-master dbus[686]: [system] Activating via systemd:
> service name='org.freedesktop.nm_dispatcher'
> unit='dbus-org.freedesktop.nm-dispatcher.service'
> Oct 10 09:23:31 app-db-master dbus[686]: [system] Successfully activated
> service 'org.freedesktop.nm_dispatcher'
> Oct 10 09:23:31 app-db-master systemd: Started Network Manager Script
> Dispatcher Service.
> Oct 10 09:23:31 app-db-master nm-dispatcher: req:1 'up' [ens160]: new
> request (3 scripts)
> Oct 10 09:23:31 app-db-master nm-dispatcher: req:1 'up' [ens160]: start
> running ordered scripts...
> Oct 10 09:23:31 app-db-master nm-dispatcher: req:2 'connectivity-change':
> new request (3 scripts)
> Oct 10 09:23:31 app-db-master nm-dispatcher: req:2 'connectivity-change':
> start running ordered scripts...
> Oct 10 09:23:31 app-db-master corosync[1029]: [MAIN  ] Totem is unable to
> form a cluster because of an operating system or network fault (reason:
> totem is continuously in gather state). The most common cause of this
> message is that the local firewall is configured improperly.
> Oct 10 09:23:31 app-db-master corosync[1029]: [TOTEM ] The network
> interface [10.30.3.247] is now up.
> Oct 10 09:23:31 app-db-master corosync[1029]: [TOTEM ] adding new UDPU
> member {10.30.3.245}
> Oct 10 09:23:31 app-db-master corosync[1029]: [TOTEM ] adding new UDPU
> member {10.30.3.246}
> Oct 10 09:23:31 app-db-master corosync[1029]: [TOTEM ] adding new UDPU
> member {10.30.3.247}
> Oct 10 09:23:31 app-db-master corosync[1029]: [TOTEM ] adding new UDPU
> member {10.30.3.248}
> Oct 10 09:23:31 app-db-master corosync[1029]: [TOTEM ] adding new UDPU
> member {10.30.3.249}
> 

I do not see anything from pacemaker. In particular, I expect pacemaker to
stop all resources (no-quorum-policy defaults to "stop").
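
If in doubt, you can query the property and make the default explicit (a
minimal sketch using standard pacemaker/pcs commands; an error from the
query usually just means the property is unset and the default applies):

    # query the current value
    crm_attribute -t crm_config -n no-quorum-policy -G

    # set it explicitly so an isolated partition stops its resources
    pcs property set no-quorum-policy=stop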

> As you can see, the node is back online and can communicate with the other
> nodes again, so pacemaker starts mysql as expected and brings it up as a slave:
> 

Again, it is hard to say anything without pacemaker logs. In particular, it is
not clear whether mysql was stopped and started or just demoted.
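
One way to tell the two apart is to grep the syslog on app-db-master for the
operation results crmd reports (the same "Result of ..." entries quoted
above; the log path is an assumption, adjust for your syslog setup):

    grep -E "Result of (start|stop|promote|demote) operation for ms_mysql" \
        /var/log/messages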

> Node app-quorum: standby
> Online: [ app-central-master app-central-slave app-db-master app-db-slave ]
> 
> Active resources:
> 
>  Master/Slave Set: ms_mysql-master [ms_mysql]
>      Masters: [ app-db-slave ]
>      Slaves: [ app-db-master ]
> 
> Resource-agents oriented logs are below :
> 
> Master :
> Oct 10 09:24:01 app-db-master crmd[5177]:  notice: Result of demote
> operation for ms_mysql on app-db-master: 0 (ok)
> Oct 10 09:24:02 app-db-master mysql-app(ms_mysql)[5592]: INFO:
> app-db-master Ignoring post-demote notification for my own demotion.
> Oct 10 09:24:02 app-db-master crmd[5177]:  notice: Result of notify
> operation for ms_mysql on app-db-master: 0 (ok)
> 

That's again incomplete.

> Slave:
> 
> Oct 10 09:24:01 app-db-slave crmd[1175]:  notice: Result of notify
> operation for ms_mysql on app-db-slave: 0 (ok)
> Oct 10 09:24:02 app-db-slave mysql-app(ms_mysql)[22969]: INFO: app-db-slave
> Ignoring pre-demote notification execpt for my own demotion.
> Oct 10 09:24:02 app-db-slave crmd[1175]:  notice: Result of notify
> operation for ms_mysql on app-db-slave: 0 (ok)
> Oct 10 09:24:03 app-db-slave mysql-app(ms_mysql)[22999]: INFO: app-db-slave
> post-demote notification for app-db-master.
> Oct 10 09:24:03 app-db-slave mysql-app(ms_mysql)[22999]: WARNING: Attempted
> to unset the replication master on an instance that is not configured as a
> replication slave
> Oct 10 09:24:03 app-db-slave crmd[1175]:  notice: Result of notify
> operation for ms_mysql on app-db-slave: 0 (ok)
> 
> So I expect replication to be running at this point,

Why? The agent starts replication (at least, it sets the master to
replicate from) in two cases:

a) The node is notified about the promotion of another node. In this case
app-db-master was disconnected, so it could not be notified about the
promotion.

b) The resource is started while the resource is master on another node.
From the logs you provided it cannot be determined what exactly happened
on the app-db-master node. If it was simply demoted (the only log entries
are about demotion), then no replication was configured either.
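
As an aside, the ms_mysql_REPL_INFO property in your configuration
(app-db-master|mysql-bin.000012|327) looks like it stores the master host,
binlog file and position. If the agent never issued CHANGE MASTER TO, a
manual re-point along these lines should get replication going again (a
hedged sketch: it assumes the pipe-separated layout really is
host|file|position, that the stored value points at the *current* master,
and it reuses the replication credentials from the configuration):

    repl_info=$(crm_attribute -t crm_config -n ms_mysql_REPL_INFO -G -q)
    IFS='|' read -r master_host log_file log_pos <<< "$repl_info"
    mysql -u root -pmysqlrootpw -e "
        CHANGE MASTER TO
            MASTER_HOST='$master_host',
            MASTER_USER='app-repl',
            MASTER_PASSWORD='mysqlreplpw',
            MASTER_LOG_FILE='$log_file',
            MASTER_LOG_POS=$log_pos;
        START SLAVE;"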

> but when I perform
> SHOW SLAVE STATUS on my *new* slave, I get an empty response :
> 
> MariaDB [(none)]> SHOW SLAVE STATUS \G
> Empty set (0.00 sec)
> 
> MariaDB [(none)]> Ctrl-C -- exit!
> Aborted
> [root at app-db-master ~]# bash
> /etc/app-failover/mysql-exploit/mysql-check-status.sh
> Connection Status 'app-db-master' [OK]
> Connection Status 'app-db-slave' [OK]
> Slave Thread Status [KO]
> Error reports:
>     No slave (maybe because we cannot check a server).
> Position Status [SKIP]
> Error reports:
>     Skip because we can't identify a unique slave.
> 
> From what I understand, the is_slave function from
> https://github.com/ClusterLabs/resource-agents/blob/RHEL6/heartbeat/mysql
> works as expected: since it gets an empty set during the monitor action, it
> does not consider the instance a replication slave. So I guess the problem
> comes from the issue already presented above, the CHANGE MASTER TO query
> failing with "ERROR 1200 (HY000) at line 1: Misconfigured slave: MASTER_HOST
> was not set;"
> 
> I may be missing something obvious... Please tell me if I can provide more
> information about my issue.
> 

You need stonith. Only that way can you ensure a known resource state
and reliable state transitions. Without stonith it is unpredictable when
the node rejoins the cluster and in which state its resources are when
that happens. With stonith, the cluster would wait for the node to be
eliminated and only then promote the new master; then, on startup,
pacemaker on app-db-master would wait for the other nodes and start
resources only after it had joined the cluster. That means mysql on this
node would see the master on another node and hopefully configure itself
as a slave. At least this is how I interpret

    if ocf_is_ms; then
        # We're configured as a stateful resource. We must start as
        # slave by default. At this point we don't know if the CRM has
        # already promoted a master. So, we simply start in read only
        # mode.
        set_read_only on

        # Now, let's see whether there is a master. We might be a new
        # node that is just joining the cluster, and the CRM may have
        # promoted a master before.
        master_host=`echo $OCF_RESKEY_CRM_meta_notify_master_uname|tr -d " "`
        if [ "$master_host" -a "$master_host" != ${NODENAME} ]; then
            ocf_log info "Changing MySQL configuration to replicate from $master_host."
            set_master
            start_slave
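
To make the stonith recommendation concrete: since these nodes are VMware
guests (the interface was disconnected from the vSphere console),
fence_vmware_soap would be the natural agent to try. A minimal sketch --
the vCenter address, credentials and VM names are all placeholders:

    pcs stonith create vmfence fence_vmware_soap \
        ipaddr=vcenter.example.com login=fence-user passwd=fence-pass \
        ssl=1 ssl_insecure=1 \
        pcmk_host_map="app-db-master:db-master-vm;app-db-slave:db-slave-vm"
    pcs property set stonith-enabled=true

With fencing in place the cluster waits until the isolated node is powered
off before promoting a new master, so the old master always rejoins from a
clean stop and sees the existing master on startup.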

