[ClusterLabs] Questions about pacemaker/ mysql resource agent behaviour when network fail

Simon Bomm simon.bomm at gmail.com
Wed Oct 10 10:18:45 UTC 2018


On Sat, Oct 6, 2018 at 06:13, Andrei Borzenkov <arvidjaar at gmail.com>
wrote:

> 05.10.2018 15:00, Simon Bomm wrote:
> > Hi all,
> >
> > Using pacemaker 1.1.18-11 and mysql resource agent (
> >
> https://github.com/ClusterLabs/resource-agents/blob/RHEL6/heartbeat/mysql
> ),
> > I run into unwanted behaviour. From my point of view, of course; maybe it
> > is expected to behave this way, which is why I ask.
> >
> > # My test case is the following:
> >
> > Everything is OK on my cluster, crm_mon output is as below (no failed
> > actions)
> >
> >  Master/Slave Set: ms_mysql-master [ms_mysql]
> >      Masters: [ db-master ]
> >      Slaves: [ db-slave ]
> >
> > 1. I insert into a table on the master; no issue, the data is replicated.
> > 2. I shut down the network interface on the master (VM),
>
>
First, thanks for taking the time to answer me.


> What exactly does it mean? How do you shut down net?
>
>
I disconnect the network card from the VMware vSphere console.


> > pacemaker correctly starts the resource on the
> > other node. The master is seen as offline, and db-slave is now master
> >
> >  Master/Slave Set: ms_mysql-master [ms_mysql]
> >      Masters: [ db-slave ]
> >
> > 3. I bring the network interface back up, pacemaker sees the node online and sets the
> > old master as the new slave:
> >
> >  Master/Slave Set: ms_mysql-master [ms_mysql]
> >      Masters: [ db-slave ]
> >      Slaves: [ db-master ]
> >
> > 4. From this point, my external monitoring bash script shows that the SQL and
> > IO threads are not running, but I can't see any error in the pcs
> > status/crm_mon outputs.
>
> Pacemaker just shows what resource agents claim. If the resource agent
> claims the resource is started, there is nothing pacemaker can do. You need
> to debug what the resource agent does.
>
>
I've debugged it quite a lot, and that's what drove me to isolate the error
below:

> mysql -h localhost -u user-repl -pmysqlreplpw -e "START SLAVE"
> ERROR 1200 (HY000) at line 1: Misconfigured slave: MASTER_HOST was not set;
> Fix in config file or with CHANGE MASTER TO



> > The consequence is that I continue inserting on my newly
> > promoted master, but the data is never replicated to my former master.
> >
> > # Questions :
> >
> > - Is this some kind of safety behaviour to avoid data corruption when a
> > node comes back online?
> > - When I try to start it manually the way the OCF agent does, it returns this error:
> >
> > mysql -h localhost -u user-repl -pmysqlreplpw -e "START SLAVE"
> > ERROR 1200 (HY000) at line 1: Misconfigured slave: MASTER_HOST was not set;
> > Fix in config file or with CHANGE MASTER TO
> >
> > - I would expect the cluster to stop the slave and show a failed action,
> > am I wrong here?
> >
>
> I am not familiar with the specific application and its structure. From
> quick browsing, the monitor action mostly checks for a running process. Is
> the MySQL process running?
>

Yes it is; as you mentioned previously, the configuration wants pacemaker to start
the mysql resource, so no problem there.
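
For reference, when I check by hand I use something close to what I understand the
agent's monitor to be doing. This is only a sketch, not the agent's exact commands,
reusing the pid file, socket and test credentials from my resource configuration:

# is the mysqld process from the pid file still alive?
test -f /var/lib/mysql/mysql.pid && kill -0 "$(cat /var/lib/mysql/mysql.pid)" && echo "process OK"
# basic connectivity test, similar to the agent's test query
mysql -u root -pmysqlrootpw -S /var/lib/mysql/mysql.sock -e "SELECT 1"
# replication state, which the slave-side monitoring relies on
mysql -u root -pmysqlrootpw -S /var/lib/mysql/mysql.sock -e "SHOW SLAVE STATUS \G"

All of these come back clean except the last one, which returns the empty set shown
further below.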

>
> > # Other details (not sure it matters a lot)
> >
> > No stonith enabled, no fencing or auto-failback.
>
> How are you going to resolve split-brain without stonith? "Stopping net"
> sounds exactly like split brain, in which case further investigation is
> rather pointless.
>
>
You make a good point. As I'm not very familiar with stonithd, I initially disabled
it to avoid unwanted behaviour, but I'll definitely follow your advice and
dig into it.
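
As a starting point, since the nodes are VMware guests, I understand it would look
roughly like the following. This is only a sketch, assuming the fence_vmware_soap
agent and a reachable vCenter; the address, credentials and VM names are placeholders:

pcs stonith create vmfence fence_vmware_soap \
    ipaddr=vcenter.example.com ssl_insecure=1 \
    login=fence-user passwd=fence-pass \
    pcmk_host_map="app-db-master:app-db-master-vm;app-db-slave:app-db-slave-vm" \
    op monitor interval=60s
pcs property set stonith-enabled=true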


> Anyway, to give a non-hypothetical answer, the full configuration and logs
> from both systems are needed.
>
>
Sure, please find the full configuration below:

Cluster Name: app_cluster
Corosync Nodes:
 app-central-master app-central-slave app-db-master app-db-slave app-quorum
Pacemaker Nodes:
 app-central-master app-central-slave app-db-master app-db-slave app-quorum

Resources:
 Master: ms_mysql-master
  Meta Attrs: master-node-max=1 clone_max=2 globally-unique=false
clone-node-max=1 notify=true
  Resource: ms_mysql (class=ocf provider=heartbeat type=mysql-app)
   Attributes: binary=/usr/bin/mysqld_safe config=/etc/my.cnf.d/server.cnf
datadir=/var/lib/mysql evict_outdated_slaves=false max_slave_lag=15
pid=/var/lib/mysql/mysql.pid replication_passwd=mysqlreplpw
replication_user=app-repl socket=/var/lib/mysql/mysql.sock
test_passwd=mysqlrootpw test_user=root
   Operations: demote interval=0s timeout=120 (ms_mysql-demote-interval-0s)
               monitor interval=20 timeout=30 (ms_mysql-monitor-interval-20)
               monitor interval=10 role=Master timeout=30
(ms_mysql-monitor-interval-10)
               monitor interval=30 role=Slave timeout=30
(ms_mysql-monitor-interval-30)
               notify interval=0s timeout=90 (ms_mysql-notify-interval-0s)
               promote interval=0s timeout=120
(ms_mysql-promote-interval-0s)
               start interval=0s timeout=120 (ms_mysql-start-interval-0s)
               stop interval=0s timeout=120 (ms_mysql-stop-interval-0s)
 Resource: vip_mysql (class=ocf provider=heartbeat type=IPaddr2-app)
  Attributes: broadcast=10.30.255.255 cidr_netmask=16 flush_routes=true
ip=10.30.3.229 nic=ens160
  Operations: monitor interval=10s timeout=20s
(vip_mysql-monitor-interval-10s)
              start interval=0s timeout=20s (vip_mysql-start-interval-0s)
              stop interval=0s timeout=20s (vip_mysql-stop-interval-0s)
 Group: app
  Resource: misc_app (class=ocf provider=heartbeat type=misc-app)
   Attributes: crondir=/etc/app-failover/resources/cron/,/etc/cron.d/
   Meta Attrs: target-role=started
   Operations: monitor interval=5s timeout=20s
(misc_app-monitor-interval-5s)
               start interval=0s timeout=20s (misc_app-start-interval-0s)
               stop interval=0s timeout=20s (misc_app-stop-interval-0s)
  Resource: cbd_central_broker (class=ocf provider=heartbeat
type=cbd-central-broker)
   Meta Attrs: target-role=started
   Operations: monitor interval=5s timeout=20s
(cbd_central_broker-monitor-interval-5s)
               start interval=0s timeout=90s
(cbd_central_broker-start-interval-0s)
               stop interval=0s timeout=90s
(cbd_central_broker-stop-interval-0s)
  Resource: centcore (class=ocf provider=heartbeat type=centcore)
   Meta Attrs: target-role=started
   Operations: monitor interval=5s timeout=20s
(centcore-monitor-interval-5s)
               start interval=0s timeout=90s (centcore-start-interval-0s)
               stop interval=0s timeout=90s (centcore-stop-interval-0s)
  Resource: apptrapd (class=ocf provider=heartbeat type=apptrapd)
   Meta Attrs: target-role=started
   Operations: monitor interval=5s timeout=20s
(apptrapd-monitor-interval-5s)
               start interval=0s timeout=90s (apptrapd-start-interval-0s)
               stop interval=0s timeout=90s (apptrapd-stop-interval-0s)
  Resource: app_central_sync (class=ocf provider=heartbeat
type=app-central-sync)
   Meta Attrs: target-role=started
   Operations: monitor interval=5s timeout=20s
(app_central_sync-monitor-interval-5s)
               start interval=0s timeout=90s
(app_central_sync-start-interval-0s)
               stop interval=0s timeout=90s
(app_central_sync-stop-interval-0s)
  Resource: snmptrapd (class=ocf provider=heartbeat type=snmptrapd)
   Meta Attrs: target-role=started
   Operations: monitor interval=5s timeout=20s
(snmptrapd-monitor-interval-5s)
               start interval=0s timeout=90s (snmptrapd-start-interval-0s)
               stop interval=0s timeout=90s (snmptrapd-stop-interval-0s)
  Resource: http (class=ocf provider=heartbeat type=apacheapp)
   Meta Attrs: target-role=started
   Operations: monitor interval=5s timeout=20s (http-monitor-interval-5s)
               start interval=0s timeout=40s (http-start-interval-0s)
               stop interval=0s timeout=60s (http-stop-interval-0s)
  Resource: vip_app (class=ocf provider=heartbeat type=IPaddr2-app)
   Attributes: broadcast=10.30.255.255 cidr_netmask=16 flush_routes=true
ip=10.30.3.230 nic=ens160
   Meta Attrs: target-role=started
   Operations: monitor interval=10s timeout=20s
(vip_app-monitor-interval-10s)
               start interval=0s timeout=20s (vip_app-start-interval-0s)
               stop interval=0s timeout=20s (vip_app-stop-interval-0s)
  Resource: centengine (class=ocf provider=heartbeat type=centengine)
   Meta Attrs: multiple-active=stop_start target-role=started
   Operations: monitor interval=5s timeout=20s
(centengine-monitor-interval-5s)
               start interval=0s timeout=90s (centengine-start-interval-0s)
               stop interval=0s timeout=90s (centengine-stop-interval-0s)

Stonith Devices:
Fencing Levels:

Location Constraints:
  Resource: app
    Disabled on: app-db-master (score:-INFINITY)
(id:location-app-app-db-master--INFINITY)
    Disabled on: app-db-slave (score:-INFINITY)
(id:location-app-app-db-slave--INFINITY)
  Resource: ms_mysql
    Disabled on: app-central-master (score:-INFINITY)
(id:location-ms_mysql-app-central-master--INFINITY)
    Disabled on: app-central-slave (score:-INFINITY)
(id:location-ms_mysql-app-central-slave--INFINITY)
  Resource: vip_mysql
    Disabled on: app-central-master (score:-INFINITY)
(id:location-vip_mysql-app-central-master--INFINITY)
    Disabled on: app-central-slave (score:-INFINITY)
(id:location-vip_mysql-app-central-slave--INFINITY)
Ordering Constraints:
Colocation Constraints:
  vip_mysql with ms_mysql-master (score:INFINITY) (rsc-role:Started)
(with-rsc-role:Master)
  ms_mysql-master with vip_mysql (score:INFINITY) (rsc-role:Master)
(with-rsc-role:Started)
Ticket Constraints:

Alerts:
 No alerts defined

Resources Defaults:
 resource-stickiness: INFINITY
Operations Defaults:
 No defaults set

Cluster Properties:
 cluster-infrastructure: corosync
 cluster-name: app_cluster
 dc-version: 1.1.18-11.el7_5.3-2b07d5c5a9
 have-watchdog: false
 last-lrm-refresh: 1538740285
 ms_mysql_REPL_INFO: app-db-master|mysql-bin.000012|327
 stonith-enabled: false
 symmetric-cluster: true
Node Attributes:
 app-quorum: standby=on

Quorum:
  Options:
  Device:
    votes: 1
    Model: net
      algorithm: ffsplit
      host: app-quorum
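
One line from the configuration above worth highlighting is the ms_mysql_REPL_INFO
cluster property (app-db-master|mysql-bin.000012|327). As far as I understand it, this
is where the agent keeps the master host and binlog coordinates it later feeds to
CHANGE MASTER TO, and it can be inspected directly, for example with the standard
crm_attribute flags (a sketch):

crm_attribute --type crm_config --name ms_mysql_REPL_INFO --query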


Logs are below.

From the SLAVE when I disconnect the interface (the node is isolated), with the
associated crm_mon output; this part looks fine to me and I can follow the behaviour:

Oct 10 09:20:07 app-db-slave corosync[1055]: [TOTEM ] A processor failed,
forming new configuration.
Oct 10 09:20:11 app-db-slave corosync[1055]: [TOTEM ] A new membership (
10.30.3.245:196) was formed. Members left: 3
Oct 10 09:20:11 app-db-slave corosync[1055]: [TOTEM ] Failed to receive the
leave message. failed: 3
Oct 10 09:20:11 app-db-slave corosync[1055]: [QUORUM] Members[4]: 1 2 4 5
Oct 10 09:20:11 app-db-slave corosync[1055]: [MAIN  ] Completed service
synchronization, ready to provide service.
Oct 10 09:20:11 app-db-slave cib[1168]:  notice: Node app-db-master state
is now lost
Oct 10 09:20:11 app-db-slave attrd[1172]:  notice: Node app-db-master state
is now lost
Oct 10 09:20:11 app-db-slave attrd[1172]:  notice: Removing all
app-db-master attributes for peer loss
Oct 10 09:20:11 app-db-slave stonith-ng[1170]:  notice: Node app-db-master
state is now lost
Oct 10 09:20:11 app-db-slave pacemakerd[1084]:  notice: Node app-db-master
state is now lost
Oct 10 09:20:11 app-db-slave crmd[1175]:  notice: Node app-db-master state
is now lost
Oct 10 09:20:11 app-db-slave cib[1168]:  notice: Purged 1 peer with id=3
and/or uname=app-db-master from the membership cache
Oct 10 09:20:11 app-db-slave stonith-ng[1170]:  notice: Purged 1 peer with
id=3 and/or uname=app-db-master from the membership cache
Oct 10 09:20:11 app-db-slave attrd[1172]:  notice: Purged 1 peer with id=3
and/or uname=app-db-master from the membership cache
Oct 10 09:20:11 app-db-slave crmd[1175]:  notice: Result of notify
operation for ms_mysql on app-db-slave: 0 (ok)
Oct 10 09:20:12 app-db-slave mysql-app(ms_mysql)[21165]: INFO: app-db-slave
promote is starting
Oct 10 09:20:12 app-db-slave IPaddr2-app(vip_mysql)[21134]: INFO: Adding
inet address 10.30.3.229/16 with broadcast address 10.30.255.255 to device
ens160
Oct 10 09:20:12 app-db-slave IPaddr2-app(vip_mysql)[21134]: INFO: Bringing
device ens160 up
Oct 10 09:20:12 app-db-slave IPaddr2-app(vip_mysql)[21134]: INFO:
/usr/libexec/heartbeat/send_arp -i 200 -c 5 -I ens160 -s 10.30.3.229
10.30.255.255
Oct 10 09:20:12 app-db-slave crmd[1175]:  notice: Result of start operation
for vip_mysql on app-db-slave: 0 (ok)
Oct 10 09:20:12 app-db-slave lrmd[1171]:  notice:
ms_mysql_promote_0:21165:stderr [ Error performing operation: No such
device or address ]
Oct 10 09:20:12 app-db-slave crmd[1175]:  notice: Result of promote
operation for ms_mysql on app-db-slave: 0 (ok)
Oct 10 09:20:12 app-db-slave mysql-app(ms_mysql)[21285]: INFO: app-db-slave
This will be the new master, ignoring post-promote notification.
Oct 10 09:20:12 app-db-slave crmd[1175]:  notice: Result of notify
operation for ms_mysql on app-db-slave: 0 (ok)


Node app-quorum: standby
Online: [ app-central-master app-central-slave app-db-slave ]
OFFLINE: [ app-db-master ]

Active resources:

 Master/Slave Set: ms_mysql-master [ms_mysql]
     Masters: [ app-db-slave ]
vip_mysql       (ocf::heartbeat:IPaddr2-app):      Started app-db-slave

And the logs from the master during its isolation:

Oct 10 09:23:10 app-db-master corosync[1029]: [MAIN  ] Totem is unable to
form a cluster because of an operating system or network fault (reason:
totem is continuously in gather state). The most common cause of this
message is that the local firewall is configured improperly.
Oct 10 09:23:11 app-db-master corosync[1029]: [MAIN  ] Totem is unable to
form a cluster because of an operating system or network fault (reason:
totem is continuously in gather state). The most common cause of this
message is that the local firewall is configured improperly.
Oct 10 09:23:13 app-db-master corosync[1029]: [MAIN  ] Totem is unable to
form a cluster because of an operating system or network fault (reason:
totem is continuously in gather state). The most common cause of this
message is that the local firewall is configured improperly.
Oct 10 09:23:14 app-db-master corosync[1029]: [MAIN  ] Totem is unable to
form a cluster because of an operating system or network fault (reason:
totem is continuously in gather state). The most common cause of this
message is that the local firewall is configured improperly.
Oct 10 09:23:16 app-db-master corosync[1029]: [MAIN  ] Totem is unable to
form a cluster because of an operating system or network fault (reason:
totem is continuously in gather state). The most common cause of this
message is that the local firewall is configured improperly.
Oct 10 09:23:17 app-db-master corosync[1029]: [MAIN  ] Totem is unable to
form a cluster because of an operating system or network fault (reason:
totem is continuously in gather state). The most common cause of this
message is that the local firewall is configured improperly.
Oct 10 09:23:19 app-db-master corosync[1029]: [MAIN  ] Totem is unable to
form a cluster because of an operating system or network fault (reason:
totem is continuously in gather state). The most common cause of this
message is that the local firewall is configured improperly.
Oct 10 09:23:20 app-db-master corosync[1029]: [MAIN  ] Totem is unable to
form a cluster because of an operating system or network fault (reason:
totem is continuously in gather state). The most common cause of this
message is that the local firewall is configured improperly.
Oct 10 09:23:22 app-db-master corosync[1029]: [MAIN  ] Totem is unable to
form a cluster because of an operating system or network fault (reason:
totem is continuously in gather state). The most common cause of this
message is that the local firewall is configured improperly.
Oct 10 09:23:23 app-db-master corosync[1029]: [MAIN  ] Totem is unable to
form a cluster because of an operating system or network fault (reason:
totem is continuously in gather state). The most common cause of this
message is that the local firewall is configured improperly.
Oct 10 09:23:25 app-db-master corosync[1029]: [MAIN  ] Totem is unable to
form a cluster because of an operating system or network fault (reason:
totem is continuously in gather state). The most common cause of this
message is that the local firewall is configured improperly.
Oct 10 09:23:26 app-db-master corosync[1029]: [MAIN  ] Totem is unable to
form a cluster because of an operating system or network fault (reason:
totem is continuously in gather state). The most common cause of this
message is that the local firewall is configured improperly.
Oct 10 09:23:28 app-db-master corosync[1029]: [MAIN  ] Totem is unable to
form a cluster because of an operating system or network fault (reason:
totem is continuously in gather state). The most common cause of this
message is that the local firewall is configured improperly.
Oct 10 09:23:29 app-db-master corosync[1029]: [MAIN  ] Totem is unable to
form a cluster because of an operating system or network fault (reason:
totem is continuously in gather state). The most common cause of this
message is that the local firewall is configured improperly.
Oct 10 09:23:31 app-db-master kernel: vmxnet3 0000:03:00.0 ens160: NIC Link
is Up 10000 Mbps
Oct 10 09:23:31 app-db-master NetworkManager[692]: <info>
[1539156211.1436] device (ens160): carrier: link connected
Oct 10 09:23:31 app-db-master NetworkManager[692]: <info>
[1539156211.1444] device (ens160): state change: unavailable ->
disconnected (reason 'carrier-changed', sys-iface-state: 'managed')
Oct 10 09:23:31 app-db-master NetworkManager[692]: <info>
[1539156211.1456] policy: auto-activating connection 'ens160'
Oct 10 09:23:31 app-db-master NetworkManager[692]: <info>
[1539156211.1470] device (ens160): Activation: starting connection 'ens160'
(9fe36e64-13ca-40cb-a174-5b4e16b826f4)
Oct 10 09:23:31 app-db-master NetworkManager[692]: <info>
[1539156211.1473] device (ens160): state change: disconnected -> prepare
(reason 'none', sys-iface-state: 'managed')
Oct 10 09:23:31 app-db-master NetworkManager[692]: <info>
[1539156211.1474] manager: NetworkManager state is now CONNECTING
Oct 10 09:23:31 app-db-master NetworkManager[692]: <info>
[1539156211.1479] device (ens160): state change: prepare -> config (reason
'none', sys-iface-state: 'managed')
Oct 10 09:23:31 app-db-master NetworkManager[692]: <info>
[1539156211.1485] device (ens160): state change: config -> ip-config
(reason 'none', sys-iface-state: 'managed')
Oct 10 09:23:31 app-db-master NetworkManager[692]: <info>
[1539156211.2214] device (ens160): state change: ip-config -> ip-check
(reason 'none', sys-iface-state: 'managed')
Oct 10 09:23:31 app-db-master NetworkManager[692]: <info>
[1539156211.2235] device (ens160): state change: ip-check -> secondaries
(reason 'none', sys-iface-state: 'managed')
Oct 10 09:23:31 app-db-master NetworkManager[692]: <info>
[1539156211.2238] device (ens160): state change: secondaries -> activated
(reason 'none', sys-iface-state: 'managed')
Oct 10 09:23:31 app-db-master NetworkManager[692]: <info>
[1539156211.2240] manager: NetworkManager state is now CONNECTED_LOCAL
Oct 10 09:23:31 app-db-master NetworkManager[692]: <info>
[1539156211.2554] manager: NetworkManager state is now CONNECTED_SITE
Oct 10 09:23:31 app-db-master NetworkManager[692]: <info>
[1539156211.2555] policy: set 'ens160' (ens160) as default for IPv4 routing
and DNS
Oct 10 09:23:31 app-db-master systemd: Starting Network Manager Script
Dispatcher Service...
Oct 10 09:23:31 app-db-master NetworkManager[692]: <info>
[1539156211.2556] device (ens160): Activation: successful, device activated.
Oct 10 09:23:31 app-db-master NetworkManager[692]: <info>
[1539156211.2564] manager: NetworkManager state is now CONNECTED_GLOBAL
Oct 10 09:23:31 app-db-master dbus[686]: [system] Activating via systemd:
service name='org.freedesktop.nm_dispatcher'
unit='dbus-org.freedesktop.nm-dispatcher.service'
Oct 10 09:23:31 app-db-master dbus[686]: [system] Successfully activated
service 'org.freedesktop.nm_dispatcher'
Oct 10 09:23:31 app-db-master systemd: Started Network Manager Script
Dispatcher Service.
Oct 10 09:23:31 app-db-master nm-dispatcher: req:1 'up' [ens160]: new
request (3 scripts)
Oct 10 09:23:31 app-db-master nm-dispatcher: req:1 'up' [ens160]: start
running ordered scripts...
Oct 10 09:23:31 app-db-master nm-dispatcher: req:2 'connectivity-change':
new request (3 scripts)
Oct 10 09:23:31 app-db-master nm-dispatcher: req:2 'connectivity-change':
start running ordered scripts...
Oct 10 09:23:31 app-db-master corosync[1029]: [MAIN  ] Totem is unable to
form a cluster because of an operating system or network fault (reason:
totem is continuously in gather state). The most common cause of this
message is that the local firewall is configured improperly.
Oct 10 09:23:31 app-db-master corosync[1029]: [TOTEM ] The network
interface [10.30.3.247] is now up.
Oct 10 09:23:31 app-db-master corosync[1029]: [TOTEM ] adding new UDPU
member {10.30.3.245}
Oct 10 09:23:31 app-db-master corosync[1029]: [TOTEM ] adding new UDPU
member {10.30.3.246}
Oct 10 09:23:31 app-db-master corosync[1029]: [TOTEM ] adding new UDPU
member {10.30.3.247}
Oct 10 09:23:31 app-db-master corosync[1029]: [TOTEM ] adding new UDPU
member {10.30.3.248}
Oct 10 09:23:31 app-db-master corosync[1029]: [TOTEM ] adding new UDPU
member {10.30.3.249}

As you can see, the node is back online and can communicate again with the other
nodes, so pacemaker starts mysql as expected and brings it up as a slave:

Node app-quorum: standby
Online: [ app-central-master app-central-slave app-db-master app-db-slave ]

Active resources:

 Master/Slave Set: ms_mysql-master [ms_mysql]
     Masters: [ app-db-slave ]
     Slaves: [ app-db-master ]

Resource-agent-oriented logs are below:

Master :
Oct 10 09:24:01 app-db-master crmd[5177]:  notice: Result of demote
operation for ms_mysql on app-db-master: 0 (ok)
Oct 10 09:24:02 app-db-master mysql-app(ms_mysql)[5592]: INFO:
app-db-master Ignoring post-demote notification for my own demotion.
Oct 10 09:24:02 app-db-master crmd[5177]:  notice: Result of notify
operation for ms_mysql on app-db-master: 0 (ok)

Slave:

Oct 10 09:24:01 app-db-slave crmd[1175]:  notice: Result of notify
operation for ms_mysql on app-db-slave: 0 (ok)
Oct 10 09:24:02 app-db-slave mysql-app(ms_mysql)[22969]: INFO: app-db-slave
Ignoring pre-demote notification execpt for my own demotion.
Oct 10 09:24:02 app-db-slave crmd[1175]:  notice: Result of notify
operation for ms_mysql on app-db-slave: 0 (ok)
Oct 10 09:24:03 app-db-slave mysql-app(ms_mysql)[22999]: INFO: app-db-slave
post-demote notification for app-db-master.
Oct 10 09:24:03 app-db-slave mysql-app(ms_mysql)[22999]: WARNING: Attempted
to unset the replication master on an instance that is not configured as a
replication slave
Oct 10 09:24:03 app-db-slave crmd[1175]:  notice: Result of notify
operation for ms_mysql on app-db-slave: 0 (ok)

So I expect replication to be running at this point, but when I perform
SHOW SLAVE STATUS on my *new* slave, I get an empty response:

MariaDB [(none)]> SHOW SLAVE STATUS \G
Empty set (0.00 sec)

MariaDB [(none)]> Ctrl-C -- exit!
Aborted
[root at app-db-master ~]# bash
/etc/app-failover/mysql-exploit/mysql-check-status.sh
Connection Status 'app-db-master' [OK]
Connection Status 'app-db-slave' [OK]
Slave Thread Status [KO]
Error reports:
    No slave (maybe because we cannot check a server).
Position Status [SKIP]
Error reports:
    Skip because we can't identify a unique slave.

From what I understand, the is_slave function from
https://github.com/ClusterLabs/resource-agents/blob/RHEL6/heartbeat/mysql
works as expected: since it gets an empty set when performing the
monitor action, it does not consider the instance a replication slave. So I guess
the issue is the one already presented above, with the CHANGE MASTER TO step never
having configured the master, hence "ERROR 1200 (HY000) at line 1: Misconfigured
slave: MASTER_HOST was not set;".
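
For completeness, the manual repair I would try looks roughly like this, run on the
new slave (app-db-master). This is only a sketch: the user and password come from the
resource configuration (replication_user/replication_passwd and test_user/test_passwd),
while the binlog file and position below are placeholders copied from the stale
ms_mysql_REPL_INFO value; in practice they would have to come from SHOW MASTER STATUS
on the current master (app-db-slave):

mysql -u root -pmysqlrootpw -S /var/lib/mysql/mysql.sock -e "
  CHANGE MASTER TO
    MASTER_HOST='app-db-slave',
    MASTER_USER='app-repl',
    MASTER_PASSWORD='mysqlreplpw',
    MASTER_LOG_FILE='mysql-bin.000012',
    MASTER_LOG_POS=327;
  START SLAVE;"

This is presumably the step the agent is supposed to perform for the node rejoining as
a slave, and it is exactly what the START SLAVE error above says is missing.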

I may be missing something obvious. Please tell me if I can provide more
information about my issue.
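
If complete logs from both nodes would help, I can collect them with crm_report; a
sketch of the command, with a time window covering the test above:

crm_report -f "2018-10-10 09:00:00" -t "2018-10-10 10:30:00" /tmp/app-db-failover-report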

Rgds

> > Symmetric cluster
> > configured.
> >
> > Details of my pacemaker resource configuration is
> >
> >  Master: ms_mysql-master
> >   Meta Attrs: master-node-max=1 clone_max=2 globally-unique=false
> > clone-node-max=1 notify=true
> >   Resource: ms_mysql (class=ocf provider=heartbeat type=mysql)
> >    Attributes: binary=/usr/bin/mysqld_safe
> config=/etc/my.cnf.d/server.cnf
> > datadir=/var/lib/mysql evict_outdated_slaves=false max_slave_lag=15
> > pid=/var/lib/mysql/mysql.pid replication_passwd=mysqlreplpw
> > replication_user=user-repl socket=/var/lib/mysql/mysql.sock
> > test_passwd=mysqlrootpw test_user=root
> >    Operations: demote interval=0s timeout=120
> (ms_mysql-demote-interval-0s)
> >                monitor interval=20 timeout=30
> (ms_mysql-monitor-interval-20)
> >                monitor interval=10 role=Master timeout=30
> > (ms_mysql-monitor-interval-10)
> >                monitor interval=30 role=Slave timeout=30
> > (ms_mysql-monitor-interval-30)
> >                notify interval=0s timeout=90
> (ms_mysql-notify-interval-0s)
> >                promote interval=0s timeout=120
> > (ms_mysql-promote-interval-0s)
> >                start interval=0s timeout=120 (ms_mysql-start-interval-0s)
> >                stop interval=0s timeout=120 (ms_mysql-stop-interval-0s)
> >
> > Anything I'm missing here? I did not find a clearly similar use case
> > when googling around network outages and pacemaker.
> >
> > Thanks
> >
> >
> >
>
> _______________________________________________
> Users mailing list: Users at clusterlabs.org
> https://lists.clusterlabs.org/mailman/listinfo/users
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
>

