[Pacemaker] [SOLVED] RE: Slave does not start after failover: Mysql circular replication and master-slave resources

Andreas Kurz andreas at hastexo.com
Tue Dec 20 22:16:42 UTC 2011


Hello Attila,

On 12/20/2011 09:29 AM, Attila Megyeri wrote:
> Hi Andreas,
> 
> 
> -----Original Message-----
> From: Andreas Kurz [mailto:andreas at hastexo.com] 
> Sent: 2011. december 19. 15:19
> To: pacemaker at oss.clusterlabs.org
> Subject: Re: [Pacemaker] [SOLVED] RE: Slave does not start after failover: Mysql circular replication and master-slave resources
> 
> On 12/17/2011 10:51 AM, Attila Megyeri wrote:
>> Hi all,
>>
>> For anyone interested.
>> I finally made the MySQL replication work. For some strange reason there were no [mysql] log entries at all, either in corosync.log or in the syslog. After a couple of corosync restarts (?!), [mysql] RA debug/error entries started to show up.
>> The issue was that the slave could not apply the binary logs due to duplicate-key errors. I am not sure how this could happen, but the solution was to ignore the duplicate errors on the slaves by adding the following line to my.cnf:
>>
>> slave-skip-errors = 1062
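>>
>> For context, the relevant my.cnf fragments on the two nodes now look roughly like this (a sketch of my setup; server-ids and log names are illustrative):
>>
>>   # db1, /etc/mysql/my.cnf
>>   [mysqld]
>>   server-id                = 1
>>   log-bin                  = mysql-bin
>>   auto-increment-increment = 2    # two nodes in the circle
>>   auto-increment-offset    = 1    # db1 generates odd auto-increment values
>>   slave-skip-errors        = 1062 # ignore duplicate-key errors on the slave
>>
>>   # db2, same file
>>   [mysqld]
>>   server-id                = 2
>>   log-bin                  = mysql-bin
>>   auto-increment-increment = 2
>>   auto-increment-offset    = 2    # db2 generates even auto-increment values
>>   slave-skip-errors        = 1062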
> 
> although you use different "auto-increment-offset" values?
> 
> 
> Yes... I am actually quite surprised that this can happen. The slave has already applied the binlog, but for some reason it wants to execute it again.
> 
>>
>> I hope this helps to some of you guys as well.
>>
>> P.S. Did anyone else notice missing mysql debug/info/error entries in the corosync log as well?
> 
> There is no RA output/log in any of your syslogs? ... in the absence of a connected tty and with no logd configured, logger should feed all logs to syslog ... what is your distribution, any "fancy" syslog configuration?
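> 
> A quick way to verify that path is to emit a test message the same way the shell-based RAs do when there is no tty (tag and priority here are just examples):
> 
>   logger -t mysql-ra -p daemon.info "RA logging test"
>   grep mysql-ra /var/log/syslog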
> 
> My system is running on Debian Squeeze, with Pacemaker 1.1.5 from the squeeze backports. The syslog configuration is standard, no extras. I have noticed this strange behavior (the RA not logging anything) many times - not only for the mysql resource but also for postgres. For example, I added an ocf_log call at the entry point of the RA, just to log when the script is executed and which parameters were passed - but I did not see any "monitor" invocations either.
> Now it works fine, but this is not an absolutely stable setup.
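> 
> The instrumentation was nothing fancy - roughly a one-liner right after the RA sources the OCF shell functions (the action name arrives as $1):
> 
>   . ${OCF_ROOT}/resource.d/heartbeat/.ocf-shellfuncs
>   ocf_log info "mysql RA invoked: action=$1 resource=${OCF_RESOURCE_INSTANCE}"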

seems to be "improvable", see:

http://bugs.clusterlabs.org/show_bug.cgi?id=5024

> 
> One other very disturbing issue is that sometimes corosync and some of the heartbeat processes get stuck at 100% CPU, and only a restart or kill -9 helps. :(

now that looks really ugly ... are you using the MCP or letting corosync
start pacemaker?
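
(For reference, the difference is the "ver" field in the pacemaker service
block, e.g. in /etc/corosync/service.d/pcmk - path and snippet as commonly
shipped, adjust to your setup:

  service {
     name: pacemaker
     ver:  0   # plugin mode: corosync itself spawns the pacemaker daemons
  }

versus "ver: 1", where pacemakerd (the MCP) is started separately and
supervises the daemons.)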

Regards,
Andreas

-- 
Need help with Pacemaker?
http://www.hastexo.com/now

> 
> Cheers,
> 
> Attila
> 
> Regards,
> Andreas
> 
> 
>>
>> Cheers,
>> Attila
>>
>>
>> -----Original Message-----
>> From: Attila Megyeri [mailto:amegyeri at minerva-soft.com]
>> Sent: 2011. december 16. 12:39
>> To: The Pacemaker cluster resource manager
>> Subject: Re: [Pacemaker] Slave does not start after failover: Mysql 
>> circular replication and master-slave resources
>>
>> Hi Andreas,
>>
>> The slave lag cannot be high, as the slave was restarted within 1-2 minutes and there are no active users on the system yet.
>> I did not find anything at all in the logs.
>>
>> I will double-check whether the RA is the latest version.
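>>
>> (The quickest check is probably a diff of the installed agent against a freshly downloaded copy - "mysql.from-git" is just whatever I name the download:
>>
>>   diff /usr/lib/ocf/resource.d/heartbeat/mysql mysql.from-git
>> )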
>>
>> Thanks,
>>
>> Attila
>>
>>
>> -----Original Message-----
>> From: Andreas Kurz [mailto:andreas at hastexo.com]
>> Sent: 2011. december 16. 1:50
>> To: pacemaker at oss.clusterlabs.org
>> Subject: Re: [Pacemaker] Slave does not start after failover: Mysql 
>> circular replication and master-slave resources
>>
>> Hello Attila,
>>
>> ... see below ...
>>
>> On 12/15/2011 02:42 PM, Attila Megyeri wrote:
>>> Hi All,
>>>
>>> Some time ago I exchanged a couple of posts with you here regarding
>>> MySQL active-active HA.
>>>
>>> The best solution I found so far was MySQL multi-master replication,
>>> also referred to as circular replication.
>>>
>>> Basically I set up two nodes, both capable of the master role, and
>>> changes were immediately propagated to the other node.
>>>
>>> But I still wanted an M/S approach, with a RW master and an RO slave -
>>> mainly because I prefer to have a single master VIP to which my apps
>>> can connect.
>>>
>>> (In the first approach I had configured a two-node clone, and the
>>> master IP was always bound to one of the nodes.)
>>>
>>> I applied the following configuration:
>>>
>>> node db1 \
>>>         attributes IP="10.100.1.31" \
>>>         attributes standby="off" \
>>>         db2-log-file-db-mysql="mysql-bin.000021" db2-log-pos-db-mysql="40730"
>>> node db2 \
>>>         attributes IP="10.100.1.32" \
>>>         attributes standby="off"
>>> primitive db-ip-master ocf:heartbeat:IPaddr2 \
>>>         params lvs_support="true" ip="10.100.1.30" cidr_netmask="8" \
>>>         broadcast="10.255.255.255" \
>>>         op monitor interval="20s" timeout="20s" \
>>>         meta target-role="Started"
>>> primitive db-mysql ocf:heartbeat:mysql \
>>>         params binary="/usr/bin/mysqld_safe" config="/etc/mysql/my.cnf" \
>>>         datadir="/var/lib/mysql" user="mysql" pid="/var/run/mysqld/mysqld.pid" \
>>>         socket="/var/run/mysqld/mysqld.sock" test_passwd="XXXXX" \
>>>         test_table="replicatest.connectioncheck" test_user="slave_user" \
>>>         replication_user="slave_user" replication_passwd="XXXXX" \
>>>         additional_parameters="--skip-slave-start" \
>>>         op start interval="0" timeout="120s" \
>>>         op stop interval="0" timeout="120s" \
>>>         op monitor interval="30" timeout="30s" OCF_CHECK_LEVEL="1" \
>>>         op promote interval="0" timeout="120" \
>>>         op demote interval="0" timeout="120"
>>> ms db-ms-mysql db-mysql \
>>>         meta notify="true" master-max="1" clone-max="2" target-role="Started"
>>> colocation db-ip-with-master inf: db-ip-master db-ms-mysql:Master
>>> property $id="cib-bootstrap-options" \
>>>         dc-version="1.1.5-01e86afaaa6d4a8c4836f68df80ababd6ca3902f" \
>>>         cluster-infrastructure="openais" \
>>>         expected-quorum-votes="2" \
>>>         stonith-enabled="false" \
>>>         no-quorum-policy="ignore"
>>> rsc_defaults $id="rsc-options" \
>>>         resource-stickiness="0"
>>>
>>> The setup works in the basic conditions:
>>>
>>> *  After the "first" startup, the nodes start up as slaves, and
>>>    shortly after, one of them is promoted to master.
>>> *  Updates to the master are replicated properly to the slave.
>>> *  The slave accepts updates, which is wrong, but I can live with
>>>    this - I will allow connections to the master VIP only.
>>> *  If I stop the slave for some time and restart it, it catches up
>>>    with the master shortly and gets back into sync.
>>>
>>> I have, however, a serious issue:
>>>
>>> *  If I stop the current master, the slave is promoted, accepts RW
>>>    queries, and the master IP is bound to it - ALL fine.
>>> *  BUT - when I want to bring the other node back online, it simply
>>>    shows: Stopped (not installed)
>>>
>>> Online: [ db1 db2 ]
>>>
>>> db-ip-master    (ocf::heartbeat:IPaddr2):       Started db1
>>> Master/Slave Set: db-ms-mysql [db-mysql]
>>>      Masters: [ db1 ]
>>>      Stopped: [ db-mysql:1 ]
>>>
>>> Node Attributes:
>>> * Node db1:
>>>     + IP                                : 10.100.1.31
>>>     + db2-log-file-db-mysql             : mysql-bin.000021
>>>     + db2-log-pos-db-mysql              : 40730
>>>     + master-db-mysql:0                 : 3601
>>> * Node db2:
>>>     + IP                                : 10.100.1.32
>>>
>>> Failed actions:
>>>     db-mysql:0_monitor_30000 (node=db2, call=58, rc=5,
>>> status=complete): not installed
>>>
>>
>> Looking at the RA (latest from git) I'd say the problem is somewhere in the check_slave() function: either the check for replication errors, or the check for a too-high slave lag ... though for both errors you should see log entries.
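>>
>> To see by hand which branch fires, you can query the slave status on db2 directly; a quick sketch (field names are what MySQL reports):
>>
>>   mysql -e 'SHOW SLAVE STATUS\G' | egrep 'Running|Last_Errno|Last_Error|Seconds_Behind_Master'
>>
>> A non-zero Last_Errno or an unexpected Seconds_Behind_Master would match one of the two failure paths above. And once the cause is fixed, the failed monitor action has to be cleared with "crm resource cleanup db-ms-mysql" before the cluster retries the slave.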
>>
>> Regards,
>> Andreas
>>
>>
>>
>>>
>>> I checked the logs and could not find a reason why the slave on db2
>>> is not started.
>>>
>>> Any ideas, anyone?
>>>
>>> Thanks,
>>>
>>> Attila
>>>
> 
> 
> 
> _______________________________________________
> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org

