[Pacemaker] Mysql multiple slaves, slaves restarting occasionally without a reason

Andrew Beekhof andrew at beekhof.net
Thu Sep 19 07:24:18 EDT 2013


On 10/09/2013, at 4:07 PM, Attila Megyeri <amegyeri at minerva-soft.com> wrote:

> Hi,
>  
> We have a Mysql cluster which works fine when I have a single master (“A”) and slave (“B”). Failover is almost immediate and I am happy with this approach.
> When we configured two additional slaves, strange things started to happen. From time to time I notice that all slave mysql instances are restarted, and I cannot figure out why.
>  
> I tried to find out what is happening, and this is how far I got:
>  
> There is a repeating sequence in the DC, which looks like this when everything is fine:
>  
> Sep 10 01:45:42 oamgr crmd: [3385]: notice: do_state_transition: State transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE origin=handle_response ]
> Sep 10 01:45:42 oamgr crmd: [3385]: info: do_te_invoke: Processing graph 71358 (ref=pe_calc-dc-1378777542-165977) derived from /var/lib/pengine/pe-input-3179.bz2
> Sep 10 01:45:42 oamgr crmd: [3385]: notice: run_graph: ==== Transition 71358 (Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pengine/pe-input-3179.bz2): Complete
> Sep 10 01:45:42 oamgr crmd: [3385]: notice: do_state_transition: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]
> Sep 10 01:47:42 oamgr crmd: [3385]: info: crm_timer_popped: PEngine Recheck Timer (I_PE_CALC) just popped (120000ms)
> Sep 10 01:47:42 oamgr crmd: [3385]: notice: do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_TIMER_POPPED origin=crm_timer_popped ]
> Sep 10 01:47:42 oamgr crmd: [3385]: info: do_state_transition: Progressed to state S_POLICY_ENGINE after C_TIMER_POPPED
> ….
>  
> But it looks somewhat different when I see the restarts:
>  
> ….
> Sep 10 01:51:42 oamgr crmd: [3385]: notice: do_state_transition: State transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE origin=handle_response ]
> Sep 10 01:51:42 oamgr crmd: [3385]: info: do_te_invoke: Processing graph 71361 (ref=pe_calc-dc-1378777902-165980) derived from /var/lib/pengine/pe-input-3179.bz2
> Sep 10 01:51:42 oamgr crmd: [3385]: notice: run_graph: ==== Transition 71361 (Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pengine/pe-input-3179.bz2): Complete
> Sep 10 01:51:42 oamgr crmd: [3385]: notice: do_state_transition: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]
> Sep 10 01:52:45 oamgr crmd: [3385]: info: abort_transition_graph: te_update_diff:176 - Triggered transition abort (complete=1, tag=nvpair, id=status-oadb2-master-db-mysql.1, name=master-db-mysql:1, value=0, magic=NA, cib=0.4829.3480) : Transient attribute: update
> Sep 10 01:52:45 oamgr crmd: [3385]: notice: do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph ]
> Sep 10 01:52:45 oamgr crmd: [3385]: info: abort_transition_graph: te_update_diff:176 - Triggered transition abort (complete=1, tag=nvpair, id=status-oadb2-readable, name=readable, value=0, magic=NA, cib=0.4829.3481) : Transient attribute: update
> …..

I don't know much about the agent, but it looks like it is setting readable=0 and master-db-mysql:1=0, presumably due to a failure (real or imagined).
Those attribute updates then trigger pacemaker to attempt to recover the slaves.
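
To see which attributes change and when, something like the following could pull the updates out of the DC's log. It is only a rough sketch: the regular expression is fitted to the abort_transition_graph lines quoted above and may need adjusting for other pacemaker versions, and the names (transient_updates, SAMPLE) are made up for illustration.

```python
import re

# Fitted to the crmd lines quoted in this thread; other versions may log differently.
ABORT_RE = re.compile(
    r"(?P<ts>\w{3}\s+\d+\s+[\d:]{8})\s+\S+\s+crmd:.*?abort_transition_graph:.*?"
    r"id=(?P<id>[^,]+), name=(?P<name>[^,]+), value=(?P<value>[^,]+),"
)

# One line copied from the quoted log, used here as sample input.
SAMPLE = (
    "Sep 10 01:52:45 oamgr crmd: [3385]: info: abort_transition_graph: "
    "te_update_diff:176 - Triggered transition abort (complete=1, tag=nvpair, "
    "id=status-oadb2-master-db-mysql.1, name=master-db-mysql:1, value=0, "
    "magic=NA, cib=0.4829.3480) : Transient attribute: update"
)


def transient_updates(lines):
    """Yield (timestamp, attribute id, name, value) for each transition
    abort that was caused by a transient attribute update."""
    for line in lines:
        if "Transient attribute: update" not in line:
            continue
        match = ABORT_RE.search(line)
        if match:
            yield (match.group("ts"), match.group("id"),
                   match.group("name"), match.group("value"))


if __name__ == "__main__":
    for update in transient_updates([SAMPLE]):
        print(update)
```

Running this over the whole log and lining the timestamps up with the agent's own monitor output should show which check is dropping the values to 0.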

>  
> There is a transition abort, and shortly after this, the slaves are restarted:
>  
>  
> ….
> Sep 10 01:52:45 oamgr pengine: [3384]: notice: LogActions: Move    db-mysql:1   (Slave oadb2 -> huoadb1)
> Sep 10 01:52:45 oamgr pengine: [3384]: notice: LogActions: Move    db-mysql:2   (Slave huoadb1 -> oadb2)
> Sep 10 01:52:45 oamgr crmd: [3385]: notice: do_state_transition: State transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE origin=handle_response ]
> Sep 10 01:52:45 oamgr crmd: [3385]: info: do_te_invoke: Processing graph 71362 (ref=pe_calc-dc-1378777965-165981) derived from /var/lib/pengine/pe-input-3180.bz2
> Sep 10 01:52:45 oamgr crmd: [3385]: info: te_rsc_command: Initiating action 148: notify db-mysql:0_pre_notify_stop_0 on oadb1
> Sep 10 01:52:45 oamgr crmd: [3385]: info: te_rsc_command: Initiating action 150: notify db-mysql:1_pre_notify_stop_0 on oadb2
> Sep 10 01:52:45 oamgr crmd: [3385]: info: te_rsc_command: Initiating action 151: notify db-mysql:2_pre_notify_stop_0 on huoadb1
> Sep 10 01:52:45 oamgr crmd: [3385]: info: te_rsc_command: Initiating action 152: notify db-mysql:3_pre_notify_stop_0 on huoadb2
> Sep 10 01:52:45 oamgr pengine: [3384]: notice: process_pe_message: Transition 71362: PEngine Input stored in: /var/lib/pengine/pe-input-3180.bz2
> Sep 10 01:52:45 oamgr crmd: [3385]: info: te_rsc_command: Initiating action 39: stop db-mysql:1_stop_0 on oadb2
> Sep 10 01:52:45 oamgr crmd: [3385]: info: te_rsc_command: Initiating action 43: stop db-mysql:2_stop_0 on huoadb1
> ….
>  
> It appears that oadb2 and huoadb1 are swapped with each other (in terms of db-mysql:1 and db-mysql:2). Does that make any sense?
>  
> It happens only when all 4 mysql nodes are online (oadb1, oadb2, huoadb1, huoadb2). When I moved oadb2 to standby for a day, I did not see restarts.
>  
> Could someone help me troubleshoot this?

I'd look into why the agent might be setting these values, and strongly consider updating pacemaker.
1.1.7 is showing its age.
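
While digging, it may help to list what the policy engine decided each time. The following is a sketch only, assuming the pengine LogActions lines keep the format quoted in this thread; planned_actions and SAMPLE_LINES are illustrative names, not part of pacemaker.

```python
import re

# Fitted to lines such as:
#   ... pengine: [3384]: notice: LogActions: Move    db-mysql:1   (Slave oadb2 -> huoadb1)
ACTION_RE = re.compile(
    r"pengine:.*?LogActions:\s+(?P<action>\w+)\s+(?P<rsc>\S+)\s+\((?P<detail>[^)]*)\)"
)

# Two lines copied from the quoted log, used here as sample input.
SAMPLE_LINES = [
    "Sep 10 01:52:45 oamgr pengine: [3384]: notice: LogActions: Move    db-mysql:1   (Slave oadb2 -> huoadb1)",
    "Sep 10 01:52:45 oamgr pengine: [3384]: notice: LogActions: Move    db-mysql:2   (Slave huoadb1 -> oadb2)",
]


def planned_actions(lines):
    """Yield (action, resource, detail) for each LogActions line found."""
    for line in lines:
        match = ACTION_RE.search(line)
        if match:
            yield (match.group("action"), match.group("rsc"),
                   match.group("detail"))


if __name__ == "__main__":
    for action in planned_actions(SAMPLE_LINES):
        print(action)
```

Comparing these decisions against the timestamps of the attribute updates should make it clearer whether every restart follows a score change from the agent.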

>  
>  
> Mysql version is 5.1.66
> Pacemaker 1.1.7
> Corosync 1.4.2
> Mysql RA is the latest from github
>  
>  
> Thanks in advance,
>  
> Attila
>  
>  
>  
