[Pacemaker] Mysql multiple slaves, slaves restarting occasionally without a reason

Attila Megyeri amegyeri at minerva-soft.com
Thu Sep 12 03:13:00 EDT 2013


No idea on this one?

Sent: Tuesday, September 10, 2013 8:07 AM
To: pacemaker at oss.clusterlabs.org
Subject: [Pacemaker] Mysql multiple slaves, slaves restarting occasionally without a reason

Hi,

We have a MySQL cluster which works fine as long as I have a single master ("A") and a single slave ("B"). Failover is almost immediate and I am happy with this setup.
When we configured two additional slaves, strange things started to happen: from time to time all slave mysql instances are restarted, and I cannot figure out why.

I tried to find out what is happening, and this is how far I got:

There is a repeating sequence in the DC, which looks like this when everything is fine:

Sep 10 01:45:42 oamgr crmd: [3385]: notice: do_state_transition: State transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE origin=handle_response ]
Sep 10 01:45:42 oamgr crmd: [3385]: info: do_te_invoke: Processing graph 71358 (ref=pe_calc-dc-1378777542-165977) derived from /var/lib/pengine/pe-input-3179.bz2
Sep 10 01:45:42 oamgr crmd: [3385]: notice: run_graph: ==== Transition 71358 (Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pengine/pe-input-3179.bz2): Complete
Sep 10 01:45:42 oamgr crmd: [3385]: notice: do_state_transition: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]
Sep 10 01:47:42 oamgr crmd: [3385]: info: crm_timer_popped: PEngine Recheck Timer (I_PE_CALC) just popped (120000ms)
Sep 10 01:47:42 oamgr crmd: [3385]: notice: do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_TIMER_POPPED origin=crm_timer_popped ]
Sep 10 01:47:42 oamgr crmd: [3385]: info: do_state_transition: Progressed to state S_POLICY_ENGINE after C_TIMER_POPPED
....
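For reference, the 120000ms timer in the log above is the PEngine recheck interval (the cluster-recheck-interval property, apparently 2 minutes on this cluster); it can be queried with:

```shell
# Query the cluster-recheck-interval property from the cluster configuration.
# This controls how often the policy engine re-evaluates the cluster even
# when nothing has changed (the "PEngine Recheck Timer" in the log).
crm_attribute --type crm_config --name cluster-recheck-interval --query
```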

But the sequence looks somewhat different around the times I see the restarts:

....
Sep 10 01:51:42 oamgr crmd: [3385]: notice: do_state_transition: State transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE origin=handle_response ]
Sep 10 01:51:42 oamgr crmd: [3385]: info: do_te_invoke: Processing graph 71361 (ref=pe_calc-dc-1378777902-165980) derived from /var/lib/pengine/pe-input-3179.bz2
Sep 10 01:51:42 oamgr crmd: [3385]: notice: run_graph: ==== Transition 71361 (Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pengine/pe-input-3179.bz2): Complete
Sep 10 01:51:42 oamgr crmd: [3385]: notice: do_state_transition: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]
Sep 10 01:52:45 oamgr crmd: [3385]: info: abort_transition_graph: te_update_diff:176 - Triggered transition abort (complete=1, tag=nvpair, id=status-oadb2-master-db-mysql.1, name=master-db-mysql:1, value=0, magic=NA, cib=0.4829.3480) : Transient attribute: update
Sep 10 01:52:45 oamgr crmd: [3385]: notice: do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph ]
Sep 10 01:52:45 oamgr crmd: [3385]: info: abort_transition_graph: te_update_diff:176 - Triggered transition abort (complete=1, tag=nvpair, id=status-oadb2-readable, name=readable, value=0, magic=NA, cib=0.4829.3481) : Transient attribute: update
.....
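Both aborts are triggered by updates to transient (reboot-lifetime) node attributes; the mysql RA maintains master-db-mysql:1 as a master preference score via crm_master, and here it drops to 0 on oadb2. The current values can be inspected per node, e.g.:

```shell
# Query the transient attributes named in the abort messages on oadb2.
# --type status selects the transient (status section) attributes that
# the resource agent updates at runtime.
crm_attribute --type status --node oadb2 --name master-db-mysql:1 --query
crm_attribute --type status --node oadb2 --name readable --query
```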

There is a transition abort, and shortly after it the slaves are restarted:


....
Sep 10 01:52:45 oamgr pengine: [3384]: notice: LogActions: Move    db-mysql:1   (Slave oadb2 -> huoadb1)
Sep 10 01:52:45 oamgr pengine: [3384]: notice: LogActions: Move    db-mysql:2   (Slave huoadb1 -> oadb2)
Sep 10 01:52:45 oamgr crmd: [3385]: notice: do_state_transition: State transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE origin=handle_response ]
Sep 10 01:52:45 oamgr crmd: [3385]: info: do_te_invoke: Processing graph 71362 (ref=pe_calc-dc-1378777965-165981) derived from /var/lib/pengine/pe-input-3180.bz2
Sep 10 01:52:45 oamgr crmd: [3385]: info: te_rsc_command: Initiating action 148: notify db-mysql:0_pre_notify_stop_0 on oadb1
Sep 10 01:52:45 oamgr crmd: [3385]: info: te_rsc_command: Initiating action 150: notify db-mysql:1_pre_notify_stop_0 on oadb2
Sep 10 01:52:45 oamgr crmd: [3385]: info: te_rsc_command: Initiating action 151: notify db-mysql:2_pre_notify_stop_0 on huoadb1
Sep 10 01:52:45 oamgr crmd: [3385]: info: te_rsc_command: Initiating action 152: notify db-mysql:3_pre_notify_stop_0 on huoadb2
Sep 10 01:52:45 oamgr pengine: [3384]: notice: process_pe_message: Transition 71362: PEngine Input stored in: /var/lib/pengine/pe-input-3180.bz2
Sep 10 01:52:45 oamgr crmd: [3385]: info: te_rsc_command: Initiating action 39: stop db-mysql:1_stop_0 on oadb2
Sep 10 01:52:45 oamgr crmd: [3385]: info: te_rsc_command: Initiating action 43: stop db-mysql:2_stop_0 on huoadb1
....

It appears that db-mysql:1 and db-mysql:2 are being swapped between oadb2 and huoadb1. Does that make any sense?
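One thing I am considering trying is adding resource stickiness, so that the policy engine prefers to leave healthy running instances where they are instead of shuffling them between nodes when scores change transiently. Roughly (not yet verified on this cluster; ms_db-mysql is an illustrative guess at the master/slave resource name):

```shell
# Give all resources a default stickiness so running instances are not
# moved between nodes just because allocation scores shift slightly.
crm configure rsc_defaults resource-stickiness=100

# Inspect the master/slave resource definition; interleave=true on the
# clone meta attributes is also supposed to matter for instance placement.
crm configure show ms_db-mysql
```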

It happens only when I have all 4 mysql nodes online (oadb1, oadb2, huoadb1, huoadb2). When I moved oadb2 to standby for a day, I did not see any restarts.
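(For the standby test I used the standard crmsh node commands:)

```shell
# Put oadb2 into standby: its resource instances are stopped or moved away.
crm node standby oadb2
# Bring it back into the cluster afterwards.
crm node online oadb2
```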

Could someone help me troubleshoot this?


MySQL version is 5.1.66
Pacemaker 1.1.7
Corosync 1.4.2
The mysql RA is the latest from GitHub


Thanks in advance,

Attila



