[Pacemaker] MySQL, Percona replication manager - split brain

Andrew nitr0 at seti.kr.ua
Sun Oct 26 15:18:51 EDT 2014


On 26.10.2014 17:44, Andrei Borzenkov wrote:
> On Sun, 26 Oct 2014 10:51:13 +0200
> Andrew <nitr0 at seti.kr.ua> wrote:
>
>> On 26.10.2014 08:32, Andrei Borzenkov wrote:
>>> On Sat, 25 Oct 2014 23:34:54 +0300
>>> Andrew <nitr0 at seti.kr.ua> wrote:
>>>
>>>> On 25.10.2014 22:34, Digimer wrote:
>>>>> On 25/10/14 03:32 PM, Andrew wrote:
>>>>>> Hi all.
>>>>>>
>>>>>> I use the Percona replication manager RA on a cluster (nothing
>>>>>> mission-critical currently - just Zabbix data); today, after restarting
>>>>>> the MySQL resource (crm resource restart p_mysql), I got a split-brain
>>>>>> state - for some reason MySQL started first on the ex-slave node, and the
>>>>>> ex-master started later (possibly I set too small a shutdown timeout -
>>>>>> only 120s, but I'm not sure).
>>>>>>
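
If the 120s stop timeout really is too tight, it can be raised on the
resource's stop operation. A crmsh sketch (resource name as in this thread;
the timeout value is only illustrative):

    crm configure show p_mysql   # inspect the current operation timeouts
    crm configure edit p_mysql   # then raise the stop timeout, e.g.:
    #   op stop interval=0 timeout=300s
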
> Your logs do not show resource restart - they show pacemaker restart on
> node2.
Yes, you're right. This was a pacemaker restart.
>
>>>>>> After restarting the resource on both nodes, MySQL replication seemed
>>>>>> OK - but then after ~50 min it fell into split brain again for an
>>>>>> unknown reason (no resource restart was noticed).
>>>>>>
>>>>>> In 'SHOW SLAVE STATUS' there is an error on a table caused by a
>>>>>> duplicate unique-index entry.
>>>>>>
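
The failing statement can be read straight from the slave status; a quick
sketch using the standard SHOW SLAVE STATUS error fields:

    mysql -uroot -p -e "SHOW SLAVE STATUS\G" | grep -E 'Last_(SQL_|IO_)?Err'
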
>>>>>> So I have questions:
>>>>>> 1) What caused the split brain, and how can I avoid it in the future?
>>>>> Cause:
>>>>>
>>>>> Logs?
>>>> Oct 25 13:54:13 node2 crmd[29248]:   notice: do_state_transition: State
>>>> transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC
>>>> cause=C_FSA_INTERNAL origin=abort_transition_graph ]
>>>> Oct 25 13:54:13 node2 pengine[29247]:   notice: unpack_config: On loss
>>>> of CCM Quorum: Ignore
>>>> Oct 25 13:54:13 node2 pengine[29247]:   notice: unpack_rsc_op: Operation
>>>> monitor found resource p_pgsql:0 active in master mode on node1.cluster
>>>> Oct 25 13:54:13 node2 pengine[29247]:   notice: unpack_rsc_op: Operation
>>>> monitor found resource p_mysql:1 active in master mode on node2.cluster
>>> That seems too late. The real cause is that a resource was reported as
>>> being in master state on both nodes, and that happened earlier.
>> These are different resources (pgsql and mysql).
>>
>>>>> Prevent:
>>>>>
>>>>> Fencing (aka stonith). This is why fencing is required.
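
In crmsh, enabling fencing would look roughly like the sketch below (assuming
IPMI-capable management boards; the agent choice, address and credentials are
placeholders to adapt):

    # one fence device per peer; a node must never be the one to fence itself
    crm configure primitive fence-node2 stonith:fence_ipmilan \
        params pcmk_host_list="node2.cluster" ipaddr="192.0.2.12" \
               login="admin" passwd="secret" lanplus="true" \
        op monitor interval=60s
    crm configure location l-fence-node2 fence-node2 -inf: node2.cluster
    crm configure property stonith-enabled=true

With stonith-enabled=true, a peer that stops answering is powered off before
its resources are taken over elsewhere, so two masters cannot run at once.
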
>>>> There was no node failure; just the daemon was restarted.
>>>>
>>> "Split brain" == loss of communication. It does not matter whether
>>> communication was lost because node failed or because daemon was not
>>> running. There is no way for surviving node to know, *why*
>>> communication was lost.
>>>
>> So how would stonith help in this case? The daemon will be restarted
>> after its death if that happens during a restart, and stonith will see a
>> live daemon...
>>
>> So what is the easiest split-brain recovery? Just stop the daemons and
>> copy all MySQL data from the good node to the bad one?
> There is no split-brain visible in your log. Pacemaker on node2 was
> restarted, cleanly as far as I can tell, and reintegrated back into the
> cluster. Maybe node1 "lost" node2, but that needs logs from node1.

Here is the log from the other node:

Oct 25 13:54:13 node1 pacemakerd[21773]:   notice: crm_add_logfile: 
Additional logging available in /var/log/cluster/corosync.log
Oct 25 13:54:13 node1 attrd[26079]:   notice: attrd_trigger_update: 
Sending flush op to all hosts for: master-p_mysql (56)
Oct 25 13:54:13 node1 attrd[26079]:   notice: attrd_perform_update: Sent 
update 6993: master-p_mysql=56
Oct 25 13:54:16 node1 attrd[26079]:   notice: attrd_trigger_update: 
Sending flush op to all hosts for: master-p_mysql (53)
Oct 25 13:54:16 node1 attrd[26079]:   notice: attrd_perform_update: Sent 
update 6995: master-p_mysql=53
Oct 25 13:54:16 node1 pacemakerd[22035]:   notice: crm_add_logfile: 
Additional logging available in /var/log/cluster/corosync.log
Oct 25 13:54:18 node1 attrd[26079]:   notice: attrd_trigger_update: 
Sending flush op to all hosts for: master-p_mysql (60)
Oct 25 13:54:18 node1 attrd[26079]:   notice: attrd_perform_update: Sent 
update 6997: master-p_mysql=60
Oct 25 13:54:18 node1 pacemakerd[22335]:   notice: crm_add_logfile: 
Additional logging available in /var/log/cluster/corosync.log
Oct 25 13:54:19 node1 pacemakerd[22476]:   notice: crm_add_logfile: 
Additional logging available in /var/log/cluster/corosync.log
Oct 25 13:54:19 node1 mysql(p_mysql)[22446]: INFO: Ignoring post-demote 
notification execpt for my own demotion.
Oct 25 13:54:19 node1 crmd[26081]:   notice: process_lrm_event: LRM 
operation p_mysql_notify_0 (call=2423, rc=0, cib-update=0, 
confirmed=true) ok
Oct 25 13:54:19 node1 crmd[26081]:   notice: process_lrm_event: LRM 
operation p_pgsql_notify_0 (call=2425, rc=0, cib-update=0, 
confirmed=true) ok
Oct 25 13:54:20 node1 pacemakerd[22597]:   notice: crm_add_logfile: 
Additional logging available in /var/log/cluster/corosync.log
Oct 25 13:54:20 node1 mysql(p_mysql)[22540]: INFO: Ignoring post-demote 
notification execpt for my own demotion.
Oct 25 13:54:20 node1 crmd[26081]:   notice: process_lrm_event: LRM 
operation p_mysql_notify_0 (call=2433, rc=0, cib-update=0, 
confirmed=true) ok
Oct 25 13:54:20 node1 IPaddr(ClusterIP)[22538]: INFO: Adding inet 
address 192.168.253.254/24 with broadcast address 192.168.253.255 to 
device br0
Oct 25 13:54:20 node1 IPaddr2(pgsql_reader_vip)[22539]: INFO: Adding 
inet address 192.168.253.31/24 with broadcast address 192.168.253.255 to 
device br0
Oct 25 13:54:20 node1 crmd[26081]:   notice: process_lrm_event: LRM 
operation p_pgsql_notify_0 (call=2435, rc=0, cib-update=0, 
confirmed=true) ok
Oct 25 13:54:20 node1 IPaddr(ClusterIP)[22538]: INFO: Bringing device br0 up
Oct 25 13:54:20 node1 IPaddr2(pgsql_reader_vip)[22539]: INFO: Bringing 
device br0 up
Oct 25 13:54:20 node1 IPaddr(ClusterIP)[22538]: INFO: 
/usr/libexec/heartbeat/send_arp -i 200 -r 5 -p 
/var/run/resource-agents/send_arp-192.168.253.254 br0 192.168.253.254 
auto not_used not_used
Oct 25 13:54:20 node1 IPaddr2(pgsql_reader_vip)[22539]: INFO: 
/usr/libexec/heartbeat/send_arp -i 200 -r 5 -p 
/var/run/resource-agents/send_arp-192.168.253.31 br0 192.168.253.31 auto 
not_used not_used
Oct 25 13:54:20 node1 crmd[26081]:   notice: process_lrm_event: LRM 
operation ClusterIP_start_0 (call=2429, rc=0, cib-update=653, 
confirmed=true) ok
Oct 25 13:54:20 node1 crmd[26081]:   notice: process_lrm_event: LRM 
operation pgsql_reader_vip_start_0 (call=2431, rc=0, cib-update=654, 
confirmed=true) ok
Oct 25 13:54:20 node1 crmd[26081]:   notice: process_lrm_event: LRM 
operation ClusterIP_monitor_2000 (call=2441, rc=0, cib-update=655, 
confirmed=false) ok
Oct 25 13:54:20 node1 crmd[26081]:   notice: process_lrm_event: LRM 
operation pgsql_reader_vip_monitor_10000 (call=2443, rc=0, 
cib-update=656, confirmed=false) ok
Oct 25 13:54:20 node1 mysql(p_mysql)[22808]: INFO: post-demote 
notification for node2.cluster.
Oct 25 13:54:20 node1 mysql(p_mysql)[22808]: INFO: MySQL slave has 
finished reading master binary log
Oct 25 13:54:20 node1 mysql(p_mysql)[22808]: INFO: MySQL slave has 
finished processing relay log
Oct 25 13:54:20 node1 pacemakerd[22948]:   notice: crm_add_logfile: 
Additional logging available in /var/log/cluster/corosync.log
Oct 25 13:54:20 node1 pgsql(p_pgsql)[22937]: INFO: Changing 
p_pgsql-data-status on node2.cluster : STREAMING|SYNC->DISCONNECT.
Oct 25 13:54:21 node1 crmd[26081]:   notice: process_lrm_event: LRM 
operation p_mysql_notify_0 (call=2445, rc=0, cib-update=0, 
confirmed=true) ok
Oct 25 13:54:21 node1 pgsql(p_pgsql)[22937]: INFO: Setup node2.cluster 
into async mode.
Oct 25 13:54:21 node1 pgsql(p_pgsql)[22937]: INFO: server signaled
Oct 25 13:54:21 node1 pgsql(p_pgsql)[22937]: INFO: Reload configuration 
file.
Oct 25 13:54:21 node1 pacemakerd[23094]:   notice: crm_add_logfile: 
Additional logging available in /var/log/cluster/corosync.log
Oct 25 13:54:21 node1 crmd[26081]:   notice: process_lrm_event: LRM 
operation p_pgsql_notify_0 (call=2450, rc=0, cib-update=0, 
confirmed=true) ok
Oct 25 13:54:21 node1 crmd[26081]:   notice: process_lrm_event: LRM 
operation p_mysql_notify_0 (call=2453, rc=0, cib-update=0, 
confirmed=true) ok
Oct 25 13:54:22 node1 ntpd[1526]: Listen normally on 102 br0 
192.168.253.254 UDP 123
Oct 25 13:54:22 node1 ntpd[1526]: Listen normally on 103 br0 
192.168.253.31 UDP 123
Oct 25 13:54:22 node1 ntpd[1526]: peers refreshed
Oct 25 13:54:23 node1 pacemakerd[23336]:   notice: crm_add_logfile: 
Additional logging available in /var/log/cluster/corosync.log
Oct 25 13:54:25 node1 pacemakerd[23575]:   notice: crm_add_logfile: 
Additional logging available in /var/log/cluster/corosync.log
Oct 25 13:54:27 node1 pacemakerd[23851]:   notice: crm_add_logfile: 
Additional logging available in /var/log/cluster/corosync.log
Oct 25 13:54:29 node1 pacemakerd[24006]:   notice: crm_add_logfile: 
Additional logging available in /var/log/cluster/corosync.log
Oct 25 13:54:32 node1 pacemakerd[24205]:   notice: crm_add_logfile: 
Additional logging available in /var/log/cluster/corosync.log
Oct 25 13:54:34 node1 pacemakerd[24541]:   notice: crm_add_logfile: 
Additional logging available in /var/log/cluster/corosync.log
Oct 25 13:54:36 node1 pacemakerd[24705]:   notice: crm_add_logfile: 
Additional logging available in /var/log/cluster/corosync.log
Oct 25 13:54:38 node1 pacemakerd[24912]:   notice: crm_add_logfile: 
Additional logging available in /var/log/cluster/corosync.log
Oct 25 13:54:41 node1 pacemakerd[25150]:   notice: crm_add_logfile: 
Additional logging available in /var/log/cluster/corosync.log
Oct 25 13:54:43 node1 pacemakerd[25440]:   notice: crm_add_logfile: 
Additional logging available in /var/log/cluster/corosync.log
Oct 25 13:54:45 node1 pacemakerd[25641]:   notice: crm_add_logfile: 
Additional logging available in /var/log/cluster/corosync.log
Oct 25 13:54:47 node1 pacemakerd[25863]:   notice: crm_add_logfile: 
Additional logging available in /var/log/cluster/corosync.log
Oct 25 13:54:50 node1 pacemakerd[26016]:   notice: crm_add_logfile: 
Additional logging available in /var/log/cluster/corosync.log
Oct 25 13:54:52 node1 pacemakerd[26230]:   notice: crm_add_logfile: 
Additional logging available in /var/log/cluster/corosync.log
Oct 25 13:54:54 node1 pacemakerd[26563]:   notice: crm_add_logfile: 
Additional logging available in /var/log/cluster/corosync.log
Oct 25 13:54:56 node1 pacemakerd[26780]:   notice: crm_add_logfile: 
Additional logging available in /var/log/cluster/corosync.log
Oct 25 13:54:58 node1 pacemakerd[27008]:   notice: crm_add_logfile: 
Additional logging available in /var/log/cluster/corosync.log
Oct 25 13:55:01 node1 pacemakerd[27208]:   notice: crm_add_logfile: 
Additional logging available in /var/log/cluster/corosync.log
Oct 25 13:55:03 node1 pacemakerd[27540]:   notice: crm_add_logfile: 
Additional logging available in /var/log/cluster/corosync.log
Oct 25 13:55:05 node1 pacemakerd[27696]:   notice: crm_add_logfile: 
Additional logging available in /var/log/cluster/corosync.log
Oct 25 13:55:07 node1 pacemakerd[27958]:   notice: crm_add_logfile: 
Additional logging available in /var/log/cluster/corosync.log
Oct 25 13:55:10 node1 pacemakerd[28111]:   notice: crm_add_logfile: 
Additional logging available in /var/log/cluster/corosync.log
Oct 25 13:55:12 node1 pacemakerd[28310]:   notice: crm_add_logfile: 
Additional logging available in /var/log/cluster/corosync.log
Oct 25 13:55:14 node1 pacemakerd[28641]:   notice: crm_add_logfile: 
Additional logging available in /var/log/cluster/corosync.log
Oct 25 13:55:17 node1 pacemakerd[28812]:   notice: crm_add_logfile: 
Additional logging available in /var/log/cluster/corosync.log
Oct 25 13:55:19 node1 pacemakerd[29019]:   notice: crm_add_logfile: 
Additional logging available in /var/log/cluster/corosync.log
Oct 25 13:55:21 node1 pacemakerd[29219]:   notice: crm_add_logfile: 
Additional logging available in /var/log/cluster/corosync.log
Oct 25 13:55:23 node1 pacemakerd[29551]:   notice: crm_add_logfile: 
Additional logging available in /var/log/cluster/corosync.log
Oct 25 13:55:25 node1 pacemakerd[29706]:   notice: crm_add_logfile: 
Additional logging available in /var/log/cluster/corosync.log
Oct 25 13:55:28 node1 pacemakerd[29978]:   notice: crm_add_logfile: 
Additional logging available in /var/log/cluster/corosync.log
Oct 25 13:55:30 node1 pacemakerd[30177]:   notice: crm_add_logfile: 
Additional logging available in /var/log/cluster/corosync.log
Oct 25 13:55:32 node1 pacemakerd[30420]:   notice: crm_add_logfile: 
Additional logging available in /var/log/cluster/corosync.log
Oct 25 13:55:34 node1 pacemakerd[30708]:   notice: crm_add_logfile: 
Additional logging available in /var/log/cluster/corosync.log
Oct 25 13:55:37 node1 pacemakerd[30871]:   notice: crm_add_logfile: 
Additional logging available in /var/log/cluster/corosync.log
Oct 25 13:55:38 node1 crmd[26081]:   notice: process_lrm_event: LRM 
operation p_mysql_notify_0 (call=2456, rc=0, cib-update=0, 
confirmed=true) ok
Oct 25 13:55:38 node1 crmd[26081]:   notice: process_lrm_event: LRM 
operation p_mysql_notify_0 (call=2459, rc=0, cib-update=0, 
confirmed=true) ok
Oct 25 13:55:38 node1 mysql(p_mysql)[31063]: WARNING: Attempted to unset 
the replication master on an instance that is not configured as a 
replication slave
Oct 25 13:55:38 node1 attrd[26079]:   notice: attrd_trigger_update: 
Sending flush op to all hosts for: master-p_mysql (1060)
Oct 25 13:55:38 node1 attrd[26079]:   notice: attrd_perform_update: Sent 
update 7016: master-p_mysql=1060
Oct 25 13:55:38 node1 crmd[26081]:   notice: process_lrm_event: LRM 
operation p_mysql_promote_0 (call=2462, rc=0, cib-update=657, 
confirmed=true) ok
Oct 25 13:55:38 node1 mysql(p_mysql)[31139]: INFO: This will be the new 
master, ignoring post-promote notification.
Oct 25 13:55:38 node1 crmd[26081]:   notice: process_lrm_event: LRM 
operation p_mysql_notify_0 (call=2465, rc=0, cib-update=0, 
confirmed=true) ok
Oct 25 13:55:38 node1 lrmd[26078]:   notice: operation_finished: 
p_mysql_monitor_5000:31173:stderr [ resource ms_MySQL is NOT running ]
Oct 25 13:55:38 node1 crmd[26081]:   notice: process_lrm_event: LRM 
operation p_mysql_monitor_5000 (call=2468, rc=8, cib-update=658, 
confirmed=false) master
Oct 25 13:55:38 node1 IPaddr2(mysql_writer_vip)[31260]: INFO: Adding 
inet address 192.168.253.64/24 with broadcast address 192.168.253.255 to 
device br0
Oct 25 13:55:38 node1 IPaddr2(mysql_writer_vip)[31260]: INFO: Bringing 
device br0 up
Oct 25 13:55:38 node1 IPaddr2(mysql_writer_vip)[31260]: INFO: 
/usr/libexec/heartbeat/send_arp -i 200 -r 5 -p 
/var/run/resource-agents/send_arp-192.168.253.64 br0 192.168.253.64 auto 
not_used not_used
Oct 25 13:55:38 node1 crmd[26081]:   notice: process_lrm_event: LRM 
operation mysql_writer_vip_start_0 (call=2470, rc=0, cib-update=659, 
confirmed=true) ok
Oct 25 13:55:38 node1 crmd[26081]:   notice: process_lrm_event: LRM 
operation mysql_writer_vip_monitor_10000 (call=2474, rc=0, 
cib-update=660, confirmed=false) ok
Oct 25 13:55:38 node1 crmd[26081]:   notice: peer_update_callback: Our 
peer on the DC is dead
Oct 25 13:55:38 node1 crmd[26081]:   notice: do_state_transition: State 
transition S_NOT_DC -> S_ELECTION [ input=I_ELECTION 
cause=C_CRMD_STATUS_CALLBACK origin=peer_update_callback ]
Oct 25 13:55:38 node1 crmd[26081]:   notice: do_state_transition: State 
transition S_ELECTION -> S_INTEGRATION [ input=I_ELECTION_DC 
cause=C_FSA_INTERNAL origin=do_election_check ]
Oct 25 13:55:38 node1 cib[26076]:   notice: cib:diff: Diff: --- 0.322.7
Oct 25 13:55:38 node1 cib[26076]:   notice: cib:diff: Diff: +++ 0.323.1 
b52f4b73ea1a2704d36b65af7364a3f8
Oct 25 13:55:38 node1 cib[26076]:   notice: cib:diff: -- <nvpair 
value="1.1.8-7.el6-394e906" id="cib-bootstrap-options-dc-version"/>
Oct 25 13:55:38 node1 cib[26076]:   notice: cib:diff: ++ <nvpair 
id="cib-bootstrap-options-dc-version" name="dc-version" 
value="1.1.10-14.el6_5.3-368c726"/>
Oct 25 13:55:38 node1 attrd[26079]:   notice: attrd_local_callback: 
Sending full refresh (origin=crmd)
Oct 25 13:55:38 node1 attrd[26079]:   notice: attrd_trigger_update: 
Sending flush op to all hosts for: master-p_mysql (1060)
Oct 25 13:55:39 node1 attrd[26079]:   notice: attrd_trigger_update: 
Sending flush op to all hosts for: master-p_pgsql (1000)
Oct 25 13:55:39 node1 attrd[26079]:   notice: attrd_trigger_update: 
Sending flush op to all hosts for: p_pgsql-master-baseline 
(000000000700FB08)
Oct 25 13:55:39 node1 attrd[26079]:   notice: attrd_trigger_update: 
Sending flush op to all hosts for: readable (1)
Oct 25 13:55:39 node1 attrd[26079]:   notice: attrd_trigger_update: 
Sending flush op to all hosts for: p_pgsql-status (PRI)
Oct 25 13:55:39 node1 attrd[26079]:   notice: attrd_trigger_update: 
Sending flush op to all hosts for: probe_complete (true)
Oct 25 13:55:39 node1 pacemakerd[31397]:   notice: crm_add_logfile: 
Additional logging available in /var/log/cluster/corosync.log
Oct 25 13:55:40 node1 ntpd[1526]: Listen normally on 104 br0 
192.168.253.64 UDP 123
Oct 25 13:55:40 node1 ntpd[1526]: peers refreshed
Oct 25 13:55:41 node1 pacemakerd[31594]:   notice: crm_add_logfile: 
Additional logging available in /var/log/cluster/corosync.log
Oct 25 13:55:42 node1 attrd[26079]:   notice: attrd_local_callback: 
Sending full refresh (origin=crmd)
Oct 25 13:55:42 node1 attrd[26079]:   notice: attrd_trigger_update: 
Sending flush op to all hosts for: master-p_mysql (1060)
Oct 25 13:55:42 node1 attrd[26079]:   notice: attrd_trigger_update: 
Sending flush op to all hosts for: master-p_pgsql (1000)
Oct 25 13:55:42 node1 attrd[26079]:   notice: attrd_trigger_update: 
Sending flush op to all hosts for: p_pgsql-master-baseline 
(000000000700FB08)
Oct 25 13:55:42 node1 attrd[26079]:   notice: attrd_trigger_update: 
Sending flush op to all hosts for: readable (1)
Oct 25 13:55:42 node1 attrd[26079]:   notice: attrd_trigger_update: 
Sending flush op to all hosts for: p_pgsql-status (PRI)
Oct 25 13:55:42 node1 attrd[26079]:   notice: attrd_trigger_update: 
Sending flush op to all hosts for: probe_complete (true)
Oct 25 13:55:43 node1 pengine[26080]:   notice: unpack_config: On loss 
of CCM Quorum: Ignore
Oct 25 13:55:43 node1 pengine[26080]:   notice: unpack_rsc_op: Operation 
monitor found resource p_pgsql:0 active in master mode on node1.cluster
Oct 25 13:55:43 node1 pengine[26080]:   notice: LogActions: Start 
p_mysql:1#011(node2.cluster)
Oct 25 13:55:43 node1 pengine[26080]:   notice: LogActions: Start 
p_pgsql:1#011(node2.cluster)
Oct 25 13:55:43 node1 pengine[26080]:   notice: LogActions: Start 
p_nginx:1#011(node2.cluster)
Oct 25 13:55:43 node1 pengine[26080]:   notice: LogActions: Start 
p_perl-fpm:1#011(node2.cluster)
Oct 25 13:55:43 node1 pengine[26080]:   notice: LogActions: Move 
p_web_ip#011(Started node1.cluster -> node2.cluster)
Oct 25 13:55:43 node1 pengine[26080]:   notice: LogActions: Move 
p_radius_ip#011(Started node1.cluster -> node2.cluster)
Oct 25 13:55:43 node1 pengine[26080]:   notice: LogActions: Start 
p_radiusd:1#011(node2.cluster)
Oct 25 13:55:43 node1 pengine[26080]:   notice: process_pe_message: 
Calculated Transition 0: /var/lib/pacemaker/pengine/pe-input-547.bz2
Oct 25 13:55:43 node1 crmd[26081]:   notice: te_rsc_command: Initiating 
action 17: monitor ClusterIP_monitor_0 on node2.cluster
Oct 25 13:55:43 node1 crmd[26081]:   notice: te_rsc_command: Initiating 
action 18: monitor mysql_reader_vip_monitor_0 on node2.cluster
Oct 25 13:55:43 node1 crmd[26081]:   notice: te_rsc_command: Initiating 
action 19: monitor mysql_writer_vip_monitor_0 on node2.cluster
Oct 25 13:55:43 node1 crmd[26081]:   notice: te_rsc_command: Initiating 
action 20: monitor pgsql_reader_vip_monitor_0 on node2.cluster
Oct 25 13:55:43 node1 crmd[26081]:   notice: te_rsc_command: Initiating 
action 21: monitor pgsql_writer_vip_monitor_0 on node2.cluster
Oct 25 13:55:43 node1 crmd[26081]:   notice: te_rsc_command: Initiating 
action 22: monitor p_mysql:1_monitor_0 on node2.cluster
Oct 25 13:55:43 node1 crmd[26081]:   notice: te_rsc_command: Initiating 
action 23: monitor p_pgsql:1_monitor_0 on node2.cluster
Oct 25 13:55:43 node1 crmd[26081]:   notice: te_rsc_command: Initiating 
action 24: monitor p_nginx:1_monitor_0 on node2.cluster
Oct 25 13:55:43 node1 crmd[26081]:   notice: te_rsc_command: Initiating 
action 25: monitor p_perl-fpm:1_monitor_0 on node2.cluster
Oct 25 13:55:43 node1 crmd[26081]:   notice: te_rsc_command: Initiating 
action 26: monitor p_web_ip_monitor_0 on node2.cluster
Oct 25 13:55:43 node1 crmd[26081]:   notice: te_rsc_command: Initiating 
action 27: monitor p_radius_ip_monitor_0 on node2.cluster
Oct 25 13:55:43 node1 crmd[26081]:   notice: te_rsc_command: Initiating 
action 28: monitor p_radiusd:1_monitor_0 on node2.cluster
Oct 25 13:55:43 node1 crmd[26081]:   notice: te_rsc_command: Initiating 
action 140: notify p_mysql_pre_notify_start_0 on node1.cluster (local)
Oct 25 13:55:43 node1 crmd[26081]:   notice: te_rsc_command: Initiating 
action 151: notify p_pgsql_pre_notify_start_0 on node1.cluster (local)
Oct 25 13:55:43 node1 pacemakerd[31858]:   notice: crm_add_logfile: 
Additional logging available in /var/log/cluster/corosync.log
Oct 25 13:55:43 node1 crmd[26081]:   notice: process_lrm_event: LRM 
operation p_mysql_notify_0 (call=2477, rc=0, cib-update=0, 
confirmed=true) ok
Oct 25 13:55:43 node1 crmd[26081]:   notice: process_lrm_event: LRM 
operation p_pgsql_notify_0 (call=2479, rc=0, cib-update=0, 
confirmed=true) ok
Oct 25 13:55:43 node1 crmd[26081]:   notice: te_rsc_command: Initiating 
action 16: probe_complete probe_complete on node2.cluster - no waiting
Oct 25 13:55:43 node1 crmd[26081]:   notice: te_rsc_command: Initiating 
action 44: start p_mysql:1_start_0 on node2.cluster
Oct 25 13:55:43 node1 crmd[26081]:   notice: te_rsc_command: Initiating 
action 74: start p_pgsql:1_start_0 on node2.cluster
Oct 25 13:55:43 node1 crmd[26081]:   notice: te_rsc_command: Initiating 
action 108: start p_nginx:1_start_0 on node2.cluster
Oct 25 13:55:43 node1 crmd[26081]:   notice: te_rsc_command: Initiating 
action 121: stop p_web_ip_stop_0 on node1.cluster (local)
Oct 25 13:55:43 node1 crmd[26081]:   notice: te_rsc_command: Initiating 
action 124: stop p_radius_ip_stop_0 on node1.cluster (local)
Oct 25 13:55:43 node1 crmd[26081]:   notice: te_rsc_command: Initiating 
action 129: start p_radiusd:1_start_0 on node2.cluster
Oct 25 13:55:43 node1 pacemakerd[32056]:   notice: crm_add_logfile: 
Additional logging available in /var/log/cluster/corosync.log
Oct 25 13:55:43 node1 IPaddr2(p_radius_ip)[32032]: INFO: IP status = ok, 
IP_CIP=
Oct 25 13:55:43 node1 IPaddr2(p_web_ip)[32030]: INFO: IP status = ok, 
IP_CIP=
Oct 25 13:55:43 node1 crmd[26081]:   notice: process_lrm_event: LRM 
operation p_radius_ip_stop_0 (call=2487, rc=0, cib-update=699, 
confirmed=true) ok
Oct 25 13:55:43 node1 crmd[26081]:   notice: process_lrm_event: LRM 
operation p_web_ip_stop_0 (call=2484, rc=0, cib-update=700, 
confirmed=true) ok
Oct 25 13:55:43 node1 crmd[26081]:   notice: te_rsc_command: Initiating 
action 125: start p_radius_ip_start_0 on node2.cluster
Oct 25 13:55:43 node1 crmd[26081]:   notice: te_rsc_command: Initiating 
action 122: start p_web_ip_start_0 on node2.cluster
Oct 25 13:55:44 node1 crmd[26081]:   notice: te_rsc_command: Initiating 
action 110: monitor p_nginx:1_monitor_30000 on node2.cluster
Oct 25 13:55:44 node1 crmd[26081]:   notice: te_rsc_command: Initiating 
action 109: monitor p_nginx:1_monitor_10000 on node2.cluster
Oct 25 13:55:44 node1 crmd[26081]:   notice: te_rsc_command: Initiating 
action 111: start p_perl-fpm:1_start_0 on node2.cluster
Oct 25 13:55:44 node1 crmd[26081]:   notice: te_rsc_command: Initiating 
action 126: monitor p_radius_ip_monitor_10000 on node2.cluster
Oct 25 13:55:44 node1 crmd[26081]:   notice: te_rsc_command: Initiating 
action 123: monitor p_web_ip_monitor_10000 on node2.cluster
Oct 25 13:55:44 node1 lrmd[26078]:   notice: operation_finished: 
p_mysql_monitor_5000:31992:stderr [ resource ms_MySQL is NOT running ]
Oct 25 13:55:45 node1 ntpd[1526]: Deleting interface #100 lo, 
10.255.0.32#123, interface stats: received=0, sent=0, dropped=0, 
active_time=44998 secs
Oct 25 13:55:45 node1 ntpd[1526]: Deleting interface #18 lo, 
10.255.0.33#123, interface stats: received=0, sent=0, dropped=0, 
active_time=47428 secs
Oct 25 13:55:45 node1 ntpd[1526]: peers refreshed
Oct 25 13:55:45 node1 crmd[26081]:   notice: te_rsc_command: Initiating 
action 152: notify p_pgsql_post_notify_start_0 on node1.cluster (local)
Oct 25 13:55:45 node1 crmd[26081]:   notice: te_rsc_command: Initiating 
action 153: notify p_pgsql:1_post_notify_start_0 on node2.cluster
Oct 25 13:55:45 node1 pacemakerd[32302]:   notice: crm_add_logfile: 
Additional logging available in /var/log/cluster/corosync.log
Oct 25 13:55:45 node1 pgsql(p_pgsql)[32291]: INFO: Changing 
p_pgsql-data-status on node2.cluster : DISCONNECT->STREAMING|ASYNC.
Oct 25 13:55:45 node1 cib[26076]:   notice: cib:diff: Diff: --- 0.323.42
Oct 25 13:55:45 node1 cib[26076]:   notice: cib:diff: Diff: +++ 0.324.1 
5fa5332e25fae6e209da715c2ba76e73
Oct 25 13:55:45 node1 cib[26076]:   notice: cib:diff: -- <nvpair 
value="DISCONNECT" id="nodes-node2.cluster-p_pgsql-data-status"/>
Oct 25 13:55:45 node1 cib[26076]:   notice: cib:diff: ++ <nvpair 
id="nodes-node2.cluster-p_pgsql-data-status" name="p_pgsql-data-status" 
value="STREAMING|ASYNC"/>
Oct 25 13:55:45 node1 pgsql(p_pgsql)[32291]: INFO: Setup node2.cluster 
into sync mode.
Oct 25 13:55:45 node1 pgsql(p_pgsql)[32291]: INFO: server signaled
Oct 25 13:55:45 node1 pgsql(p_pgsql)[32291]: INFO: Reload configuration 
file.
Oct 25 13:55:45 node1 pgsql(p_pgsql)[32291]: INFO: Changing 
p_pgsql-status on node2.cluster : HS:alone->HS:async.
Oct 25 13:55:45 node1 crmd[26081]:   notice: process_lrm_event: LRM 
operation p_pgsql_notify_0 (call=2493, rc=0, cib-update=0, 
confirmed=true) ok
Oct 25 13:55:46 node1 pacemakerd[32448]:   notice: crm_add_logfile: 
Additional logging available in /var/log/cluster/corosync.log
Oct 25 13:55:46 node1 pgsql(p_pgsql)[32437]: INFO: Changing 
p_pgsql-data-status on node2.cluster : STREAMING|ASYNC->STREAMING|SYNC.
Oct 25 13:55:46 node1 cib[26076]:   notice: cib:diff: Diff: --- 0.324.2
Oct 25 13:55:46 node1 cib[26076]:   notice: cib:diff: Diff: +++ 0.325.1 
78a796d8af0b652875a2fff886b4b5bd
Oct 25 13:55:46 node1 cib[26076]:   notice: cib:diff: -- <nvpair 
value="STREAMING|ASYNC" id="nodes-node2.cluster-p_pgsql-data-status"/>
Oct 25 13:55:46 node1 cib[26076]:   notice: cib:diff: ++ <nvpair 
id="nodes-node2.cluster-p_pgsql-data-status" name="p_pgsql-data-status" 
value="STREAMING|SYNC"/>
Oct 25 13:55:46 node1 pgsql(p_pgsql)[32437]: INFO: Changing p_pgsql 
master score on node2.cluster : -INFINITY->100.
Oct 25 13:55:46 node1 pgsql(p_pgsql)[32437]: INFO: Changing 
p_pgsql-status on node2.cluster : HS:async->HS:sync.
Oct 25 13:55:48 node1 pacemakerd[32701]:   notice: crm_add_logfile: 
Additional logging available in /var/log/cluster/corosync.log
Oct 25 13:55:49 node1 lrmd[26078]:   notice: operation_finished: 
p_mysql_monitor_5000:444:stderr [ resource ms_MySQL is NOT running ]
Oct 25 13:55:49 node1 crmd[26081]:   notice: te_rsc_command: Initiating 
action 141: notify p_mysql_post_notify_start_0 on node1.cluster (local)
Oct 25 13:55:49 node1 crmd[26081]:   notice: te_rsc_command: Initiating 
action 142: notify p_mysql:1_post_notify_start_0 on node2.cluster
Oct 25 13:55:49 node1 crmd[26081]:   notice: process_lrm_event: LRM 
operation p_mysql_notify_0 (call=2496, rc=0, cib-update=0, 
confirmed=true) ok
Oct 25 13:55:49 node1 crmd[26081]:   notice: run_graph: Transition 0 
(Complete=50, Pending=0, Fired=0, Skipped=5, Incomplete=1, 
Source=/var/lib/pacemaker/pengine/pe-input-547.bz2): Stopped
Oct 25 13:55:49 node1 pengine[26080]:   notice: unpack_config: On loss 
of CCM Quorum: Ignore
Oct 25 13:55:49 node1 pengine[26080]:   notice: unpack_rsc_op: Operation 
monitor found resource p_pgsql:0 active in master mode on node1.cluster
Oct 25 13:55:49 node1 pengine[26080]:   notice: LogActions: Move 
mysql_reader_vip#011(Started node1.cluster -> node2.cluster)
Oct 25 13:55:49 node1 pengine[26080]:   notice: LogActions: Move 
pgsql_reader_vip#011(Started node1.cluster -> node2.cluster)
Oct 25 13:55:49 node1 pengine[26080]:   notice: process_pe_message: 
Calculated Transition 1: /var/lib/pacemaker/pengine/pe-input-548.bz2
Oct 25 13:55:49 node1 crmd[26081]:   notice: te_rsc_command: Initiating 
action 22: stop mysql_reader_vip_stop_0 on node1.cluster (local)
Oct 25 13:55:49 node1 crmd[26081]:   notice: te_rsc_command: Initiating 
action 27: stop pgsql_reader_vip_stop_0 on node1.cluster (local)
Oct 25 13:55:49 node1 crmd[26081]:   notice: te_rsc_command: Initiating 
action 38: monitor p_mysql_monitor_2000 on node2.cluster
Oct 25 13:55:49 node1 crmd[26081]:   notice: te_rsc_command: Initiating 
action 69: monitor p_pgsql_monitor_7000 on node2.cluster
Oct 25 13:55:49 node1 crmd[26081]:   notice: te_rsc_command: Initiating 
action 106: monitor p_perl-fpm_monitor_10000 on node2.cluster
Oct 25 13:55:49 node1 crmd[26081]:   notice: te_rsc_command: Initiating 
action 123: monitor p_radiusd_monitor_10000 on node2.cluster
Oct 25 13:55:49 node1 IPaddr2(mysql_reader_vip)[512]: INFO: IP status = 
ok, IP_CIP=
Oct 25 13:55:49 node1 IPaddr2(pgsql_reader_vip)[513]: INFO: IP status = 
ok, IP_CIP=
Oct 25 13:55:49 node1 crmd[26081]:   notice: process_lrm_event: LRM 
operation mysql_reader_vip_stop_0 (call=2500, rc=0, cib-update=702, 
confirmed=true) ok
Oct 25 13:55:49 node1 crmd[26081]:   notice: process_lrm_event: LRM 
operation pgsql_reader_vip_stop_0 (call=2503, rc=0, cib-update=703, 
confirmed=true) ok
Oct 25 13:55:49 node1 crmd[26081]:   notice: te_rsc_command: Initiating 
action 23: start mysql_reader_vip_start_0 on node2.cluster
Oct 25 13:55:49 node1 crmd[26081]:   notice: te_rsc_command: Initiating 
action 28: start pgsql_reader_vip_start_0 on node2.cluster
Oct 25 13:55:50 node1 crmd[26081]:   notice: te_rsc_command: Initiating 
action 24: monitor mysql_reader_vip_monitor_10000 on node2.cluster
Oct 25 13:55:50 node1 crmd[26081]:   notice: te_rsc_command: Initiating 
action 29: monitor pgsql_reader_vip_monitor_10000 on node2.cluster
Oct 25 13:55:50 node1 crmd[26081]:   notice: run_graph: Transition 1 
(Complete=11, Pending=0, Fired=0, Skipped=0, Incomplete=0, 
Source=/var/lib/pacemaker/pengine/pe-input-548.bz2): Complete
Oct 25 13:55:50 node1 pengine[26080]:   notice: unpack_config: On loss 
of CCM Quorum: Ignore
Oct 25 13:55:50 node1 pengine[26080]:   notice: unpack_rsc_op: Operation 
monitor found resource p_pgsql:0 active in master mode on node1.cluster
Oct 25 13:55:50 node1 pengine[26080]:   notice: process_pe_message: 
Calculated Transition 2: /var/lib/pacemaker/pengine/pe-input-549.bz2
Oct 25 13:55:50 node1 crmd[26081]:   notice: run_graph: Transition 2 
(Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0, 
Source=/var/lib/pacemaker/pengine/pe-input-549.bz2): Complete
Oct 25 13:55:50 node1 crmd[26081]:   notice: do_state_transition: State 
transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS 
cause=C_FSA_INTERNAL origin=notify_crmd ]
Oct 25 13:55:50 node1 pacemakerd[631]:   notice: crm_add_logfile: 
Additional logging available in /var/log/cluster/corosync.log
Oct 25 13:55:51 node1 ntpd[1526]: Deleting interface #103 br0, 
192.168.253.31#123, interface stats: received=0, sent=0, dropped=0, 
active_time=89 secs
Oct 25 13:55:51 node1 ntpd[1526]: Deleting interface #101 br0, 
192.168.253.63#123, interface stats: received=0, sent=0, dropped=0, 
active_time=42685 secs
Oct 25 13:55:51 node1 ntpd[1526]: peers refreshed
Oct 25 13:55:52 node1 crmd[26081]:   notice: do_state_transition: State 
transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC 
cause=C_FSA_INTERNAL origin=abort_transition_graph ]
Oct 25 13:55:52 node1 pengine[26080]:   notice: unpack_config: On loss 
of CCM Quorum: Ignore
Oct 25 13:55:52 node1 pengine[26080]:   notice: unpack_rsc_op: Operation 
monitor found resource p_pgsql:0 active in master mode on node1.cluster
Oct 25 13:55:52 node1 pengine[26080]:   notice: process_pe_message: 
Calculated Transition 3: /var/lib/pacemaker/pengine/pe-input-550.bz2
Oct 25 13:55:52 node1 crmd[26081]:   notice: run_graph: Transition 3 
(Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0, 
Source=/var/lib/pacemaker/pengine/pe-input-550.bz2): Complete
Oct 25 13:55:52 node1 crmd[26081]:   notice: do_state_transition: State 
transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS 
cause=C_FSA_INTERNAL origin=notify_crmd ]
Oct 25 13:55:53 node1 pacemakerd[846]:   notice: crm_add_logfile: 
Additional logging available in /var/log/cluster/corosync.log
Oct 25 13:55:54 node1 crmd[26081]:   notice: do_state_transition: State 
transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC 
cause=C_FSA_INTERNAL origin=abort_transition_graph ]
Oct 25 13:55:54 node1 pengine[26080]:   notice: unpack_config: On loss 
of CCM Quorum: Ignore
Oct 25 13:55:54 node1 pengine[26080]:   notice: unpack_rsc_op: Operation 
monitor found resource p_pgsql:0 active in master mode on node1.cluster
Oct 25 13:55:54 node1 pengine[26080]:   notice: process_pe_message: 
Calculated Transition 4: /var/lib/pacemaker/pengine/pe-input-551.bz2
Oct 25 13:55:54 node1 crmd[26081]:   notice: run_graph: Transition 4 
(Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0, 
Source=/var/lib/pacemaker/pengine/pe-input-551.bz2): Complete
Oct 25 13:55:54 node1 crmd[26081]:   notice: do_state_transition: State 
transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS 
cause=C_FSA_INTERNAL origin=notify_crmd ]
Oct 25 13:55:55 node1 pacemakerd[1123]:   notice: crm_add_logfile: 
Additional logging available in /var/log/cluster/corosync.log
Oct 25 13:55:56 node1 crmd[26081]:   notice: do_state_transition: State 
transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC 
cause=C_FSA_INTERNAL origin=abort_transition_graph ]
Oct 25 13:55:56 node1 pengine[26080]:   notice: unpack_config: On loss 
of CCM Quorum: Ignore
Oct 25 13:55:56 node1 pengine[26080]:   notice: unpack_rsc_op: Operation 
monitor found resource p_pgsql:0 active in master mode on node1.cluster
Oct 25 13:55:56 node1 pengine[26080]:   notice: LogActions: Move 
mysql_reader_vip#011(Started node2.cluster -> node1.cluster)
Oct 25 13:55:56 node1 pengine[26080]:   notice: process_pe_message: 
Calculated Transition 5: /var/lib/pacemaker/pengine/pe-input-552.bz2
Oct 25 13:55:56 node1 crmd[26081]:   notice: te_rsc_command: Initiating 
action 26: stop mysql_reader_vip_stop_0 on node2.cluster
Oct 25 13:55:56 node1 crmd[26081]:   notice: run_graph: Transition 5 
(Complete=1, Pending=0, Fired=0, Skipped=3, Incomplete=0, 
Source=/var/lib/pacemaker/pengine/pe-input-552.bz2): Stopped
Oct 25 13:55:56 node1 pengine[26080]:   notice: unpack_config: On loss 
of CCM Quorum: Ignore
Oct 25 13:55:56 node1 pengine[26080]:   notice: unpack_rsc_op: Operation 
monitor found resource p_pgsql:0 active in master mode on node1.cluster
Oct 25 13:55:56 node1 pengine[26080]:   notice: LogActions: Start 
mysql_reader_vip#011(node1.cluster)
Oct 25 13:55:56 node1 pengine[26080]:   notice: process_pe_message: 
Calculated Transition 6: /var/lib/pacemaker/pengine/pe-input-553.bz2
Oct 25 13:55:56 node1 crmd[26081]:   notice: te_rsc_command: Initiating 
action 25: start mysql_reader_vip_start_0 on node1.cluster (local)
Oct 25 13:55:57 node1 IPaddr2(mysql_reader_vip)[1299]: INFO: Adding inet 
address 192.168.253.63/24 with broadcast address 192.168.253.255 to 
device br0
Oct 25 13:55:57 node1 IPaddr2(mysql_reader_vip)[1299]: INFO: Bringing 
device br0 up
Oct 25 13:55:57 node1 IPaddr2(mysql_reader_vip)[1299]: INFO: 
/usr/libexec/heartbeat/send_arp -i 200 -r 5 -p 
/var/run/resource-agents/send_arp-192.168.253.63 br0 192.168.253.63 auto 
not_used not_used
Oct 25 13:55:57 node1 crmd[26081]:   notice: process_lrm_event: LRM 
operation mysql_reader_vip_start_0 (call=2509, rc=0, cib-update=709, 
confirmed=true) ok
Oct 25 13:55:57 node1 crmd[26081]:   notice: te_rsc_command: Initiating 
action 26: monitor mysql_reader_vip_monitor_10000 on node1.cluster (local)
Oct 25 13:55:57 node1 crmd[26081]:   notice: process_lrm_event: LRM 
operation mysql_reader_vip_monitor_10000 (call=2512, rc=0, 
cib-update=710, confirmed=false) ok
Oct 25 13:55:57 node1 crmd[26081]:   notice: run_graph: Transition 6 
(Complete=2, Pending=0, Fired=0, Skipped=0, Incomplete=0, 
Source=/var/lib/pacemaker/pengine/pe-input-553.bz2): Complete
Oct 25 13:55:57 node1 crmd[26081]:   notice: do_state_transition: State 
transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS 
cause=C_FSA_INTERNAL origin=notify_crmd ]
Oct 25 13:55:57 node1 pacemakerd[1528]:   notice: crm_add_logfile: 
Additional logging available in /var/log/cluster/corosync.log
Oct 25 13:55:58 node1 ntpd[1526]: Listen normally on 105 br0 
192.168.253.63 UDP 123
Oct 25 13:55:58 node1 ntpd[1526]: peers refreshed


> You are probably misusing "split brain" in this case. Split-brain means
> the nodes lost communication with each other, so each node is unaware of
> the state of resources on the other node. Here "nodes" means
> corosync/pacemaker, not individual resources.
I mean a MySQL database split brain - the databases on the nodes are out of sync.
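
For that kind of divergence, the usual recovery is to rebuild the bad copy
from the good one. A rough sketch (user, host and credentials are
placeholders; maintenance mode keeps pacemaker from reacting while MySQL is
handled manually):

    crm configure property maintenance-mode=true

    # on the good node: consistent dump with binlog coordinates embedded
    mysqldump -uroot -p --all-databases --single-transaction \
        --master-data=2 > /tmp/master.sql

    # on the bad node: discard stale replication state and reload
    mysql -uroot -p -e "STOP SLAVE; RESET SLAVE ALL;"
    mysql -uroot -p < /tmp/master.sql

    # re-point replication; file/position come from the commented
    # CHANGE MASTER TO line near the top of /tmp/master.sql
    mysql -uroot -p <<'SQL'
    CHANGE MASTER TO
      MASTER_HOST='<good-node-ip>',
      MASTER_USER='repl',
      MASTER_PASSWORD='<secret>',
      MASTER_LOG_FILE='<binlog-file>',
      MASTER_LOG_POS=<position>;
    START SLAVE;
    SQL

    crm configure property maintenance-mode=false
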



