[ClusterLabs] Network packet loss causes Pacemaker to exit abnormally
chenhj
chjischj at 163.com
Sun Aug 28 02:15:52 UTC 2016
Hi all,
When I use the following command to simulate network packet loss on one member of my 3-node Pacemaker+Corosync cluster,
it sometimes causes Pacemaker on another node to exit.
tc qdisc add dev eth2 root netem loss 90%
Is there any way to avoid this problem?
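For reference, the impairment above can be applied, inspected, and reverted with tc (assuming eth2 is the cluster-facing interface; run as root):

```shell
# Add 90% random packet loss on eth2 (simulates a badly degraded link)
tc qdisc add dev eth2 root netem loss 90%

# Show the currently installed qdisc to confirm the rule is active
tc qdisc show dev eth2

# Remove the netem qdisc to restore normal traffic
tc qdisc del dev eth2 root netem
```

The `del` form restores the default qdisc, so the test can be repeated cleanly between runs.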
[root@node3 ~]# ps -ef|grep pacemaker
root 32540 1 0 00:57 ? 00:00:00 /usr/libexec/pacemaker/lrmd
189 32542 1 0 00:57 ? 00:00:00 /usr/libexec/pacemaker/pengine
root 33491 11491 0 00:58 pts/1 00:00:00 grep pacemaker
/var/log/cluster/corosync.log
------------------------------------------------
Aug 27 12:33:59 [46855] node3 cib: info: cib_process_request: Completed cib_modify operation for section status: OK (rc=0, origin=local/attrd/230, version=10.657.19)
Aug 27 12:33:59 corosync [CPG ] chosen downlist: sender r(0) ip(192.168.125.129) ; members(old:2 left:1)
Aug 27 12:33:59 [46849] node3 pacemakerd: info: pcmk_cpg_membership: Node 2172496064 joined group pacemakerd (counter=12.0)
Aug 27 12:33:59 [46849] node3 pacemakerd: info: pcmk_cpg_membership: Node 2172496064 still member of group pacemakerd (peer=node2, counter=12.0)
Aug 27 12:33:59 [46849] node3 pacemakerd: info: crm_update_peer_proc: pcmk_cpg_membership: Node node2[2172496064] - corosync-cpg is now online
Aug 27 12:33:59 [46849] node3 pacemakerd: info: pcmk_cpg_membership: Node 2273159360 still member of group pacemakerd (peer=node3, counter=12.1)
Aug 27 12:33:59 [46849] node3 pacemakerd: info: crm_cs_flush: Sent 0 CPG messages (1 remaining, last=19): Try again (6)
Aug 27 12:33:59 [46849] node3 pacemakerd: info: pcmk_cpg_membership: Node 2273159360 left group pacemakerd (peer=node3, counter=13.0)
Aug 27 12:33:59 [46849] node3 pacemakerd: info: crm_update_peer_proc: pcmk_cpg_membership: Node node3[2273159360] - corosync-cpg is now offline
Aug 27 12:33:59 [46849] node3 pacemakerd: info: pcmk_cpg_membership: Node 2172496064 still member of group pacemakerd (peer=node2, counter=13.0)
Aug 27 12:33:59 [46849] node3 pacemakerd: error: pcmk_cpg_membership: We're not part of CPG group 'pacemakerd' anymore!
Aug 27 12:33:59 [46849] node3 pacemakerd: error: pcmk_cpg_dispatch: Evicted from CPG membership
Aug 27 12:33:59 [46849] node3 pacemakerd: error: mcp_cpg_destroy: Connection destroyed
Aug 27 12:33:59 [46849] node3 pacemakerd: info: crm_xml_cleanup: Cleaning up memory from libxml2
Aug 27 12:33:59 [46858] node3 attrd: error: crm_ipc_read: Connection to pacemakerd failed
Aug 27 12:33:59 [46858] node3 attrd: error: mainloop_gio_callback: Connection to pacemakerd[0x1255eb0] closed (I/O condition=17)
Aug 27 12:33:59 [46858] node3 attrd: crit: attrd_cs_destroy: Lost connection to Corosync service!
Aug 27 12:33:59 [46858] node3 attrd: notice: main: Exiting...
Aug 27 12:33:59 [46858] node3 attrd: notice: main: Disconnecting client 0x12579a0, pid=46860...
Aug 27 12:33:59 [46858] node3 attrd: error: attrd_cib_connection_destroy: Connection to the CIB terminated...
Aug 27 12:33:59 corosync [pcmk ] info: pcmk_ipc_exit: Client attrd (conn=0x1955f80, async-conn=0x1955f80) left
Aug 27 12:33:59 [46856] node3 stonith-ng: error: crm_ipc_read: Connection to pacemakerd failed
Aug 27 12:33:59 [46856] node3 stonith-ng: error: mainloop_gio_callback: Connection to pacemakerd[0x2314af0] closed (I/O condition=17)
Aug 27 12:33:59 [46856] node3 stonith-ng: error: stonith_peer_cs_destroy: Corosync connection terminated
Aug 27 12:33:59 [46856] node3 stonith-ng: info: stonith_shutdown: Terminating with 1 clients
Aug 27 12:33:59 [46856] node3 stonith-ng: info: cib_connection_destroy: Connection to the CIB closed.
...
Please see the attached corosynclog.txt for the full log.
[root@node3 ~]# cat /etc/corosync/corosync.conf
totem {
    version: 2
    secauth: off
    interface {
        member {
            memberaddr: 192.168.125.134
        }
        member {
            memberaddr: 192.168.125.129
        }
        member {
            memberaddr: 192.168.125.135
        }
        ringnumber: 0
        bindnetaddr: 192.168.125.135
        mcastport: 5405
        ttl: 1
    }
    transport: udpu
}
logging {
    fileline: off
    to_logfile: yes
    to_syslog: no
    logfile: /var/log/cluster/corosync.log
    debug: off
    timestamp: on
    logger_subsys {
        subsys: AMF
        debug: off
    }
}
service {
    ver: 1
    name: pacemaker
}
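As an aside (an assumption on my part, not something established in this thread): the totem section above uses default timing, and with heavy packet loss the token can be declared lost before retransmits succeed. Corosync's totem timing parameters can be raised to make the ring more tolerant of lossy links. A hedged sketch with illustrative values only, not a tested recommendation:

```
totem {
    # ... existing settings ...

    # Token timeout in ms before a token loss is declared (default 1000)
    token: 10000

    # Number of token retransmits attempted before giving up
    token_retransmits_before_loss_const: 10

    # Consensus timeout in ms; should be larger than token
    consensus: 12000
}
```

Raising these trades slower failure detection for fewer spurious membership changes; whether that helps here would need testing against the same netem scenario.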
Environment:
[root@node3 ~]# rpm -q corosync
corosync-1.4.1-7.el6.x86_64
[root@node3 ~]# cat /etc/redhat-release
CentOS release 6.3 (Final)
[root@node3 ~]# pacemakerd -F
Pacemaker 1.1.14-1.el6 (Build: 70404b0)
Supporting v3.0.10: generated-manpages agent-manpages ascii-docs ncurses libqb-logging libqb-ipc nagios corosync-plugin cman acls
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: corosynclog.txt
URL: <https://lists.clusterlabs.org/pipermail/users/attachments/20160828/87adda1b/attachment-0003.txt>