[ClusterLabs] killing postgres wal progress in slave makes cluster crashed
Ken Gaillot
kgaillot at redhat.com
Wed May 2 12:35:01 EDT 2018
On Sat, 2018-04-28 at 08:46 +0000, 范国腾 wrote:
> Hi,
> There are three nodes: node1, node2, and node3. node1 is the master;
> node2 and node3 are slaves.
> We executed "truncate table" at 14:25:36 and killed the WAL process
> on db2. Then pacemaker on db2 went down and db1 rebooted.
> But I could not find any information in /var/log/messages. The
> following is the log; could you help find any clue?
>
>
> Current DC: db3 (version 1.1.15-11.el7-e174ec8) - partition with
> quorum
The DC is the node that schedules all actions, so its logs will be
helpful too. If there was any fencing, it should be mentioned there.
I don't see anything obvious in the logs you've posted here. But if the
nodes were fenced or crashed, the most recent logs may not have been
written to disk yet.
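
As a rough illustration of finding the DC (and so which node's logs to pull first), here is a minimal Python sketch that reads `pcs status` text. The function name `summarize_pcs_status` is hypothetical, the sample is trimmed from the status output quoted in this thread, and the line formats are assumed to match pacemaker 1.1.15 as shown there; other versions may print them differently.

```python
import re

def summarize_pcs_status(text):
    """Extract the DC and the node lists from `pcs status` text (hypothetical helper)."""
    summary = {"dc": None, "online": [], "offline": [], "pending": []}
    for raw in text.splitlines():
        # Drop a leading "> " so quoted mail text can be pasted in directly.
        line = re.sub(r"^\s*>\s?", "", raw).strip()
        m = re.match(r"Current DC:\s*(\S+)", line)
        if m:
            summary["dc"] = m.group(1)
            continue
        m = re.match(r"(Online|OFFLINE):\s*\[(.*?)\]", line)
        if m:
            summary[m.group(1).lower()] = m.group(2).split()
            continue
        m = re.match(r"Node (\S+): pending", line)
        if m:
            summary["pending"].append(m.group(1))
    return summary

sample = """\
> Current DC: db3 (version 1.1.15-11.el7-e174ec8) - partition with
> quorum
> Node db2: pending
> Online: [ db3 ]
> OFFLINE: [ db1 ]
"""
status = summarize_pcs_status(sample)
print(status)
```

For the first status output this reports db3 as the DC, so db3's logs are the ones to check for fencing messages.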
> Last updated: Sat Apr 28 16:37:53 2018 Last change: Sat Apr
> 28 16:02:25 2018 by hacluster via crmd on db3
>
> 3 nodes and 19 resources configured
>
> Node db2: pending
> Online: [ db3 ]
> OFFLINE: [ db1 ]
>
> Full list of resources:
>
> ipmi_node1 (stonith:fence_ipmilan): Started db3
> ipmi_node2 (stonith:fence_ipmilan): Started db3
> ipmi_node3 (stonith:fence_ipmilan): Stopped
> Clone Set: dlm-clone [dlm]
> Started: [ db3 ]
> Stopped: [ db1 db2 ]
> Clone Set: clvmd-clone [clvmd]
> Started: [ db3 ]
> Stopped: [ db1 db2 ]
> Clone Set: clusterfs-clone [clusterfs]
> Started: [ db3 ]
> Stopped: [ db1 db2 ]
> Master/Slave Set: pgsql-ha [pgsqld]
> Masters: [ db3 ]
> Stopped: [ db1 db2 ]
> Resource Group: mastergroup
> master-vip (ocf::heartbeat:IPaddr2): Started db3
> rep-vip (ocf::heartbeat:IPaddr2): Started db3
> slave1-vip (ocf::heartbeat:IPaddr2): Stopped
> slave2-vip (ocf::heartbeat:IPaddr2): Stopped
>
> Daemon Status:
> corosync: active/disabled
> pacemaker: active/disabled
> pcsd: active/enabled
>
> DB1 /var/log/messages
>
>
> DB2 /var/log/messages
>
>
>
> DB1 postgres log
>
> DB2 postgres log
>
>
> From: 徐晓菲
> Sent: 2018-04-27 9:57
> To: 邵大明 <shaodaming at highgo.com>; 范国腾 <fanguoteng at highgo.com>;
> 王亮 <wangliang at highgo.com>
> Subject: Re: Re: message+pglog
>
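> OK, got it.
>
> Also, about the problem from yesterday's email with the logs: I am not
> sure whether it is related to truncate tb, because, like the case
> below, a truncate tb was done there too.
>
> Here is another case:
> Steps:
> (1) 1 master and 2 standbys (db1 master, db2 standby, db3 standby);
> psql -h master-vip
> (2) create tb1; insert into tb1 in progress
> (3) kill the streaming replication process on one standby (db3)
> (4) that standby restarts its streaming replication process; pcs status
> still shows the original 1 master and 2 standbys
>
> (5) truncate tb1
> (6) kill the streaming replication process on a standby (db3) again
> (no insert running)
> (7) the original master db1 is powered off
> (8) run pcs status and check the processes on db2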
> [root at sds2 ~]# pcs status
> Cluster name: hgpurog
> Stack: corosync
> Current DC: db2 (version 1.1.15-11.el7-e174ec8) - partition with
> quorum
> Last updated: Fri Apr 27 09:47:15 2018 Last change: Fri Apr 27
> 09:28:01 2018 by root via crm_attribute on db1
>
> 3 nodes and 19 resources configured
>
> Node db3: pending
> Online: [ db2 ]
> OFFLINE: [ db1 ]
>
> Full list of resources:
>
> ipmi_node1 (stonith:fence_ipmilan): Started db2
> ipmi_node2 (stonith:fence_ipmilan): Stopped
> ipmi_node3 (stonith:fence_ipmilan): Started db2
> Clone Set: dlm-clone [dlm]
> Started: [ db2 ]
> Stopped: [ db1 db3 ]
> Clone Set: clvmd-clone [clvmd]
> Started: [ db2 ]
> Stopped: [ db1 db3 ]
> Clone Set: clusterfs-clone [clusterfs]
> Started: [ db2 ]
> Stopped: [ db1 db3 ]
> Master/Slave Set: pgsql-ha [pgsqld]
> Slaves: [ db2 ]
> Stopped: [ db1 db3 ]
> Resource Group: mastergroup
> master-vip (ocf::heartbeat:IPaddr2): Stopped
> rep-vip (ocf::heartbeat:IPaddr2): Stopped
> slave1-vip (ocf::heartbeat:IPaddr2): Stopped
> slave2-vip (ocf::heartbeat:IPaddr2): Stopped
>
> Failed Actions:
> * pgsqld_promote_0 on db2 'unknown error' (1): call=94, status=Timed
> Out, exitreason='none',
> last-rc-change='Fri Apr 27 09:36:06 2018', queued=0ms,
> exec=300002ms
>
>
> Daemon Status:
> corosync: active/disabled
> pacemaker: active/disabled
> pcsd: active/enabled
> [root at sds2 ~]#
>
> [highgo at sds2 data]$ ps -ef|grep postgres
> highgo 29499 28255 0 09:51 pts/1 00:00:00 grep --color=auto
> postgres
>
> pcs status and process check on db3
> [root at sds3 ~]# pcs status
> Error: cluster is not currently running on this node
>
> [root at sds3 ~]# ps -ef |grep postgres
> highgo 4388 1 0 09:19 ? 00:00:13
> /home/highgo/hgdb/bin/postgres -D /home/highgo/hgdb/data
> highgo 4449 4388 0 09:19 ? 00:00:00 postgres: logger
> process
> highgo 10723 4388 2 09:28 ? 00:00:35 postgres: startup
> process recovering 0000000900000000000000EF
> highgo 10732 4388 0 09:28 ? 00:00:00 postgres:
> checkpointer process
> highgo 10733 4388 0 09:28 ? 00:00:00 postgres: writer
> process
> highgo 11261 4388 0 09:28 ? 00:00:00 postgres: stats
> collector process
> highgo 12229 4388 0 09:50 ? 00:00:00 postgres: wal
> receiver process
> root 12231 17313 0 09:50 pts/0 00:00:00 grep --color=auto
> postgres
> [root at sds3 ~]#
>
>
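> Best wishes for your work!
> ----------------------------------
> 徐晓菲, Product Testing Department
> 瀚高基础软件股份有限公司
> Website: www.highgo.com
> Address: 20th Floor, Mingsheng Building, 2117 Xinluo Street, High-tech
> Zone, Jinan
> Mobile: 183-6307-3951 Email: xuxiaofei at highgo.com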
>
>
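> From: shaodaming at highgo.com
> Sent: 2018-04-27 09:37
> To: xuxiaofei at highgo.com; fanguoteng; 王亮
> Subject: Re: Re: message+pglog
> hi, xiaofei
>
> Crossover means the following. Suppose two machines act as client
> servers.
> One machine opens 400 clients that access database 1 on standby 1.
> The other machine opens 400 clients that access database 2 on standby 2.
> 10% crossover means 360 access database 1 on standby 1 and 40 access
> database 2 on standby 1;
> likewise, 360 access database 2 on standby 2 and 40 access database 1
> on standby 2.
> Other cases change proportionally in the same way.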
>
> thanks.
> Br.
> Bret
> shaodaming at highgo.com
>
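> From: xuxiaofei at highgo.com
> Sent: 2018-04-27 09:18
> To: 范国腾; wangliang; shaodaming
> Subject: Re: message+pglog
> Hello,
> Does "crossover" here mean that, for example, 100% crossover is sending
> selects at the same time, while 10% crossover is standby 1 reading for
> a while and then standby 2 reading?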
>
>
>
>
> From: xuxiaofei at highgo.com
> Sent: 2018-04-26 16:33
> To: 范国腾; wangliang; shaodaming
> Subject: message+pglog
>
>
>
>
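
For reference, the "crossover" split described in the quoted test discussion (400 clients per machine, 10% crossover giving 360 and 40) works out as below. This is only a sketch of the arithmetic; the function name `crossover_split` is hypothetical and integer percentages are assumed.

```python
def crossover_split(clients, cross_percent):
    """Return (home, cross) client counts for one client machine.

    home  = clients that stay on the machine's own database
    cross = clients redirected to the other database
    """
    cross = clients * cross_percent // 100
    return clients - cross, cross

home, cross = crossover_split(400, 10)
print(home, cross)  # 360 40
```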
--
Ken Gaillot <kgaillot at redhat.com>