[ClusterLabs] killing postgres wal progress in slave makes cluster crashed

Wed May 2 12:35:01 EDT 2018

On Sat, 2018-04-28 at 08:46 +0000, 范国腾 wrote:
> Hi,
> There are three nodes: node1, node2, and node3. node1 is the master;
> node2 and node3 are slaves.
> We executed “truncate table” at 14:25:36 and killed the WAL process
> on db2. Then Pacemaker on db2 went down and db1 rebooted.
> But I could not find any information in /var/log/messages. The
> following is the log; could you help find any clue?
>  
>  
> Current DC: db3 (version 1.1.15-11.el7-e174ec8) - partition with
> quorum

The DC is the node that schedules all actions, so its logs will be
helpful too. If there was any fencing, it should be mentioned there.
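
As a rough sketch of that check, the function below (a hypothetical
helper, not a Pacemaker tool) searches a log file for lines that mention
fencing. The search pattern is an assumption: it relies on
fencing-related messages containing "stonith" or "fenc", which may be
too broad or too narrow for a given setup.

```shell
# Hypothetical helper: print log lines that look fencing-related.
# Assumption: such lines contain "stonith" or "fenc" (fence, fencing, fenced).
find_fencing_lines() {
    grep -iE 'stonith|fenc' "$1"
}
```

On the DC (db3 in the quoted status output), this could be run as
`find_fencing_lines /var/log/messages`.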

I don't see anything obvious in the logs you've posted here. But if the
nodes were fenced or crashed, the most recent logs may not have been
written to disk yet.

> Last updated: Sat Apr 28 16:37:53 2018          Last change: Sat Apr
> 28 16:02:25 2018 by hacluster via crmd on db3
>  
> 3 nodes and 19 resources configured
>  
> Node db2: pending
> Online: [ db3 ]
> OFFLINE: [ db1 ]
>  
> Full list of resources:
>  
> ipmi_node1     (stonith:fence_ipmilan):        Started db3
> ipmi_node2     (stonith:fence_ipmilan):        Started db3
> ipmi_node3     (stonith:fence_ipmilan):        Stopped
> Clone Set: dlm-clone [dlm]
>      Started: [ db3 ]
>      Stopped: [ db1 db2 ]
> Clone Set: clvmd-clone [clvmd]
>      Started: [ db3 ]
>      Stopped: [ db1 db2 ]
> Clone Set: clusterfs-clone [clusterfs]
>      Started: [ db3 ]
>      Stopped: [ db1 db2 ]
> Master/Slave Set: pgsql-ha [pgsqld]
>      Masters: [ db3 ]
>      Stopped: [ db1 db2 ]
> Resource Group: mastergroup
>      master-vip (ocf::heartbeat:IPaddr2):       Started db3
>      rep-vip    (ocf::heartbeat:IPaddr2):       Started db3
> slave1-vip     (ocf::heartbeat:IPaddr2):       Stopped
> slave2-vip     (ocf::heartbeat:IPaddr2):       Stopped
>  
> Daemon Status:
>   corosync: active/disabled
>   pacemaker: active/disabled
>   pcsd: active/enabled
>  
> DB1 /var/log/messages
> 
>  
> DB2 /var/log/messages
> 
>  
>  
> DB1 postgres log
> 
> DB2 postgres log
> 
>  
> From: 徐晓菲
> Sent: 2018-04-27 9:57
> To: 邵大明 <shaodaming at highgo.com>; 范国腾 <fanguoteng at highgo.com>;
> 王亮 <wangliang at highgo.com>
> Subject: Re: Re: message+pglog
>  
> OK, got it.
>  
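> Also, about the problem from yesterday's email with the logs: I am not
> sure whether it is related to truncating the table, because, as in the
> case below, a truncate was run there too.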
>  
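> Here is another case.
> Steps:
> (1) 1 master and 2 standbys (db1 master, db2 standby, db3 standby);
>     psql -h master-vip
> (2) create tb1; an insert into tb1 is in progress
> (3) kill the streaming replication process on one standby (db3)
> (4) that standby restarts its streaming replication process; pcs status
>     still shows the original 1 master and 2 standbys
>
> (5) truncate tb1
> (6) kill the streaming replication process on the standby (db3) again
>     (no insert was running)
> (7) the original master db1 gets powered off
> (8) run pcs status and check the processes on db2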
> [root at sds2 ~]# pcs status
> Cluster name: hgpurog
> Stack: corosync
> Current DC: db2 (version 1.1.15-11.el7-e174ec8) - partition with
> quorum
> Last updated: Fri Apr 27 09:47:15 2018      Last change: Fri Apr 27
> 09:28:01 2018 by root via crm_attribute on db1
>  
> 3 nodes and 19 resources configured
>  
> Node db3: pending
> Online: [ db2 ]
> OFFLINE: [ db1 ]
>  
> Full list of resources:
>  
>  ipmi_node1    (stonith:fence_ipmilan):    Started db2
>  ipmi_node2    (stonith:fence_ipmilan):    Stopped
>  ipmi_node3    (stonith:fence_ipmilan):    Started db2
>  Clone Set: dlm-clone [dlm]
>      Started: [ db2 ]
>      Stopped: [ db1 db3 ]
>  Clone Set: clvmd-clone [clvmd]
>      Started: [ db2 ]
>      Stopped: [ db1 db3 ]
>  Clone Set: clusterfs-clone [clusterfs]
>      Started: [ db2 ]
>      Stopped: [ db1 db3 ]
>  Master/Slave Set: pgsql-ha [pgsqld]
>      Slaves: [ db2 ]
>      Stopped: [ db1 db3 ]
>  Resource Group: mastergroup
>      master-vip  (ocf::heartbeat:IPaddr2):   Stopped
>      rep-vip (ocf::heartbeat:IPaddr2):   Stopped
>  slave1-vip    (ocf::heartbeat:IPaddr2):   Stopped
>  slave2-vip    (ocf::heartbeat:IPaddr2):   Stopped
>  
> Failed Actions:
> * pgsqld_promote_0 on db2 'unknown error' (1): call=94, status=Timed
> Out, exitreason='none',
>     last-rc-change='Fri Apr 27 09:36:06 2018', queued=0ms,
> exec=300002ms
>  
>  
> Daemon Status:
>   corosync: active/disabled
>   pacemaker: active/disabled
>   pcsd: active/enabled
> [root at sds2 ~]# 
>  
> [highgo at sds2 data]$ ps -ef|grep postgres
> highgo   29499 28255  0 09:51 pts/1    00:00:00 grep --color=auto
> postgres
>  
> pcs status and process check on db3:
> [root at sds3 ~]# pcs status
> Error: cluster is not currently running on this node
>  
> [root at sds3 ~]# ps -ef |grep postgres
> highgo    4388     1  0 09:19 ?        00:00:13
> /home/highgo/hgdb/bin/postgres -D /home/highgo/hgdb/data
> highgo    4449  4388  0 09:19 ?        00:00:00 postgres: logger
> process   
> highgo   10723  4388  2 09:28 ?        00:00:35 postgres: startup
> process   recovering 0000000900000000000000EF
> highgo   10732  4388  0 09:28 ?        00:00:00 postgres:
> checkpointer process   
> highgo   10733  4388  0 09:28 ?        00:00:00 postgres: writer
> process   
> highgo   11261  4388  0 09:28 ?        00:00:00 postgres: stats
> collector process   
> highgo   12229  4388  0 09:50 ?        00:00:00 postgres: wal
> receiver process   
> root     12231 17313  0 09:50 pts/0    00:00:00 grep --color=auto
> postgres
> [root at sds3 ~]# 
>  
>  
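> Best wishes for your work!
> ----------------------------------
> 徐晓菲, Product Testing Department
> 瀚高基础软件股份有限公司
> Website: www.highgo.com
> Address: 20th Floor, Mingsheng Building, No. 2117 Xinluo Street,
> High-tech Zone, Jinan
> Mobile: 183-6307-3951  Email: xuxiaofei at highgo.com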
>  
>  
> From: shaodaming at highgo.com
> Sent: 2018-04-27 09:37
> To: xuxiaofei at highgo.com; fanguoteng; 王亮
> Subject: Re: Re: message+pglog
> hi, xiaofei
>  
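> "Crossover" means the following. Suppose two machines act as clients.
> One machine opens 400 clients that access database 1 on standby 1.
> The other machine opens 400 clients that access database 2 on standby 2.
> 10% crossover means 360 clients access database 1 on standby 1 and 40
> access database 2 on standby 1;
> likewise, 360 access database 2 on standby 2 and 40 access database 1
> on standby 2.
> Other ratios change proportionally in the same way.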
>  
> thanks.
> Br.
> Bret
> shaodaming at highgo.com
>  
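> From: xuxiaofei at highgo.com
> Sent: 2018-04-27 09:18
> To: 范国腾; wangliang; shaodaming
> Subject: Re: message+pglog
> Hello,
>     Does "crossover" here mean, for example, that 100% crossover is
>     sending selects at the same time, and 10% crossover is standby 1
>     reading for a while and then standby 2 reading?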
>  
>  
>  
>  
> From: xuxiaofei at highgo.com
> Sent: 2018-04-26 16:33
> To: 范国腾; wangliang; shaodaming
> Subject: message+pglog
>  
>  
>  
>  
> _______________________________________________
> Users mailing list: Users at clusterlabs.org
> https://lists.clusterlabs.org/mailman/listinfo/users
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
-- 
Ken Gaillot <kgaillot at redhat.com>