[ClusterLabs] Cluster fails to start on rebooted nodes without manual fiddling...
Casey & Gina
caseyandgina at icloud.com
Mon Apr 2 16:05:09 EDT 2018
Hi!
I've set up a couple test Pacemaker/Corosync/PCS clusters on virtual machines to manage a PostgreSQL service using the PostgreSQL Automatic Failover (PAF) resource agent available here: https://clusterlabs.github.io/PAF/
The cluster comes up fine with 3 nodes and PostgreSQL running successfully. However, I've run into a nasty stumbling block when attempting to restart a node, or even when just issuing `pcs cluster stop` followed by `pcs cluster start` on a node.
The operating system on the nodes is Ubuntu 16.04.
Firstly, here's my configuration (using `crm configure show`):
------
node 1: d-gp2-dbp62-1 \
    attributes master-postgresql-10-main=1001
node 2: d-gp2-dbp62-2 \
    attributes master-postgresql-10-main=1000
node 3: d-gp2-dbp62-3 \
    attributes master-postgresql-10-main=990
primitive postgresql-10-main pgsqlms \
    params bindir="/usr/lib/postgresql/10/bin" pgdata="/var/lib/postgresql/10/main" pghost="/var/run/postgresql/10/main" pgport=5432 recovery_template="/etc/postgresql/10/main/recovery.conf.pcmk" start_opts="-c config_file=/etc/postgresql/10/main/postgresql.conf" \
    op start interval=0s timeout=60s \
    op stop interval=0s timeout=60s \
    op promote interval=0s timeout=30s \
    op demote interval=0s timeout=120s \
    op monitor interval=15s role=Master timeout=10s \
    op monitor interval=16s role=Slave timeout=10s \
    op notify interval=0s timeout=60s
primitive postgresql-master-vip IPaddr2 \
    params ip=10.124.167.174 cidr_netmask=22 \
    op start interval=0s timeout=20s \
    op stop interval=0s timeout=20s \
    op monitor interval=10s
ms postgresql-ha postgresql-10-main \
    meta notify=true target-role=Master
order demote-postgresql-then-remove-vip Mandatory: postgresql-ha:demote postgresql-master-vip:stop symmetrical=false
colocation postgresql-master-with-vip inf: postgresql-master-vip:Started postgresql-ha:Master
order promote-postgresql-then-add-vip Mandatory: postgresql-ha:promote postgresql-master-vip:start symmetrical=false
property cib-bootstrap-options: \
    have-watchdog=false \
    dc-version=1.1.14-70404b0 \
    cluster-infrastructure=corosync \
    cluster-name=d-gp2-dbp62 \
    stonith-enabled=false
rsc_defaults rsc_defaults-options: \
    migration-threshold=5 \
    resource-stickiness=10
------
With the resource running on the first node, here's the output of `pcs status`, and everything looks great:
------
Cluster name: d-gp2-dbp62
Last updated: Mon Apr 2 19:43:52 2018 Last change: Mon Apr 2 19:41:35 2018 by root via crm_resource on d-gp2-dbp62-1
Stack: corosync
Current DC: d-gp2-dbp62-3 (version 1.1.14-70404b0) - partition with quorum
3 nodes and 4 resources configured
Online: [ d-gp2-dbp62-1 d-gp2-dbp62-2 d-gp2-dbp62-3 ]
Full list of resources:
postgresql-master-vip (ocf::heartbeat:IPaddr2): Started d-gp2-dbp62-1
Master/Slave Set: postgresql-ha [postgresql-10-main]
     Masters: [ d-gp2-dbp62-1 ]
     Slaves: [ d-gp2-dbp62-2 d-gp2-dbp62-3 ]
------
Now, if I restart the second node and execute `pcs cluster start` once it's back up, the resource fails to start and I see this in the `pcs status` output:
------
* postgresql-10-main_start_0 on d-gp2-dbp62-2 'unknown error' (1): call=11, status=complete, exitreason='Instance "postgresql-10-main" failed to start (rc: 1)',
    last-rc-change='Mon Apr 2 19:50:40 2018', queued=0ms, exec=228ms
------
Looking in corosync.log, I see the following:
------
Apr 02 19:50:40 [2014] d-gp2-dbp62-2 lrmd: info: log_execute: executing - rsc:postgresql-10-main action:start call_id:11
pgsqlms(postgresql-10-main)[2092]: 2018/04/02_19:50:40 ERROR: Instance "postgresql-10-main" failed to start (rc: 1)
Apr 02 19:50:40 [2014] d-gp2-dbp62-2 lrmd: notice: operation_finished: postgresql-10-main_start_0:2092:stderr [ pg_ctl: could not start server ]
Apr 02 19:50:40 [2014] d-gp2-dbp62-2 lrmd: notice: operation_finished: postgresql-10-main_start_0:2092:stderr [ Examine the log output. ]
Apr 02 19:50:40 [2014] d-gp2-dbp62-2 lrmd: notice: operation_finished: postgresql-10-main_start_0:2092:stderr [ ocf-exit-reason:Instance "postgresql-10-main" failed to start (rc: 1) ]
Apr 02 19:50:40 [2014] d-gp2-dbp62-2 lrmd: info: log_finished: finished - rsc:postgresql-10-main action:start call_id:11 pid:2092 exit-code:1 exec-time:228ms queue-time:0ms
Apr 02 19:50:40 [2017] d-gp2-dbp62-2 crmd: info: action_synced_wait: Managed pgsqlms_meta-data_0 process 2115 exited with rc=0
Apr 02 19:50:40 [2017] d-gp2-dbp62-2 crmd: notice: process_lrm_event: Operation postgresql-10-main_start_0: unknown error (node=d-gp2-dbp62-2, call=11, rc=1, cib-update=15, confirmed=true)
Apr 02 19:50:40 [2017] d-gp2-dbp62-2 crmd: notice: process_lrm_event: d-gp2-dbp62-2-postgresql-10-main_start_0:11 [ pg_ctl: could not start server\nExamine the log output.\nocf-exit-reason:Instance "postgresql-10-main" failed to start (rc: 1)\n ]
------
It tells me to examine the PostgreSQL log output, so I look there, but I don't see anything logged at all since the server was shut down. There are no new log entries indicating that a start was even attempted. I can repeat the `pcs cluster stop`/`pcs cluster start` commands any number of times, and the problem remains.
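For reference, from the bindir/pgdata/start_opts parameters in my configuration (and the pg_ctl error on stderr above), I believe the start attempt amounts to roughly the following pg_ctl call. This is just my reconstruction; the pgsqlms agent may well add further options:
------
# My reconstruction of what the agent's start roughly amounts to, built from
# the bindir/pgdata/start_opts resource parameters above (not the exact
# command pgsqlms runs internally).
sudo -u postgres /usr/lib/postgresql/10/bin/pg_ctl start \
    -D /var/lib/postgresql/10/main \
    -o "-c config_file=/etc/postgresql/10/main/postgresql.conf"
------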
Now for the weird part: the workaround I inadvertently discovered while trying to figure out what was going wrong. First, I do a `pcs cluster stop` again. Then I start PostgreSQL via systemd using `service postgresql@10-main start`, and it comes up just fine; I then shut it back down with `service postgresql@10-main stop`, which should put everything back in the original state. Now, when I issue `pcs cluster start`, everything comes up just fine as expected with no errors!?!
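In other words, the full sequence I run on the affected node is just this (the same commands described above, nothing extra):
------
pcs cluster stop                    # take the node out of the cluster again
service postgresql@10-main start    # start PostgreSQL once via systemd - works fine
service postgresql@10-main stop     # stop it again, supposedly back to the original state
pcs cluster start                   # now the resource starts without any errors
------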
I have also seen this "unknown error" come up at other undesirable times, such as during a manual failover using `pcs cluster stop` on the primary or a `pcs resource move --master ...` command. However, once the workaround is applied to the node having issues, that node works perfectly fine until it is rebooted again.
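For completeness, by "manual failover" I mean something along these lines, with postgresql-ha being the master/slave resource from my configuration and the target node just an example:
------
# Either stop the cluster on the current master node...
pcs cluster stop
# ...or move the master role explicitly, e.g.:
pcs resource move --master postgresql-ha d-gp2-dbp62-2
------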
Can anyone explain what is happening here, and how I can fix it properly?
Best wishes,
--
Casey