[ClusterLabs] Fwd: Postgres pacemaker cluster failure

Danka Ivanović danka.ivanovic at gmail.com
Thu Apr 18 08:19:44 EDT 2019


Hi,

Can you help me with troubleshooting postgres pacemaker cluster failure?
Today cluster failed without promoting secondary to master. At the same
time appeared ldap time out.
Here are the logs, master was stopped by pacemaker at 10:03:40 AM UTC.
Thank you in advance.

corosync.log

Apr 17 10:03:34 master crmd[12481]: notice: State transition S_IDLE ->
S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL
origin=abort_transition_graph ]
Apr 17 10:03:34 master pengine[12480]: notice: On loss of CCM Quorum: Ignore
Apr 17 10:03:34 master pengine[12480]: warning: Processing failed op
monitor for fencing-secondary on master: unknown error (1)
Apr 17 10:03:34 master pengine[12480]: warning: Processing failed op
monitor for fencing-master on secondary: unknown error (1)
Apr 17 10:03:34 master pengine[12480]: warning: Processing failed op
monitor for PGSQL:1 on secondary: unknown error (1)
Apr 17 10:03:34 master pengine[12480]: warning: Forcing fencing-secondary
away from master after 1 failures (max=1)
Apr 17 10:03:34 master pengine[12480]: warning: Forcing fencing-master away
from secondary after 1 failures (max=1)
Apr 17 10:03:34 master pengine[12480]: notice: Recover PGSQL:1 (Slave
secondary)
Apr 17 10:03:34 master pengine[12480]: notice: Calculated Transition 3461:
/var/lib/pacemaker/pengine/pe-input-58.bz2
Apr 17 10:03:34 master pengine[12480]: notice: On loss of CCM Quorum: Ignore
Apr 17 10:03:34 master pengine[12480]: warning: Processing failed op
monitor for fencing-secondary on master: unknown error (1)
Apr 17 10:03:34 master pengine[12480]: warning: Processing failed op
monitor for fencing-master on secondary: unknown error (1)
Apr 17 10:03:34 master pengine[12480]: warning: Processing failed op
monitor for PGSQL:1 on secondary: unknown error (1)
Apr 17 10:03:34 master pengine[12480]: warning: Forcing fencing-secondary
away from master after 1 failures (max=1)
Apr 17 10:03:34 master pengine[12480]: warning: Forcing fencing-master away
from secondary after 1 failures (max=1)
Apr 17 10:03:34 master pengine[12480]: warning: Forcing PGSQL-HA away from
secondary after 1 failures (max=1)
Apr 17 10:03:34 master pengine[12480]: warning: Forcing PGSQL-HA away from
secondary after 1 failures (max=1)
Apr 17 10:03:34 master pengine[12480]: notice: Stop PGSQL:1 (secondary)
Apr 17 10:03:34 master pengine[12480]: notice: Calculated Transition 3462:
/var/lib/pacemaker/pengine/pe-input-59.bz2
Apr 17 10:03:40 master lrmd[12477]: warning: PGSQL_monitor_15000 process
(PID 32372) timed out
Apr 17 10:03:40 master lrmd[12477]: warning: PGSQL_monitor_15000:32372 -
timed out after 10000ms
Apr 17 10:03:40 master crmd[12481]: notice: Transition aborted by
PGSQL_monitor_15000 'modify' on master: Old event
(magic=2:1;8:7:8:319e4083-ccc0-440a-ae43-1bbd39275fe7, cib=0.93.14,
source=process_graph_event:605, 0)
Apr 17 10:03:40 master corosync[23321]: [QB ] IPC credentials authenticated
(23321-32400-25)
Apr 17 10:03:40 master corosync[23321]: [QB ] connecting to client [32400]
Apr 17 10:03:40 master corosync[23321]: [QB ] shm size:1048589;
real_size:1052672; rb->word_size:263168
Apr 17 10:03:40 master corosync[23321]: message repeated 2 times: [ [QB ]
shm size:1048589; real_size:1052672; rb->word_size:263168]
Apr 17 10:03:40 master corosync[23321]: [QB ] HUP conn (23321-32400-25)
Apr 17 10:03:40 master corosync[23321]: [QB ]
qb_ipcs_disconnect(23321-32400-25) state:2
Apr 17 10:03:40 master corosync[23321]: [QB ] epoll_ctl(del): Bad file
descriptor (9)
Apr 17 10:03:40 master corosync[23321]: [QB ] Free'ing ringbuffer:
/dev/shm/qb-cpg-response-23321-32400-25-header
Apr 17 10:03:40 master corosync[23321]: [QB ] Free'ing ringbuffer:
/dev/shm/qb-cpg-event-23321-32400-25-header
Apr 17 10:03:40 master corosync[23321]: [QB ] Free'ing ringbuffer:
/dev/shm/qb-cpg-request-23321-32400-25-header
Apr 17 10:03:40 master corosync[23321]: [QB ] IPC credentials authenticated
(23321-32400-25)
Apr 17 10:03:40 master corosync[23321]: [QB ] connecting to client [32400]
Apr 17 10:03:40 master corosync[23321]: [QB ] shm size:1048589;
real_size:1052672; rb->word_size:263168
Apr 17 10:03:40 master corosync[23321]: message repeated 2 times: [ [QB ]
shm size:1048589; real_size:1052672; rb->word_size:263168]
Apr 17 10:03:40 master corosync[23321]: [QB ] HUP conn (23321-32400-25)
Apr 17 10:03:40 master corosync[23321]: [QB ]
qb_ipcs_disconnect(23321-32400-25) state:2
Apr 17 10:03:40 master corosync[23321]: [QB ] epoll_ctl(del): Bad file
descriptor (9)
Apr 17 10:03:40 master corosync[23321]: [QB ] Free'ing ringbuffer:
/dev/shm/qb-cmap-response-23321-32400-25-header
Apr 17 10:03:40 master corosync[23321]: [QB ] Free'ing ringbuffer:
/dev/shm/qb-cmap-event-23321-32400-25-header
Apr 17 10:03:40 master corosync[23321]: [QB ] Free'ing ringbuffer:
/dev/shm/qb-cmap-request-23321-32400-25-header
Apr 17 10:03:40 master pgsqlms(PGSQL)[32393]: DEBUG: _get_controldata:
found: {
Apr 17 10:03:40 master pgsqlms(PGSQL)[32393]: DEBUG: pgsql_notify:
environment variables: {
Apr 17 10:03:40 master crmd[12481]: notice: Operation PGSQL_notify_0: ok
(node=master, call=52, rc=0, cib-update=0, confirmed=true)
Apr 17 10:03:40 master crmd[12481]: notice: Transition 3462 (Complete=5,
Pending=0, Fired=0, Skipped=1, Incomplete=6,
Source=/var/lib/pacemaker/pengine/pe-input-59.bz2): Stopped
Apr 17 10:03:40 master nslcd[1518]: [d7e446] <group(all)> ldap_result()
timed out
Apr 17 10:03:40 master pengine[12480]: notice: On loss of CCM Quorum: Ignore
Apr 17 10:03:40 master pengine[12480]: warning: Processing failed op
monitor for fencing-secondary on master: unknown error (1)
Apr 17 10:03:40 master pengine[12480]: warning: Processing failed op
monitor for PGSQL:0 on master: unknown error (1)
Apr 17 10:03:40 master pengine[12480]: warning: Processing failed op
monitor for fencing-master on secondary: unknown error (1)
Apr 17 10:03:40 master pengine[12480]: warning: Processing failed op
monitor for PGSQL:1 on secondary: unknown error (1)
Apr 17 10:03:40 master pengine[12480]: warning: Forcing PGSQL-HA away from
master after 1 failures (max=1)
Apr 17 10:03:40 master pengine[12480]: warning: Forcing PGSQL-HA away from
master after 1 failures (max=1)
Apr 17 10:03:40 master pengine[12480]: warning: Forcing fencing-secondary
away from master after 1 failures (max=1)
Apr 17 10:03:40 master pengine[12480]: warning: Forcing fencing-master away
from secondary after 1 failures (max=1)
Apr 17 10:03:40 master pengine[12480]: warning: Forcing PGSQL-HA away from
secondary after 1 failures (max=1)
Apr 17 10:03:40 master pengine[12480]: warning: Forcing PGSQL-HA away from
secondary after 1 failures (max=1)
Apr 17 10:03:40 master pengine[12480]: notice: Stop AWSVIP (master)
Apr 17 10:03:40 master pengine[12480]: notice: Demote PGSQL:0 (Master ->
Stopped master)
Apr 17 10:03:40 master pengine[12480]: notice: Stop PGSQL:1 (secondary)
Apr 17 10:03:40 master pengine[12480]: notice: Calculated Transition 3463:
/var/lib/pacemaker/pengine/pe-input-60.bz2
Apr 17 10:03:40 master corosync[23321]: [QB ] IPC credentials authenticated
(23321-32414-25)
Apr 17 10:03:40 master corosync[23321]: [QB ] connecting to client [32414]
Apr 17 10:03:40 master corosync[23321]: [QB ] shm size:1048589;
real_size:1052672; rb->word_size:263168
Apr 17 10:03:40 master corosync[23321]: message repeated 2 times: [ [QB ]
shm size:1048589; real_size:1052672; rb->word_size:263168]
Apr 17 10:03:40 master corosync[23321]: [QB ] HUP conn (23321-32414-25)
Apr 17 10:03:40 master corosync[23321]: [QB ]
qb_ipcs_disconnect(23321-32414-25) state:2
Apr 17 10:03:40 master corosync[23321]: [QB ] epoll_ctl(del): Bad file
descriptor (9)
Apr 17 10:03:40 master corosync[23321]: [QB ] Free'ing ringbuffer:
/dev/shm/qb-cpg-response-23321-32414-25-header
Apr 17 10:03:40 master corosync[23321]: [QB ] Free'ing ringbuffer:
/dev/shm/qb-cpg-event-23321-32414-25-header
Apr 17 10:03:40 master corosync[23321]: [QB ] Free'ing ringbuffer:
/dev/shm/qb-cpg-request-23321-32414-25-header
Apr 17 10:03:40 master corosync[23321]: [QB ] IPC credentials authenticated
(23321-32414-25)
Apr 17 10:03:40 master corosync[23321]: [QB ] connecting to client [32414]
Apr 17 10:03:40 master corosync[23321]: [QB ] shm size:1048589;
real_size:1052672; rb->word_size:263168
Apr 17 10:03:40 master corosync[23321]: message repeated 2 times: [ [QB ]
shm size:1048589; real_size:1052672; rb->word_size:263168]
Apr 17 10:03:40 master corosync[23321]: [QB ] HUP conn (23321-32414-25)
Apr 17 10:03:40 master corosync[23321]: [QB ]
qb_ipcs_disconnect(23321-32414-25) state:2
Apr 17 10:03:40 master corosync[23321]: [QB ] epoll_ctl(del): Bad file
descriptor (9)
Apr 17 10:03:40 master corosync[23321]: [QB ] Free'ing ringbuffer:
/dev/shm/qb-cmap-response-23321-32414-25-header
Apr 17 10:03:40 master corosync[23321]: [QB ] Free'ing ringbuffer:
/dev/shm/qb-cmap-event-23321-32414-25-header
Apr 17 10:03:40 master corosync[23321]: [QB ] Free'ing ringbuffer:
/dev/shm/qb-cmap-request-23321-32414-25-header
Apr 17 10:03:40 master pgsqlms(PGSQL)[32407]: DEBUG: _get_controldata:
found: {
Apr 17 10:03:40 master pgsqlms(PGSQL)[32407]: DEBUG: pgsql_notify:
environment variables: {
Apr 17 10:03:40 master pgsqlms(PGSQL)[32407]: DEBUG: _runas: launching as
"postgres" command "/usr/lib/postgresql/9.5/bin/pg_isready -h
/var/run/postgresql/ -p 5432"
Apr 17 10:03:40 master pgsqlms(PGSQL)[32407]: DEBUG: pgsql_monitor:
instance "PGSQL" is listening
Apr 17 10:03:40 master pgsqlms(PGSQL)[32407]: DEBUG: _query: SELECT
pg_is_in_recovery()
Apr 17 10:03:40 master pgsqlms(PGSQL)[32407]: DEBUG: _query: psql return
code: 0
Apr 17 10:03:40 master pgsqlms(PGSQL)[32407]: DEBUG: _query: @res
<https://github.com/res>: [
Apr 17 10:03:40 master pgsqlms(PGSQL)[32407]: DEBUG: _confirm_role:
instance PGSQL is a primary
Apr 17 10:03:40 master crmd[12481]: notice: Operation PGSQL_notify_0: ok
(node=master, call=53, rc=0, cib-update=0, confirmed=true)
Apr 17 10:03:40 master corosync[23321]: [QB ] IPC credentials authenticated
(23321-32438-25)
Apr 17 10:03:40 master corosync[23321]: [QB ] connecting to client [32438]
Apr 17 10:03:40 master corosync[23321]: [QB ] shm size:1048589;
real_size:1052672; rb->word_size:263168
Apr 17 10:03:40 master corosync[23321]: message repeated 2 times: [ [QB ]
shm size:1048589; real_size:1052672; rb->word_size:263168]
Apr 17 10:03:40 master corosync[23321]: [QB ] HUP conn (23321-32438-25)
Apr 17 10:03:40 master corosync[23321]: [QB ]
qb_ipcs_disconnect(23321-32438-25) state:2
Apr 17 10:03:40 master corosync[23321]: [QB ] epoll_ctl(del): Bad file
descriptor (9)
Apr 17 10:03:40 master corosync[23321]: [QB ] Free'ing ringbuffer:
/dev/shm/qb-cpg-response-23321-32438-25-header
Apr 17 10:03:40 master corosync[23321]: [QB ] Free'ing ringbuffer:
/dev/shm/qb-cpg-event-23321-32438-25-header
Apr 17 10:03:40 master corosync[23321]: [QB ] Free'ing ringbuffer:
/dev/shm/qb-cpg-request-23321-32438-25-header
Apr 17 10:03:40 master corosync[23321]: [QB ] IPC credentials authenticated
(23321-32438-25)
Apr 17 10:03:40 master corosync[23321]: [QB ] connecting to client [32438]
Apr 17 10:03:40 master corosync[23321]: [QB ] shm size:1048589;
real_size:1052672; rb->word_size:263168
Apr 17 10:03:40 master corosync[23321]: message repeated 2 times: [ [QB ]
shm size:1048589; real_size:1052672; rb->word_size:263168]
Apr 17 10:03:40 master corosync[23321]: [QB ] HUP conn (23321-32438-25)
Apr 17 10:03:40 master corosync[23321]: [QB ]
qb_ipcs_disconnect(23321-32438-25) state:2
Apr 17 10:03:40 master corosync[23321]: [QB ] epoll_ctl(del): Bad file
descriptor (9)
Apr 17 10:03:40 master corosync[23321]: [QB ] Free'ing ringbuffer:
/dev/shm/qb-cmap-response-23321-32438-25-header
Apr 17 10:03:40 master corosync[23321]: [QB ] Free'ing ringbuffer:
/dev/shm/qb-cmap-event-23321-32438-25-header
Apr 17 10:03:40 master corosync[23321]: [QB ] Free'ing ringbuffer:
/dev/shm/qb-cmap-request-23321-32438-25-header
Apr 17 10:03:40 master pgsqlms(PGSQL)[32431]: DEBUG: _get_controldata:
found: {
Apr 17 10:03:40 master pgsqlms(PGSQL)[32431]: DEBUG: _runas: launching as
"postgres" command "/usr/lib/postgresql/9.5/bin/pg_isready -h
/var/run/postgresql/ -p 5432"
Apr 17 10:03:40 master pgsqlms(PGSQL)[32431]: DEBUG: pgsql_monitor:
instance "PGSQL" is listening
Apr 17 10:03:40 master pgsqlms(PGSQL)[32431]: DEBUG: _query: SELECT
pg_is_in_recovery()
Apr 17 10:03:40 master pgsqlms(PGSQL)[32431]: DEBUG: _query: psql return
code: 0
Apr 17 10:03:40 master pgsqlms(PGSQL)[32431]: DEBUG: _query: @res
<https://github.com/res>: [
Apr 17 10:03:40 master pgsqlms(PGSQL)[32431]: DEBUG: _confirm_role:
instance PGSQL is a primary
Apr 17 10:03:40 master pgsqlms(PGSQL)[32431]: DEBUG: pgsql_demote: "PGSQL"
currently running as a primary
Apr 17 10:03:40 master pgsqlms(PGSQL)[32431]: DEBUG: _get_action_timeout:
known timeout: 120
Apr 17 10:03:40 master pgsqlms(PGSQL)[32431]: DEBUG: _runas: launching as
"postgres" command "/usr/lib/postgresql/9.5/bin/pg_ctl --pgdata
/var/lib/postgresql/9.5/main --mode fast -w --timeout 180 stop"
Apr 17 10:03:40 master postgresql at 9.5-main[32458]: Cluster is not running.
Apr 17 10:03:40 master systemd[1]: postgresql at 9.5-main.service: Control
process exited, code=exited status=2
Apr 17 10:03:40 master systemd[1]: postgresql at 9.5-main.service: Unit
entered failed state.
Apr 17 10:03:40 master systemd[1]: postgresql at 9.5-main.service: Failed with
result 'exit-code'.
Apr 17 10:03:41 master pgsqlms(PGSQL)[32431]: DEBUG: _runas: launching as
"postgres" command "/usr/lib/postgresql/9.5/bin/pg_isready -h
/var/run/postgresql/ -p 5432"
Apr 17 10:03:41 master pgsqlms(PGSQL)[32431]: DEBUG: pgsql_monitor:
instance "PGSQL" is not listening
Apr 17 10:03:41 master pgsqlms(PGSQL)[32431]: DEBUG: _runas: launching as
"postgres" command "/usr/lib/postgresql/9.5/bin/pg_isready -h
/var/run/postgresql/ -p 5432"
Apr 17 10:03:41 master pgsqlms(PGSQL)[32431]: DEBUG: pgsql_monitor:
instance "PGSQL" is not listening
Apr 17 10:03:41 master pgsqlms(PGSQL)[32431]: DEBUG: _get_controldata:
found: {
Apr 17 10:03:41 master pgsqlms(PGSQL)[32431]: DEBUG: pgsql_start: instance
"PGSQL" is not running, starting it as a secondary
Apr 17 10:03:41 master pgsqlms(PGSQL)[32431]: DEBUG: _create_recovery_conf:
get replication configuration from the template file
"/etc/postgresql/9.5/main/recovery.conf.pcmk"
Apr 17 10:03:41 master pgsqlms(PGSQL)[32431]: DEBUG: _create_recovery_conf:
write the replication configuration to
"/var/lib/postgresql/9.5/main/recovery.conf" file
Apr 17 10:03:41 master pgsqlms(PGSQL)[32431]: DEBUG: _get_action_timeout:
known timeout: 120
Apr 17 10:03:41 master pgsqlms(PGSQL)[32431]: DEBUG: _runas: launching as
"postgres" command "/usr/lib/postgresql/9.5/bin/pg_ctl --pgdata
/var/lib/postgresql/9.5/main -w --timeout 180 s

postgres.log

2019-04-17 10:03:40 UTC LOG: received fast shutdown request
2019-04-17 10:03:40 UTC LOG: aborting any active transactions
2019-04-17 10:03:40 UTC FATAL: terminating connection due to administrator
command
2019-04-17 10:03:40 UTC LOG: autovacuum launcher shutting down
2019-04-17 10:03:40 UTC LOG: shutting down
2019-04-17 10:03:40 UTC LOG: database system is shut down
syslog_1704.txt
<https://github.com/ClusterLabs/resource-agents/files/3089787/syslog_1704.txt>

Installed packages:

hi corosync 2.3.5-3ubuntu2.1 amd64 cluster engine daemon and utilities
ii libcorosync-common4:amd64 2.3.5-3ubuntu2.1 amd64 cluster engine common
library
hi crmsh 2.2.0-1 amd64 CRM shell for the pacemaker cluster manager
hi pacemaker 1.1.14-2ubuntu1.4 amd64 cluster resource manager
ii pacemaker-cli-utils 1.1.14-2ubuntu1.4 amd64 cluster resource manager
command line utilities
ii pacemaker-common 1.1.14-2ubuntu1.4 all cluster resource manager common
files
ii pacemaker-resource-agents 1.1.14-2ubuntu1.4 all cluster resource manager
general resource agents

OS version:

Distributor ID: Ubuntu
Description: Ubuntu 16.04.5 LTS
Release: 16.04
Codename: xenial

Cluster configuration:

node 1: secondary
attributes master-PGSQL=1000
node 2: master
attributes master-PGSQL=1001
primitive AWSVIP awsvip
params secondary_private_ip=10.x.x.x api_delay=5
primitive PGSQL pgsqlms
params pgdata="/var/lib/postgresql/9.5/main"
bindir="/usr/lib/postgresql/9.5/bin" pghost="/var/run/postgresql/"
recovery_template="/etc/postgresql/9.5/main/recovery.conf.pcmk"
start_opts="-c config_file=/etc/postgresql/9.5/main/postgresql.conf"
op start timeout=60s interval=0
op stop timeout=60s interval=0
op promote timeout=15s interval=0
op demote timeout=120s interval=0
op monitor interval=15s timeout=10s role=Master
op monitor interval=16s timeout=10s role=Slave
op notify timeout=60 interval=0
primitive fencing-master stonith:external/ec2
params port=master
op start interval=0s timeout=60s
op monitor interval=360s timeout=60s
op stop interval=0s timeout=60s
primitive fencing-secondary stonith:external/ec2
params port=secondary
op start interval=0s timeout=60s
op monitor interval=360s timeout=60s
op stop interval=0s timeout=60s
ms PGSQL-HA PGSQL
meta master-max=1 master-node-max=1 clone-max=2 clone-node-max=1
notify=true interleave=true
colocation IPAWSIP-WITH-MASTER inf: AWSVIP PGSQL-HA:Master
order demote-then-stop-ip Mandatory: *rsc_set* PGSQL-HA:demote AWSVIP:stop
symmetrical=false
location loc-fence-master fencing-master -inf: master
location loc-fence-secondary fencing-secondary -inf: secondary
order promote-then-ip Mandatory: *rsc_set* PGSQL-HA:promote AWSVIP:start
symmetrical=false
property cib-bootstrap-options:
have-watchdog=false
dc-version=1.1.14-70404b0
cluster-infrastructure=corosync
cluster-name=cgc-staging-psql
stonith-enabled=true
no-quorum-policy=ignore
last-lrm-refresh=1555497766
maintenance-mode=false
rsc_defaults rsc-options:
resource-stickiness=10
migration-threshold=1
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://lists.clusterlabs.org/pipermail/users/attachments/20190418/c3a3d44f/attachment-0001.html>


More information about the Users mailing list