[ClusterLabs] Fwd: Postgres pacemaker cluster failure
Jehan-Guillaume de Rorthais
jgdr at dalibo.com
Thu Apr 18 12:24:26 EDT 2019
On Thu, 18 Apr 2019 14:19:44 +0200
Danka Ivanović <danka.ivanovic at gmail.com> wrote:
It seems both fencing resources and your standby timed out at the same
time here:
> Apr 17 10:03:34 master pengine[12480]: warning: Processing failed op
> monitor for fencing-secondary on master: unknown error (1)
> Apr 17 10:03:34 master pengine[12480]: warning: Processing failed op
> monitor for fencing-master on secondary: unknown error (1)
> Apr 17 10:03:34 master pengine[12480]: warning: Processing failed op
> monitor for PGSQL:1 on secondary: unknown error (1)
> Apr 17 10:03:34 master pengine[12480]: warning: Forcing fencing-secondary
> away from master after 1 failures (max=1)
> Apr 17 10:03:34 master pengine[12480]: warning: Forcing fencing-master away
> from secondary after 1 failures (max=1)
> Apr 17 10:03:34 master pengine[12480]: warning: Forcing PGSQL-HA away from
> secondary after 1 failures (max=1)
> Apr 17 10:03:34 master pengine[12480]: warning: Forcing PGSQL-HA away from
> secondary after 1 failures (max=1)
Because you have "migration-threshold=1", the standby will be shut down:
> Apr 17 10:03:34 master pengine[12480]: notice: Stop PGSQL:1 (secondary)
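To illustrate why a single failed monitor is enough here, below is a minimal
sketch of the threshold rule. It is a toy model under my own assumptions, not
Pacemaker's scheduler code, and the function name `is_banned` is hypothetical.

```python
def is_banned(fail_count: int, migration_threshold: int) -> bool:
    """Toy model: a resource is forced away from a node once its fail
    count on that node reaches migration-threshold.

    A threshold of 0 is treated as "feature disabled" (never banned).
    """
    if migration_threshold <= 0:
        return False
    return fail_count >= migration_threshold


# With migration-threshold=1, the first failed monitor bans the node,
# which matches "after 1 failures (max=1)" in the logs above.
first_failure_bans = is_banned(fail_count=1, migration_threshold=1)

# With a higher threshold, the same single failure would be tolerated.
tolerated = not is_banned(fail_count=1, migration_threshold=3)
```

With a threshold of 1 on a two-node cluster, one simultaneous failure per node
leaves no eligible node at all.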
The transition is stopped because the pgsql master timed out in the meantime:
> Apr 17 10:03:40 master crmd[12481]: notice: Transition 3462 (Complete=5,
> Pending=0, Fired=0, Skipped=1, Incomplete=6,
> Source=/var/lib/pacemaker/pengine/pe-input-59.bz2): Stopped
and as you mentioned, your ldap as well:
> Apr 17 10:03:40 master nslcd[1518]: [d7e446] <group(all)> ldap_result()
> timed out
Here are the four timeout errors (2 fencings and 2 pgsql instances):
> Apr 17 10:03:40 master pengine[12480]: warning: Processing failed op
> monitor for fencing-secondary on master: unknown error (1)
> Apr 17 10:03:40 master pengine[12480]: warning: Processing failed op
> monitor for PGSQL:0 on master: unknown error (1)
> Apr 17 10:03:40 master pengine[12480]: warning: Processing failed op
> monitor for fencing-master on secondary: unknown error (1)
> Apr 17 10:03:40 master pengine[12480]: warning: Processing failed op
> monitor for PGSQL:1 on secondary: unknown error (1)
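If you want to pull these failures out of a larger log yourself, a small
sketch like the following may help. It assumes the wrapped log lines have been
joined back into single lines; the regular expression and the function name
`failed_ops` are mine, written against the messages quoted above only.

```python
import re

# Matches the pengine "Processing failed op" warnings quoted above.
FAILED_OP = re.compile(
    r"Processing failed op (?P<op>\S+) for (?P<resource>\S+) on (?P<node>\S+):"
)


def failed_ops(lines):
    """Return a (resource, node, op) tuple for each failed-op warning."""
    found = []
    for line in lines:
        match = FAILED_OP.search(line)
        if match:
            found.append((match["resource"], match["node"], match["op"]))
    return found


prefix = "Apr 17 10:03:40 master pengine[12480]: warning: Processing failed op "
sample = [
    prefix + "monitor for fencing-secondary on master: unknown error (1)",
    prefix + "monitor for PGSQL:0 on master: unknown error (1)",
    prefix + "monitor for fencing-master on secondary: unknown error (1)",
    prefix + "monitor for PGSQL:1 on secondary: unknown error (1)",
]

for resource, node, op in failed_ops(sample):
    print(f"{op} failed for {resource} on {node}")
```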
As a reaction, Pacemaker decides to stop everything because it cannot move
resources anywhere:
> Apr 17 10:03:40 master pengine[12480]: warning: Forcing PGSQL-HA away from
> master after 1 failures (max=1)
> Apr 17 10:03:40 master pengine[12480]: warning: Forcing PGSQL-HA away from
> master after 1 failures (max=1)
> Apr 17 10:03:40 master pengine[12480]: warning: Forcing fencing-secondary
> away from master after 1 failures (max=1)
> Apr 17 10:03:40 master pengine[12480]: warning: Forcing fencing-master away
> from secondary after 1 failures (max=1)
> Apr 17 10:03:40 master pengine[12480]: warning: Forcing PGSQL-HA away from
> secondary after 1 failures (max=1)
> Apr 17 10:03:40 master pengine[12480]: warning: Forcing PGSQL-HA away from
> secondary after 1 failures (max=1)
> Apr 17 10:03:40 master pengine[12480]: notice: Stop AWSVIP (master)
> Apr 17 10:03:40 master pengine[12480]: notice: Demote PGSQL:0 (Master ->
> Stopped master)
> Apr 17 10:03:40 master pengine[12480]: notice: Stop PGSQL:1 (secondary)
Now, the following lines are really not expected. Why does systemd detect
that PostgreSQL is stopped?
> Apr 17 10:03:40 master postgresql@9.5-main[32458]: Cluster is not running.
> Apr 17 10:03:40 master systemd[1]: postgresql@9.5-main.service: Control
> process exited, code=exited status=2
> Apr 17 10:03:40 master systemd[1]: postgresql@9.5-main.service: Unit
> entered failed state.
> Apr 17 10:03:40 master systemd[1]: postgresql@9.5-main.service: Failed with
> result 'exit-code'.
I suspect the service is still enabled or has been started by hand.
As soon as you set up a resource in Pacemaker, the admin should **always** ask
Pacemaker to start/stop it. Never use systemctl to handle the resource yourself.
You must disable this service in systemd.
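As a quick way to check this on each node, here is a small sketch that wraps
`systemctl is-enabled`. The helper name `unit_is_enabled` and its `run`
parameter are hypothetical, and the unit name in the usage note is inferred
from the systemd lines in your logs, so adjust it to your setup.

```python
import subprocess


def unit_is_enabled(unit, run=subprocess.run):
    """Return True if `systemctl is-enabled` reports the unit as "enabled".

    `run` defaults to subprocess.run; it is a parameter only so the
    function can be exercised without a real systemd.
    """
    result = run(
        ["systemctl", "is-enabled", unit],
        capture_output=True,
        text=True,
        check=False,  # is-enabled exits non-zero when the unit is not enabled
    )
    return result.stdout.strip() == "enabled"
```

Calling `unit_is_enabled("postgresql@9.5-main.service")` on both nodes should
return False once only Pacemaker is in charge of PostgreSQL. Other states
reported by systemctl (for example "static") also return False here, so print
the raw output if you need the exact state.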
++