[ClusterLabs] Fwd: Postgres pacemaker cluster failure

Thu Apr 18 12:24:26 EDT 2019

On Thu, 18 Apr 2019 14:19:44 +0200
Danka Ivanović <danka.ivanovic at gmail.com> wrote:

It seems both fencing resources and your standby timed out at the same
time here:

> Apr 17 10:03:34 master pengine[12480]: warning: Processing failed op
>   monitor for fencing-secondary on master: unknown error (1)
> Apr 17 10:03:34 master pengine[12480]: warning: Processing failed op
>   monitor for fencing-master on secondary: unknown error (1)
> Apr 17 10:03:34 master pengine[12480]: warning: Processing failed op
>   monitor for PGSQL:1 on secondary: unknown error (1)
> Apr 17 10:03:34 master pengine[12480]: warning: Forcing fencing-secondary
>   away from master after 1 failures (max=1)
> Apr 17 10:03:34 master pengine[12480]: warning: Forcing fencing-master away
>   from secondary after 1 failures (max=1)
> Apr 17 10:03:34 master pengine[12480]: warning: Forcing PGSQL-HA away from
>   secondary after 1 failures (max=1)
> Apr 17 10:03:34 master pengine[12480]: warning: Forcing PGSQL-HA away from
>   secondary after 1 failures (max=1)

Because you have "migration-threshold=1", a single failure is enough to force
a resource away from its node, so the standby will be shut down:

> Apr 17 10:03:34 master pengine[12480]: notice: Stop PGSQL:1 (secondary)
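
To check which threshold and failure counts Pacemaker is working with,
something like the following could be run on one node. This is only a sketch:
it assumes the Pacemaker command-line tools are installed and that the
resource is named PGSQL-HA as in the log above. If the threshold is set in the
resource defaults rather than on the resource itself, the first command may
not find it.

```shell
#!/bin/sh
# Resource name taken from the log above.
rsc="PGSQL-HA"

if command -v crm_resource >/dev/null 2>&1; then
    # Show the migration-threshold meta attribute set on the resource, if any.
    crm_resource --resource "$rsc" --meta --get-parameter migration-threshold \
        || echo "migration-threshold not found on $rsc (maybe set in rsc_defaults)"
    # One-shot cluster status (-1) including per-resource fail counts (-f).
    crm_mon -1 -f || echo "could not query cluster status"
else
    echo "Pacemaker command-line tools not found on this host"
fi
```

Once the cause of the timeouts is fixed, the recorded failures can be cleared
with "crm_resource --cleanup --resource PGSQL-HA".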

The transition is stopped because the pgsql master timed out in the meantime:

> Apr 17 10:03:40 master crmd[12481]: notice: Transition 3462 (Complete=5,
> Pending=0, Fired=0, Skipped=1, Incomplete=6,
> Source=/var/lib/pacemaker/pengine/pe-input-59.bz2): Stopped

and, as you mentioned, your LDAP as well:

> Apr 17 10:03:40 master nslcd[1518]: [d7e446] <group(all)> ldap_result()
> timed out

Here are the four timeout errors (2 fencing resources and 2 pgsql instances):

> Apr 17 10:03:40 master pengine[12480]: warning: Processing failed op
>   monitor for fencing-secondary on master: unknown error (1)
> Apr 17 10:03:40 master pengine[12480]: warning: Processing failed op
>   monitor for PGSQL:0 on master: unknown error (1)
> Apr 17 10:03:40 master pengine[12480]: warning: Processing failed op
>   monitor for fencing-master on secondary: unknown error (1)
> Apr 17 10:03:40 master pengine[12480]: warning: Processing failed op
>   monitor for PGSQL:1 on secondary: unknown error (1)

In reaction, Pacemaker decides to stop everything because it cannot move
the resources anywhere:

> Apr 17 10:03:40 master pengine[12480]: warning: Forcing PGSQL-HA away from
> master after 1 failures (max=1)
> Apr 17 10:03:40 master pengine[12480]: warning: Forcing PGSQL-HA away from
> master after 1 failures (max=1)
> Apr 17 10:03:40 master pengine[12480]: warning: Forcing fencing-secondary
> away from master after 1 failures (max=1)
> Apr 17 10:03:40 master pengine[12480]: warning: Forcing fencing-master away
> from secondary after 1 failures (max=1)
> Apr 17 10:03:40 master pengine[12480]: warning: Forcing PGSQL-HA away from
> secondary after 1 failures (max=1)
> Apr 17 10:03:40 master pengine[12480]: warning: Forcing PGSQL-HA away from
> secondary after 1 failures (max=1)
> Apr 17 10:03:40 master pengine[12480]: notice: Stop AWSVIP (master)
> Apr 17 10:03:40 master pengine[12480]: notice: Demote PGSQL:0 (Master ->
> Stopped master)
> Apr 17 10:03:40 master pengine[12480]: notice: Stop PGSQL:1 (secondary)

Now, the following lines are really not expected. Why does systemd detect that
PostgreSQL stopped?

> Apr 17 10:03:40 master postgresql@9.5-main[32458]: Cluster is not running.
> Apr 17 10:03:40 master systemd[1]: postgresql@9.5-main.service: Control
> process exited, code=exited status=2
> Apr 17 10:03:40 master systemd[1]: postgresql@9.5-main.service: Unit
> entered failed state.
> Apr 17 10:03:40 master systemd[1]: postgresql@9.5-main.service: Failed with
> result 'exit-code'.

I suspect the service is still enabled or has been started by hand.

As soon as you set up a resource in Pacemaker, the admin should **always** ask
Pacemaker to start/stop it. Never use systemctl to handle the resource yourself.

You must disable this service in systemd.
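
A minimal sketch of how that could be checked and done, assuming the unit from
the log above is postgresql@9.5-main.service. The "run" helper and the DRY_RUN
variable are hypothetical names used only for this example, so that nothing is
changed unless you explicitly ask for it.

```shell
#!/bin/sh
# Unit name as it appears in the systemd log lines above.
unit="postgresql@9.5-main.service"

# Dry run by default: print the command instead of executing it.
# Set DRY_RUN=0 to really apply the change.
DRY_RUN="${DRY_RUN:-1}"
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

if command -v systemctl >/dev/null 2>&1; then
    # Report whether systemd would start the unit on its own at boot.
    systemctl is-enabled "$unit" || true
fi

# Make sure systemd no longer starts PostgreSQL; Pacemaker is in charge.
run systemctl disable "$unit"
```

On Debian/Ubuntu style installations, writing "disabled" into the instance's
start.conf (presumably /etc/postgresql/9.5/main/start.conf here, which is an
assumption about your layout) also keeps the distribution scripts from
starting the instance.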

++