[ClusterLabs] Fwd: Postgres pacemaker cluster failure
Danka Ivanović
danka.ivanovic at gmail.com
Fri Apr 19 05:08:33 EDT 2019
Hi,
Thank you for your response.
Ok, it seems that the fencing resources and the secondary timed out at the same
time, together with ldap.
I understand that because of "migration-threshold=1", the standby tried to
recover just once and was then stopped. Is this ok, or should the threshold
be increased?
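If increasing it is the right approach, I assume it can be set per resource
with crm_resource, e.g. (untested, using the resource name from my
configuration below):
  crm_resource --resource PGSQL-HA --meta --set-parameter migration-threshold --parameter-value 3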
The master server is started with systemctl, then Pacemaker is started on the
master, where it detects the running master; when Pacemaker is then started on
the secondary, it brings up the postgres service in slave mode.
I didn't manage to start the postgres master through Pacemaker. I tested
failover with the setup like this and it works. I will try to set up postgres
so that it is run by Pacemaker, but I am concerned about the timeouts which
caused the cluster to crash. Can you help me investigate why this happened, or
what I should change in order to avoid it? For the AWS virtual IP, an AWS
secondary IP is used.
Link to the awsvip resource:
https://raw.githubusercontent.com/ClusterLabs/resource-agents/master/heartbeat/awsvip
Link to the ec2 stonith resource agent:
https://raw.githubusercontent.com/ClusterLabs/cluster-glue/master/lib/plugins/stonith/external/ec2
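For reference, the parameters and the suggested operation timeouts of each
agent can be listed on the nodes with crmsh, e.g.:
  crm ra info ocf:heartbeat:awsvip
  crm ra info stonith:external/ec2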
Output of "crm status" while the cluster works:
Stack: corosync
Current DC: postgres-ha-1 (version 1.1.14-70404b0) - partition with quorum
2 nodes and 5 resources configured
Online: [ postgres-ha-1 postgres-ha-2 ]
Full list of resources:
AWSVIP (ocf::heartbeat:awsvip): Started postgres-ha-1
Master/Slave Set: PGSQL-HA [PGSQL]
Masters: [ postgres-ha-1 ]
Slaves: [ postgres-ha-2 ]
fencing-postgres-ha-1 (stonith:external/ec2): Started postgres-ha-2
fencing-postgres-ha-2 (stonith:external/ec2): Started postgres-ha-1
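To check whether any failures are currently counted against these resources,
I assume the fail counts can be shown with:
  crm_mon -1 --failcounts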
On Thu, 18 Apr 2019 at 18:24, Jehan-Guillaume de Rorthais <jgdr at dalibo.com>
wrote:
> On Thu, 18 Apr 2019 14:19:44 +0200
> Danka Ivanović <danka.ivanovic at gmail.com> wrote:
>
> It seems you had timeouts for both fencing resources and your standby at the
> same time here:
>
> > Apr 17 10:03:34 master pengine[12480]: warning: Processing failed op
> > monitor for fencing-secondary on master: unknown error (1)
> > Apr 17 10:03:34 master pengine[12480]: warning: Processing failed op
> > monitor for fencing-master on secondary: unknown error (1)
> > Apr 17 10:03:34 master pengine[12480]: warning: Processing failed op
> > monitor for PGSQL:1 on secondary: unknown error (1)
> > Apr 17 10:03:34 master pengine[12480]: warning: Forcing fencing-secondary away from master after 1 failures (max=1)
> > Apr 17 10:03:34 master pengine[12480]: warning: Forcing fencing-master away from secondary after 1 failures (max=1)
> > Apr 17 10:03:34 master pengine[12480]: warning: Forcing PGSQL-HA away from secondary after 1 failures (max=1)
> > Apr 17 10:03:34 master pengine[12480]: warning: Forcing PGSQL-HA away from secondary after 1 failures (max=1)
>
> Because you have "migration-threshold=1", the standby will be shut down:
>
> > Apr 17 10:03:34 master pengine[12480]: notice: Stop PGSQL:1 (secondary)
>
> The transition is stopped because the pgsql master timed out in the meantime
> (see the crm_simulate note after the excerpt):
>
> > Apr 17 10:03:40 master crmd[12481]: notice: Transition 3462 (Complete=5,
> > Pending=0, Fired=0, Skipped=1, Incomplete=6,
> > Source=/var/lib/pacemaker/pengine/pe-input-59.bz2): Stopped
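>
> If you want to replay that decision offline, I believe the pe-input file
> referenced above can be fed back to crm_simulate on the DC, for example:
>
>   crm_simulate -S -x /var/lib/pacemaker/pengine/pe-input-59.bz2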
>
> and as you mentioned, your ldap as well:
>
> > Apr 17 10:03:40 master nslcd[1518]: [d7e446] <group(all)> ldap_result()
> > timed out
>
> Here are the four timeout errors (2 fencings and 2 pgsql instances):
>
> > Apr 17 10:03:40 master pengine[12480]: warning: Processing failed op
> > monitor for fencing-secondary on master: unknown error (1)
> > Apr 17 10:03:40 master pengine[12480]: warning: Processing failed op
> > monitor for PGSQL:0 on master: unknown error (1)
> > Apr 17 10:03:40 master pengine[12480]: warning: Processing failed op
> > monitor for fencing-master on secondary: unknown error (1)
> > Apr 17 10:03:40 master pengine[12480]: warning: Processing failed op
> > monitor for PGSQL:1 on secondary: unknown error (1)
>
> As a reaction, Pacemaker decides to stop everything because it cannot move
> the resources anywhere (a cleanup sketch follows the log excerpt below):
>
> > Apr 17 10:03:40 master pengine[12480]: warning: Forcing PGSQL-HA away from master after 1 failures (max=1)
> > Apr 17 10:03:40 master pengine[12480]: warning: Forcing PGSQL-HA away from master after 1 failures (max=1)
> > Apr 17 10:03:40 master pengine[12480]: warning: Forcing fencing-secondary away from master after 1 failures (max=1)
> > Apr 17 10:03:40 master pengine[12480]: warning: Forcing fencing-master away from secondary after 1 failures (max=1)
> > Apr 17 10:03:40 master pengine[12480]: warning: Forcing PGSQL-HA away from secondary after 1 failures (max=1)
> > Apr 17 10:03:40 master pengine[12480]: warning: Forcing PGSQL-HA away from secondary after 1 failures (max=1)
> > Apr 17 10:03:40 master pengine[12480]: notice: Stop AWSVIP (master)
> > Apr 17 10:03:40 master pengine[12480]: notice: Demote PGSQL:0 (Master -> Stopped master)
> > Apr 17 10:03:40 master pengine[12480]: notice: Stop PGSQL:1 (secondary)
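>
> Note that with migration-threshold=1, the resources stay banned from these
> nodes until their fail counts are cleared. Once the underlying timeouts are
> understood and fixed, a sketch of the cleanup (adjust the resource names to
> your configuration):
>
>   crm_resource --cleanup --resource PGSQL-HA
>   crm_resource --cleanup --resource fencing-postgres-ha-1
>   crm_resource --cleanup --resource fencing-postgres-ha-2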
>
> Now, the following lines are really not expected. Why does systemd detect
> that PostgreSQL stopped?
>
> > Apr 17 10:03:40 master postgresql@9.5-main[32458]: Cluster is not running.
> > Apr 17 10:03:40 master systemd[1]: postgresql@9.5-main.service: Control process exited, code=exited status=2
> > Apr 17 10:03:40 master systemd[1]: postgresql@9.5-main.service: Unit entered failed state.
> > Apr 17 10:03:40 master systemd[1]: postgresql@9.5-main.service: Failed with result 'exit-code'.
>
> I suspect the service is still enabled or has been started by hand.
>
> As soon as you set up a resource in Pacemaker, the admin should **always**
> ask Pacemaker to start/stop it. Never use systemctl to handle the resource
> yourself.
>
> You must disable this service in systemd.
>
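> For example (assuming the Debian-style template unit shown in your logs):
>
>   systemctl is-enabled postgresql@9.5-main.service
>   systemctl disable postgresql@9.5-main.service
>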
> ++