[ClusterLabs] Fwd: Postgres pacemaker cluster failure

Jehan-Guillaume de Rorthais jgdr at dalibo.com
Fri Apr 19 05:46:43 EDT 2019


On Fri, 19 Apr 2019 11:08:33 +0200
Danka Ivanović <danka.ivanovic at gmail.com> wrote:

> Hi,
> Thank you for your response.
> 
> Ok, it seems that the fencing resources and the secondary timed out at the
> same time, together with ldap.
> I understand that because of "migration-threshold=1", the standby tried to
> recover just once and then was stopped. Is this ok, or should the threshold
> be increased?

It really depends on your use case.

Note that as soon as a resource hits its migration-threshold, there's an
implicit constraint forbidding it from coming back to this node until you reset
the failcount. That's why your pgsql master resource never came back anywhere.

You can also set failure-timeout if you are brave enough to automate the
failure reset. See:
https://clusterlabs.org/pacemaker/doc/en-US/Pacemaker/1.1/html/Pacemaker_Explained/s-failure-migration.html
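
If you go that way, both settings are just meta attributes on the resource. A
sketch, assuming pcs and made-up values ("PGSQL-HA" being the clone from your
logs):

  # retry up to 3 times on a node, and forget failures after 5 minutes
  pcs resource meta PGSQL-HA migration-threshold=3 failure-timeout=300s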

> The master server is started with systemctl, then pacemaker is started on the
> master, which detects the master, and then starting pacemaker on the secondary
> brings up the postgres service in slave mode.

You should not. Systemd should not mess with resources handled by Pacemaker.
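
If it is still enabled on your nodes, something like this (the unit name is
taken from the logs further down, double check it on your system) should keep
systemd out of the way at the next boot:

  sudo systemctl disable postgresql@9.5-main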

> I didn't manage to start the postgres master through pacemaker. I tested
> failover with a setup like this and it works. I will try to set up postgres to
> be run by pacemaker,

Pacemaker is supposed to start the resource itself if it is enabled in its
setup. Read this whole chapter (its end is important):
https://clusterlabs.github.io/PAF/CentOS-7-admin-cookbook.html#starting-or-stopping-the-cluster
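
In short, once the resource is defined in the cluster, everything is started
and stopped through Pacemaker. A rough sketch, assuming pcs ("PGSQL-HA" being a
placeholder for your pgsql clone):

  pcs cluster start                # start corosync/pacemaker on this node
  pcs resource enable PGSQL-HA     # let Pacemaker start PostgreSQL itself
  pcs resource disable PGSQL-HA    # ...and stop it through Pacemaker as well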

> but I am concerned about those timeouts which
> caused the cluster to crash. Can you help me investigate why this happened, or
> what I should change in order to avoid it? For the AWS virtual IP, an AWS
> secondary IP is used.

I really can't help with this. It looks like both VMs suddenly froze most of
their processes, or maybe there was some kind of clock jump that exhausted the
timeouts... I really don't know.

It sounds more related to your virtualization stack, I suppose. Maybe some kind
of "hot" backup? Maybe the hypervisor didn't schedule enough CPU for your VMs
for too long?

It is surprising that both VMs had timeouts at almost the same time. Do you know
if they are on the same hypervisor host? If they are, this is a SPoF: you should
move one of them to another host.

++

> On Thu, 18 Apr 2019 at 18:24, Jehan-Guillaume de Rorthais <jgdr at dalibo.com>
> wrote:
> 
> > On Thu, 18 Apr 2019 14:19:44 +0200
> > Danka Ivanović <danka.ivanovic at gmail.com> wrote:
> >
> >
> >
> > It seems you had timeouts for both fencing resources and your standby at the
> > same time here:
> >  
> > > Apr 17 10:03:34 master pengine[12480]: warning: Processing failed op
> > >   monitor for fencing-secondary on master: unknown error (1)
> > > Apr 17 10:03:34 master pengine[12480]: warning: Processing failed op
> > >   monitor for fencing-master on secondary: unknown error (1)
> > > Apr 17 10:03:34 master pengine[12480]: warning: Processing failed op
> > >   monitor for PGSQL:1 on secondary: unknown error (1)
> > > Apr 17 10:03:34 master pengine[12480]: warning: Forcing fencing-secondary
> > >   away from master after 1 failures (max=1)
> > > Apr 17 10:03:34 master pengine[12480]: warning: Forcing fencing-master away
> > >   from secondary after 1 failures (max=1)
> > > Apr 17 10:03:34 master pengine[12480]: warning: Forcing PGSQL-HA away from
> > >   secondary after 1 failures (max=1)
> > > Apr 17 10:03:34 master pengine[12480]: warning: Forcing PGSQL-HA away from
> > >   secondary after 1 failures (max=1)
> >
> > Because you have "migration-threshold=1", the standby will be shut down:
> >  
> > > Apr 17 10:03:34 master pengine[12480]: notice: Stop PGSQL:1 (secondary)  
> >
> > The transition is stopped because the pgsql master timed out in the
> > meantime:
> >  
> > > Apr 17 10:03:40 master crmd[12481]: notice: Transition 3462 (Complete=5,
> > > Pending=0, Fired=0, Skipped=1, Incomplete=6,
> > > Source=/var/lib/pacemaker/pengine/pe-input-59.bz2): Stopped  
> >
> > and as you mentioned, your ldap as well:
> >  
> > > Apr 17 10:03:40 master nslcd[1518]: [d7e446] <group(all)> ldap_result()
> > > timed out  
> >
> > Here are the four timeout errors (2 fencing resources and 2 pgsql instances):
> >  
> > > Apr 17 10:03:40 master pengine[12480]: warning: Processing failed op
> > >   monitor for fencing-secondary on master: unknown error (1)
> > > Apr 17 10:03:40 master pengine[12480]: warning: Processing failed op
> > >   monitor for PGSQL:0 on master: unknown error (1)
> > > Apr 17 10:03:40 master pengine[12480]: warning: Processing failed op
> > >   monitor for fencing-master on secondary: unknown error (1)
> > > Apr 17 10:03:40 master pengine[12480]: warning: Processing failed op
> > >   monitor for PGSQL:1 on secondary: unknown error (1)  
> >
> > As a reaction, Pacemaker decides to stop everything because it cannot move
> > the resources anywhere:
> >  
> > > Apr 17 10:03:40 master pengine[12480]: warning: Forcing PGSQL-HA away from
> > >   master after 1 failures (max=1)
> > > Apr 17 10:03:40 master pengine[12480]: warning: Forcing PGSQL-HA away from
> > >   master after 1 failures (max=1)
> > > Apr 17 10:03:40 master pengine[12480]: warning: Forcing fencing-secondary
> > >   away from master after 1 failures (max=1)
> > > Apr 17 10:03:40 master pengine[12480]: warning: Forcing fencing-master away
> > >   from secondary after 1 failures (max=1)
> > > Apr 17 10:03:40 master pengine[12480]: warning: Forcing PGSQL-HA away from
> > >   secondary after 1 failures (max=1)
> > > Apr 17 10:03:40 master pengine[12480]: warning: Forcing PGSQL-HA away from
> > >   secondary after 1 failures (max=1)
> > > Apr 17 10:03:40 master pengine[12480]: notice: Stop AWSVIP (master)
> > > Apr 17 10:03:40 master pengine[12480]: notice: Demote PGSQL:0 (Master ->
> > > Stopped master)
> > > Apr 17 10:03:40 master pengine[12480]: notice: Stop PGSQL:1 (secondary)  
> >
> > Now, the following lines are really not expected. Why does systemd detect
> > that PostgreSQL stopped?
> >  
> > > Apr 17 10:03:40 master postgresql@9.5-main[32458]: Cluster is not
> > >   running.
> > > Apr 17 10:03:40 master systemd[1]: postgresql@9.5-main.service: Control
> > >   process exited, code=exited status=2
> > > Apr 17 10:03:40 master systemd[1]: postgresql@9.5-main.service: Unit
> > >   entered failed state.
> > > Apr 17 10:03:40 master systemd[1]: postgresql@9.5-main.service: Failed
> > >   with result 'exit-code'.
> >
> > I suspect the service is still enabled or has been started by hand.
> >
> > As soon as you set up a resource in Pacemaker, the admin should **always**
> > ask Pacemaker to start/stop it. Never use systemctl to handle the resource
> > yourself.
> >
> > You must disable this service in systemd.

