[ClusterLabs] Fwd: Postgres pacemaker cluster failure

Fri Apr 19 06:23:32 EDT 2019

Thanks for the clarification about failure-timeout, migration-threshold and
Pacemaker.
The instances are hosted on AWS, and they are in the same security groups
and availability zones.
I don't have information about the hardware hosting those VMs, since they
are not dedicated. Both machines use the UTC timezone and the default
ntp configuration. Here is the ntpq peer listing:
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
 0.ubuntu.pool.n .POOL.          16 p    -   64    0    0.000    0.000   0.000
 1.ubuntu.pool.n .POOL.          16 p    -   64    0    0.000    0.000   0.000
 2.ubuntu.pool.n .POOL.          16 p    -   64    0    0.000    0.000   0.000
 3.ubuntu.pool.n .POOL.          16 p    -   64    0    0.000    0.000   0.000
 ntp.ubuntu.com  .POOL.          16 p    -   64    0    0.000    0.000   0.000
+198.46.223.227  204.9.54.119     2 u   65  512  377   22.318    0.096   1.111
-time1.plumdev.n .GPS.            1 u  116  512  377   72.487    1.386   0.544
-199.180.133.100 140.142.2.8      3 u  839 1024  377   65.574   -1.199   1.167
+helium.constant 128.59.0.245     2 u  217  512  377    7.368    0.952   0.090
*i.will.not.be.e 213.251.128.249  2 u  207  512  377   14.733    1.185   0.305
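
For anyone who wants to check a listing like the one above automatically, here is a minimal Python sketch. It assumes the standard ten-column ntpq peer layout, with offset and jitter in milliseconds and reach as an octal register. The function names and the 10 ms threshold are my own illustrative choices, not anything from ntp or Pacemaker, and the sample rows are copied from the listing (the .POOL. placeholder rows are left out).

```python
# Sample rows copied from the ntpq listing (pool placeholder rows omitted).
NTPQ_ROWS = """\
+198.46.223.227  204.9.54.119     2 u   65  512  377   22.318    0.096   1.111
-time1.plumdev.n .GPS.            1 u  116  512  377   72.487    1.386   0.544
-199.180.133.100 140.142.2.8      3 u  839 1024  377   65.574   -1.199   1.167
+helium.constant 128.59.0.245     2 u  217  512  377    7.368    0.952   0.090
*i.will.not.be.e 213.251.128.249  2 u  207  512  377   14.733    1.185   0.305
"""


def parse_peers(text):
    """Parse ntpq peer rows into dicts (hypothetical helper)."""
    peers = []
    for line in text.splitlines():
        if not line.strip():
            continue
        # The first character is the tally code ('*' marks the system peer).
        tally, fields = line[0], line[1:].split()
        if len(fields) != 10:
            continue  # skip header, separator, or malformed rows
        peers.append({
            "tally": tally,
            "remote": fields[0],
            "reach": int(fields[6], 8),      # octal register of the last 8 polls
            "offset_ms": float(fields[8]),
            "jitter_ms": float(fields[9]),
        })
    return peers


def summarize(peers, max_offset_ms=10.0):
    """Summarize sync state; max_offset_ms is an arbitrary example threshold."""
    sys_peer = next((p["remote"] for p in peers if p["tally"] == "*"), None)
    worst = max((abs(p["offset_ms"]) for p in peers), default=0.0)
    return {
        "sys_peer": sys_peer,
        "worst_offset_ms": worst,
        "all_reachable": all(p["reach"] == 0o377 for p in peers),
        "ok": sys_peer is not None and worst <= max_offset_ms,
    }


if __name__ == "__main__":
    print(summarize(parse_peers(NTPQ_ROWS)))
```

This only reflects the state at the moment the listing was taken, so it cannot rule out a clock jump at the time of the incident.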

On Fri, 19 Apr 2019 at 11:46, Jehan-Guillaume de Rorthais <jgdr at dalibo.com>
wrote:

> On Fri, 19 Apr 2019 11:08:33 +0200
> Danka Ivanović <danka.ivanovic at gmail.com> wrote:
>
> > Hi,
> > Thank you for your response.
> >
> > Ok, it seems that the fencing resources and the secondary timed out at
> > the same time, together with ldap.
> > I understand that because of "migration-threshold=1", the standby tried
> > to recover just once and then was stopped. Is this ok, or should the
> > threshold be increased?
>
> It depends on your use case, really.
>
> Note that as soon as a resource hits its migration-threshold, there's an
> implicit constraint forbidding it to come back on this node until you
> reset the failcount. That's why your pgsql master resource never came
> back anywhere.
>
> You can also set failure-timeout if you are brave enough to automate the
> failure reset. See:
>
> https://clusterlabs.org/pacemaker/doc/en-US/Pacemaker/1.1/html/Pacemaker_Explained/s-failure-migration.html
>
> > The master server is started with systemctl, then pacemaker is started
> > on the master, which detects the master. Then, when pacemaker is started
> > on the secondary, it brings up the postgres service in slave mode.
>
> You should not. Systemd should not mess with resources handled by
> Pacemaker.
>
> > I didn't manage to start the postgres master through pacemaker. I tested
> > failover with a setup like this and it works. I will try to set up
> > postgres to be run by pacemaker,
>
> Pacemaker is supposed to start the resource itself if it is enabled in its
> setup. Look at this whole chapter (its end is important):
>
> https://clusterlabs.github.io/PAF/CentOS-7-admin-cookbook.html#starting-or-stopping-the-cluster
>
> > but I am concerned about those timeouts which
> > caused the cluster to crash. Can you help me investigate why this
> > happened, or what I should change in order to avoid it? The virtual IP
> > is an AWS secondary IP.
>
> Really, I can't help with this. It looks like both VMs suddenly froze most
> of their processes, or maybe there was some kind of clock jump, exhausting
> the timeouts... I really don't know.
>
> It sounds more related to your virtualization stack, I suppose. Maybe some
> kind of "hot" backup? Maybe the hypervisor didn't schedule enough CPU to
> your VMs for too long?
>
> It is surprising that both VMs had timeouts at almost the same time. Do
> you know if they are on the same hypervisor host? If they are, this is a
> SPoF: you should move one of them to another host.
>
> ++
>
> > On Thu, 18 Apr 2019 at 18:24, Jehan-Guillaume de Rorthais
> > <jgdr at dalibo.com> wrote:
> >
> > > On Thu, 18 Apr 2019 14:19:44 +0200
> > > Danka Ivanović <danka.ivanovic at gmail.com> wrote:
> > >
> > >
> > >
> > > It seems both fencing resources and your standby timed out at the
> > > same time here:
> > >
> > > > Apr 17 10:03:34 master pengine[12480]: warning: Processing failed op
> > > >   monitor for fencing-secondary on master: unknown error (1)
> > > > Apr 17 10:03:34 master pengine[12480]: warning: Processing failed op
> > > >   monitor for fencing-master on secondary: unknown error (1)
> > > > Apr 17 10:03:34 master pengine[12480]: warning: Processing failed op
> > > >   monitor for PGSQL:1 on secondary: unknown error (1)
> > > > Apr 17 10:03:34 master pengine[12480]: warning: Forcing fencing-secondary
> > > >   away from master after 1 failures (max=1)
> > > > Apr 17 10:03:34 master pengine[12480]: warning: Forcing fencing-master
> > > >   away from secondary after 1 failures (max=1)
> > > > Apr 17 10:03:34 master pengine[12480]: warning: Forcing PGSQL-HA away
> > > >   from secondary after 1 failures (max=1)
> > > > Apr 17 10:03:34 master pengine[12480]: warning: Forcing PGSQL-HA away
> > > >   from secondary after 1 failures (max=1)
> > >
> > > Because you have "migration-threshold=1", the standby will be shut down:
> > >
> > > > Apr 17 10:03:34 master pengine[12480]: notice: Stop PGSQL:1 (secondary)
> > >
> > > The transition is stopped because the pgsql master timed out in the
> > > meantime:
> > >
> > > > Apr 17 10:03:40 master crmd[12481]: notice: Transition 3462 (Complete=5,
> > > > Pending=0, Fired=0, Skipped=1, Incomplete=6,
> > > > Source=/var/lib/pacemaker/pengine/pe-input-59.bz2): Stopped
> > >
> > > and as you mentioned, your ldap as well:
> > >
> > > > Apr 17 10:03:40 master nslcd[1518]: [d7e446] <group(all)> ldap_result()
> > > > timed out
> > >
> > > Here are the four timeout errors (2 fencings and 2 pgsql instances):
> > >
> > > > Apr 17 10:03:40 master pengine[12480]: warning: Processing failed op
> > > >   monitor for fencing-secondary on master: unknown error (1)
> > > > Apr 17 10:03:40 master pengine[12480]: warning: Processing failed op
> > > >   monitor for PGSQL:0 on master: unknown error (1)
> > > > Apr 17 10:03:40 master pengine[12480]: warning: Processing failed op
> > > >   monitor for fencing-master on secondary: unknown error (1)
> > > > Apr 17 10:03:40 master pengine[12480]: warning: Processing failed op
> > > >   monitor for PGSQL:1 on secondary: unknown error (1)
> > >
> > > As a reaction, Pacemaker decides to stop everything because it cannot
> > > move the resources anywhere:
> > >
> > > > Apr 17 10:03:40 master pengine[12480]: warning: Forcing PGSQL-HA away
> > > > from master after 1 failures (max=1)
> > > > Apr 17 10:03:40 master pengine[12480]: warning: Forcing PGSQL-HA away
> > > > from master after 1 failures (max=1)
> > > > Apr 17 10:03:40 master pengine[12480]: warning: Forcing fencing-secondary
> > > > away from master after 1 failures (max=1)
> > > > Apr 17 10:03:40 master pengine[12480]: warning: Forcing fencing-master
> > > > away from secondary after 1 failures (max=1)
> > > > Apr 17 10:03:40 master pengine[12480]: warning: Forcing PGSQL-HA away
> > > > from secondary after 1 failures (max=1)
> > > > Apr 17 10:03:40 master pengine[12480]: warning: Forcing PGSQL-HA away
> > > > from secondary after 1 failures (max=1)
> > > > Apr 17 10:03:40 master pengine[12480]: notice: Stop AWSVIP (master)
> > > > Apr 17 10:03:40 master pengine[12480]: notice: Demote PGSQL:0
> > > > (Master -> Stopped master)
> > > > Apr 17 10:03:40 master pengine[12480]: notice: Stop PGSQL:1 (secondary)
> > >
> > > Now, the following lines are really unexpected. Why does systemd detect
> > > PostgreSQL as stopped?
> > >
> > > > Apr 17 10:03:40 master postgresql@9.5-main[32458]: Cluster is not running.
> > > > Apr 17 10:03:40 master systemd[1]: postgresql@9.5-main.service: Control
> > > > process exited, code=exited status=2
> > > > Apr 17 10:03:40 master systemd[1]: postgresql@9.5-main.service: Unit
> > > > entered failed state.
> > > > Apr 17 10:03:40 master systemd[1]: postgresql@9.5-main.service: Failed
> > > > with result 'exit-code'.
> > >
> > > I suspect the service is still enabled or has been started by hand.
> > >
> > > As soon as you set up a resource in Pacemaker, admins should **always**
> > > ask Pacemaker to start/stop it. Never use systemctl to handle the
> > > resource yourself.
> > >
> > > You must disable this service in systemd.
>
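
To make the "nowhere left to run" reasoning from the quoted pengine warnings easier to follow, here is a rough Python sketch that reads the "Forcing ... away from ..." lines and lists the resources that ended up banned from every node. It is only an illustration of the logic described in the thread, not how Pacemaker computes placement internally. The function names are hypothetical, and the sample lines are the quoted log lines joined back onto one line each.

```python
import re
from collections import defaultdict

# Log lines taken from the quoted pengine output, unwrapped.
LOG = """\
Apr 17 10:03:40 master pengine[12480]: warning: Forcing PGSQL-HA away from master after 1 failures (max=1)
Apr 17 10:03:40 master pengine[12480]: warning: Forcing fencing-secondary away from master after 1 failures (max=1)
Apr 17 10:03:40 master pengine[12480]: warning: Forcing fencing-master away from secondary after 1 failures (max=1)
Apr 17 10:03:40 master pengine[12480]: warning: Forcing PGSQL-HA away from secondary after 1 failures (max=1)
"""

FORCED = re.compile(
    r"Forcing (\S+) away from (\S+) after (\d+) failures \(max=(\d+)\)"
)


def banned_nodes(log_text):
    """Map each resource to the set of nodes it was forced away from."""
    banned = defaultdict(set)
    for match in FORCED.finditer(log_text):
        resource, node, failures, threshold = match.groups()
        # The failcount reached migration-threshold on this node.
        if int(failures) >= int(threshold):
            banned[resource].add(node)
    return dict(banned)


def nowhere_to_run(banned, cluster_nodes):
    """Resources banned from every node in cluster_nodes (illustrative only)."""
    nodes = set(cluster_nodes)
    return sorted(res for res, away in banned.items() if nodes <= away)


if __name__ == "__main__":
    banned = banned_nodes(LOG)
    print(nowhere_to_run(banned, ["master", "secondary"]))
```

With the lines above, PGSQL-HA is the only resource banned from both nodes, which matches the explanation that it stays stopped until the failcount is reset.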

-- 
Regards,
Danka Ivanovic