[ClusterLabs] Fwd: Postgres pacemaker cluster failure

Danka Ivanović danka.ivanovic at gmail.com
Fri Apr 19 11:26:14 EDT 2019


Here is the command output from crm configure show:

node 1: master \
attributes master-PGSQL=1001
node 2: secondary \
attributes master-PGSQL=1000
primitive AWSVIP awsvip \
params secondary_private_ip=10.x.x.x api_delay=5
primitive PGSQL pgsqlms \
params pgdata="/var/lib/postgresql/9.5/main" \
bindir="/usr/lib/postgresql/9.5/bin" pghost="/var/run/postgresql/" \
recovery_template="/etc/postgresql/9.5/main/recovery.conf.pcmk" \
start_opts="-c config_file=/etc/postgresql/9.5/main/postgresql.conf" \
op start timeout=60s interval=0 \
op stop timeout=60s interval=0 \
op promote timeout=15s interval=0 \
op demote timeout=120s interval=0 \
op monitor interval=15s timeout=10s role=Master \
op monitor interval=16s timeout=10s role=Slave \
op notify timeout=60 interval=0
primitive fencing-master stonith:external/ec2 \
params port=master \
op start interval=0s timeout=60s \
op monitor interval=360s timeout=60s \
op stop interval=0s timeout=60s
primitive fencing-secondary stonith:external/ec2 \
params port=secondary \
op start interval=0s timeout=60s \
op monitor interval=360s timeout=60s \
op stop interval=0s timeout=60s
ms PGSQL-HA PGSQL \
meta master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 \
notify=true interleave=true
colocation IPAWSIP-WITH-MASTER inf: AWSVIP PGSQL-HA:Master
order demote-then-stop-ip Mandatory: _rsc_set_ PGSQL-HA:demote AWSVIP:stop \
symmetrical=false
location loc-fence-master fencing-master -inf: master
location loc-fence-secondary fencing-secondary -inf: secondary
order promote-then-ip Mandatory: _rsc_set_ PGSQL-HA:promote AWSVIP:start \
symmetrical=false
property cib-bootstrap-options: \
have-watchdog=false \
dc-version=1.1.14-70404b0 \
cluster-infrastructure=corosync \
cluster-name=pgc-psql-ha \
stonith-enabled=true \
no-quorum-policy=ignore \
maintenance-mode=false \
last-lrm-refresh=1551885417
rsc_defaults rsc-options: \
resource-stickiness=10 \
migration-threshold=1

Should I change any of those timeout parameters in order to avoid the timeouts?
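
For example, would something along these lines be the right direction (just a
sketch with crmsh, the 30s value is only an illustration)?

  # open the PGSQL primitive in an editor and bump the monitor timeouts,
  # e.g. change "op monitor interval=15s timeout=10s role=Master"
  #          to "op monitor interval=15s timeout=30s role=Master"
  crm configure edit PGSQL
  # then verify the result:
  crm configure show PGSQL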

On Fri, 19 Apr 2019 at 12:23, Danka Ivanović <danka.ivanovic at gmail.com>
wrote:

> Thanks for the clarification about failure-timeout, migration-threshold
> and pacemaker.
> The instances are hosted on the AWS cloud, and they are in the same security
> groups and availability zones.
> I don't have information about the hardware which hosts those VMs since they
> are not dedicated. The UTC timezone is configured on both machines, with the
> default ntp configuration:
>      remote           refid      st t when poll reach   delay   offset  jitter
> ==============================================================================
>  0.ubuntu.pool.n .POOL.          16 p    -   64    0    0.000    0.000   0.000
>  1.ubuntu.pool.n .POOL.          16 p    -   64    0    0.000    0.000   0.000
>  2.ubuntu.pool.n .POOL.          16 p    -   64    0    0.000    0.000   0.000
>  3.ubuntu.pool.n .POOL.          16 p    -   64    0    0.000    0.000   0.000
>  ntp.ubuntu.com  .POOL.          16 p    -   64    0    0.000    0.000   0.000
> +198.46.223.227  204.9.54.119     2 u   65  512  377   22.318    0.096   1.111
> -time1.plumdev.n .GPS.            1 u  116  512  377   72.487    1.386   0.544
> -199.180.133.100 140.142.2.8      3 u  839 1024  377   65.574   -1.199   1.167
> +helium.constant 128.59.0.245     2 u  217  512  377    7.368    0.952   0.090
> *i.will.not.be.e 213.251.128.249  2 u  207  512  377   14.733    1.185   0.305
>
>
> On Fri, 19 Apr 2019 at 11:46, Jehan-Guillaume de Rorthais <jgdr at dalibo.com>
> wrote:
>
>> On Fri, 19 Apr 2019 11:08:33 +0200
>> Danka Ivanović <danka.ivanovic at gmail.com> wrote:
>>
>> > Hi,
>> > Thank you for your response.
>> >
>> > Ok, it seems that the fencing resources and the secondary timed out at the
>> > same time, together with ldap.
>> > I understand that because of "migration-threshold=1", the standby tried to
>> > recover just once and then was stopped. Is this ok, or should the threshold
>> > be increased?
>>
>> It depends on your use case, really.
>>
>> Note that as soon as a resource hits the migration threshold, there's an
>> implicit constraint forbidding it to come back on this node until you reset
>> the failcount. That's why your pgsql master resource never came back anywhere.
>>
>> You can also set failure-timeout if you are brave enough to automate the
>> failure reset. See:
>>
>> https://clusterlabs.org/pacemaker/doc/en-US/Pacemaker/1.1/html/Pacemaker_Explained/s-failure-migration.html
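>>
>> For instance, something like this (just a sketch, the 5min value is only an
>> illustration, pick whatever fits your case):
>>
>>   # cluster-wide default:
>>   crm configure rsc_defaults failure-timeout=5min
>>   # or per resource:
>>   crm resource meta PGSQL-HA set failure-timeout 5min
>>   # and to reset an existing failcount by hand:
>>   crm resource cleanup PGSQL-HA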
>>
>> > The master server is started with systemctl, then pacemaker is started on
>> > master, which detects the master, and then starting pacemaker on secondary
>> > brings up the postgres service in slave mode.
>>
>> You should not. Systemd should not mess with resources handled by
>> Pacemaker.
>>
>> > I didn't manage to start the postgres master through pacemaker. I tested
>> > failover with a setup like this and it works. I will try to set up
>> > postgres to be run with pacemaker,
>>
>> Pacemaker is supposed to start the resource itself if it is enabled in its
>> setup. Look at this whole chapter (its end is important):
>>
>> https://clusterlabs.github.io/PAF/CentOS-7-admin-cookbook.html#starting-or-stopping-the-cluster
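>>
>> On your Ubuntu setup that would roughly look like this (a sketch, adapted
>> from the pcs commands on that page):
>>
>>   # systemd only starts the cluster stack:
>>   systemctl start corosync pacemaker
>>   # from then on, PostgreSQL is started/stopped only through Pacemaker:
>>   crm resource start PGSQL-HA
>>   crm resource stop PGSQL-HA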
>>
>> > but I am concerned about those timeouts which caused the cluster to crash.
>> > Can you help me investigate why this happened, or what I should change in
>> > order to avoid it? For the AWS virtual IP, an AWS secondary IP is used.
>>
>> Really, I can't help on this. It looks like both VMs suddenly froze most of
>> their processes, or maybe there was some kind of clock jump, exhausting the
>> timeouts... I really don't know.
>>
>> It sounds more related to your virtualization stack, I suppose. Maybe some
>> kind of "hot" backup? Maybe the hypervisor didn't schedule enough CPU to
>> your VMs for too long?
>>
>> It is surprising that both VMs had timeouts at almost the same time. Do you
>> know if they are on the same hypervisor host? If they are, this is a SPoF:
>> you should move one of them to another host.
>>
>> ++
>>
>> > On Thu, 18 Apr 2019 at 18:24, Jehan-Guillaume de Rorthais <
>> jgdr at dalibo.com>
>> > wrote:
>> >
>> > > On Thu, 18 Apr 2019 14:19:44 +0200
>> > > Danka Ivanović <danka.ivanovic at gmail.com> wrote:
>> > >
>> > >
>> > >
>> > > It seems you had timeouts for both fencing resources and your standby at
>> > > the same time here:
>> > >
>> > > > Apr 17 10:03:34 master pengine[12480]: warning: Processing failed op monitor for fencing-secondary on master: unknown error (1)
>> > > > Apr 17 10:03:34 master pengine[12480]: warning: Processing failed op monitor for fencing-master on secondary: unknown error (1)
>> > > > Apr 17 10:03:34 master pengine[12480]: warning: Processing failed op monitor for PGSQL:1 on secondary: unknown error (1)
>> > > > Apr 17 10:03:34 master pengine[12480]: warning: Forcing fencing-secondary away from master after 1 failures (max=1)
>> > > > Apr 17 10:03:34 master pengine[12480]: warning: Forcing fencing-master away from secondary after 1 failures (max=1)
>> > > > Apr 17 10:03:34 master pengine[12480]: warning: Forcing PGSQL-HA away from secondary after 1 failures (max=1)
>> > > > Apr 17 10:03:34 master pengine[12480]: warning: Forcing PGSQL-HA away from secondary after 1 failures (max=1)
>> > >
>> > > Because you have "migration-threshold=1", the standby will be shut down:
>> > >
>> > > > Apr 17 10:03:34 master pengine[12480]: notice: Stop PGSQL:1 (secondary)
>> > >
>> > > The transition is stopped because the pgsql master timed out in the
>> > > meantime:
>> > >
>> > > > Apr 17 10:03:40 master crmd[12481]: notice: Transition 3462 (Complete=5, Pending=0, Fired=0, Skipped=1, Incomplete=6, Source=/var/lib/pacemaker/pengine/pe-input-59.bz2): Stopped
>> > >
>> > > and as you mentioned, your ldap as well:
>> > >
>> > > > Apr 17 10:03:40 master nslcd[1518]: [d7e446] <group(all)> ldap_result() timed out
>> > >
>> > > Here are the four timeout errors (2 fencings and 2 pgsql instances):
>> > >
>> > > > Apr 17 10:03:40 master pengine[12480]: warning: Processing failed op monitor for fencing-secondary on master: unknown error (1)
>> > > > Apr 17 10:03:40 master pengine[12480]: warning: Processing failed op monitor for PGSQL:0 on master: unknown error (1)
>> > > > Apr 17 10:03:40 master pengine[12480]: warning: Processing failed op monitor for fencing-master on secondary: unknown error (1)
>> > > > Apr 17 10:03:40 master pengine[12480]: warning: Processing failed op monitor for PGSQL:1 on secondary: unknown error (1)
>> > >
>> > > As a reaction, Pacemaker decides to stop everything because it cannot
>> > > move resources anywhere:
>> > >
>> > > > Apr 17 10:03:40 master pengine[12480]: warning: Forcing PGSQL-HA away from master after 1 failures (max=1)
>> > > > Apr 17 10:03:40 master pengine[12480]: warning: Forcing PGSQL-HA away from master after 1 failures (max=1)
>> > > > Apr 17 10:03:40 master pengine[12480]: warning: Forcing fencing-secondary away from master after 1 failures (max=1)
>> > > > Apr 17 10:03:40 master pengine[12480]: warning: Forcing fencing-master away from secondary after 1 failures (max=1)
>> > > > Apr 17 10:03:40 master pengine[12480]: warning: Forcing PGSQL-HA away from secondary after 1 failures (max=1)
>> > > > Apr 17 10:03:40 master pengine[12480]: warning: Forcing PGSQL-HA away from secondary after 1 failures (max=1)
>> > > > Apr 17 10:03:40 master pengine[12480]: notice: Stop AWSVIP (master)
>> > > > Apr 17 10:03:40 master pengine[12480]: notice: Demote PGSQL:0 (Master -> Stopped master)
>> > > > Apr 17 10:03:40 master pengine[12480]: notice: Stop PGSQL:1 (secondary)
>> > >
>> > > Now, the following lines are really not expected. Why does systemd detect
>> > > that PostgreSQL stopped?
>> > >
>> > > > Apr 17 10:03:40 master postgresql@9.5-main[32458]: Cluster is not running.
>> > > > Apr 17 10:03:40 master systemd[1]: postgresql@9.5-main.service: Control process exited, code=exited status=2
>> > > > Apr 17 10:03:40 master systemd[1]: postgresql@9.5-main.service: Unit entered failed state.
>> > > > Apr 17 10:03:40 master systemd[1]: postgresql@9.5-main.service: Failed with result 'exit-code'.
>> > >
>> > > I suspect the service is still enabled or has been started by hand.
>> > >
>> > > As soon as you set up a resource in Pacemaker, the admin should **always**
>> > > ask Pacemaker to start/stop it. Never use systemctl to handle the resource
>> > > yourself.
>> > >
>> > > You must disable this service in systemd.
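>> > >
>> > > For instance (a sketch; the cluster name comes from the logs above, and
>> > > start.conf is the Debian/Ubuntu mechanism that keeps pg_ctlcluster from
>> > > auto-starting the instance at boot):
>> > >
>> > >   # tell pg_ctlcluster not to auto-start this cluster
>> > >   sed -i 's/^auto/manual/' /etc/postgresql/9.5/main/start.conf
>> > >   # and make sure the systemd wrapper unit is not enabled either
>> > >   systemctl disable postgresql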
>>
>
>
> --
> Pozdrav
> Danka Ivanovic
>


-- 
Pozdrav
Danka Ivanovic