[ClusterLabs] Fwd: Postgres pacemaker cluster failure

Wed Jul 10 11:25:57 EDT 2019

We tried to fix the LDAP issue with the nss_initgroups_ignoreusers option in
nslcd.conf for the postgres and hacluster users, so the cluster shouldn't
contact the LDAP server every 15 seconds when it checks PostgreSQL as the
postgres user:
/usr/lib/postgresql/9.5/bin/pg_isready -h /var/run/postgresql/ -p 5432
We have two LDAP servers, and when one was unavailable, the cluster failed
immediately due to a timeout, even though it could reach the other LDAP server.
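
For anyone trying to reproduce this, here is a minimal shell sketch for
checking whether the slow part is the user/group lookup or the PostgreSQL
check itself. The helper names timed and pg_isready_label are made up for
illustration, and the -t 5 timeout is an arbitrary choice. The exit status
meanings are the ones documented for pg_isready.

```shell
#!/bin/sh
# Map pg_isready's documented exit status to a short label.
pg_isready_label() {
    case "$1" in
        0) echo "accepting connections" ;;
        1) echo "rejecting connections" ;;
        2) echo "no response" ;;
        3) echo "no attempt made" ;;
        *) echo "unexpected status $1" ;;
    esac
}

# Hypothetical helper: run a command, discard its output, and print the
# elapsed wall-clock seconds together with its exit status.
timed() {
    start=$(date +%s)
    "$@" >/dev/null 2>&1
    rc=$?
    end=$(date +%s)
    echo "elapsed=$((end - start))s rc=$rc"
    return "$rc"
}

# Usage on a cluster node (not executed here):
#   timed id postgres    # group lookup, the path nss_initgroups_ignoreusers should short-circuit
#   timed /usr/lib/postgresql/9.5/bin/pg_isready -h /var/run/postgresql/ -p 5432 -t 5
#   pg_isready_label "$?"
```

If "id postgres" is slow while one LDAP server is down, the group lookup is
probably still going to LDAP despite the nslcd.conf change.
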
I know that starting the master database with systemctl should be avoided,
but I didn't find a way to start it with Pacemaker. I will test again, but I
am out of ideas, because I have already tried different pgsqlms options and
different versions of Postgres.
But now it looks like something else happened.

On Wed, Jul 10, 2019 at 4:57 PM Jehan-Guillaume de Rorthais <jgdr at dalibo.com>
wrote:

> On Wed, 10 Jul 2019 16:34:17 +0200
> Danka Ivanovic <danka.ivanovic at sbgenomics.com> wrote:
>
> > Hi, Thank you all for responding so quickly. Part of the corosync.log file
> > is attached. The cluster failure occurred at 09:16 AM yesterday.
> > Debug mode is turned on in the corosync configuration, but I didn't turn it
> > on in the pacemaker config. I will test that.
>
> There's really nothing interesting in there sadly. It could even be like
> pgsqlms hadn't been called at all and the action timed out...
>
> > Postgres log is also attached.
>
> Nothing really relevant there either.
>
> > Several times the cluster failed because of an LDAP timeout, even though
> > I tried to disable LDAP lookups for the local postgres user,
>
> This is really annoying. IIRC, this was already happening last time. Fix
> this first if you haven't yet?
>
> ...
> > From syslog it looks like the postgres systemd process was
> > stopped,
>
> Again, systemd shouldn't take part in anything in your cluster with regard
> to PostgreSQL.
> If Pacemaker manages PostgreSQL, systemd should have nothing to do with it.
>
> If you really need to start/stop it by hand (I really discourage you from
> doing so), do it using pg_ctl. And make sure to unmanage the Pacemaker
> resource beforehand.
>
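
A rough sketch of that sequence, for reference. It only prints the commands
unless DRY_RUN=0 is set. The resource id PGSQL is a guess based on the
"PGSQL_monitor_15000" log entry, the data directory is a placeholder, and the
function names run and manual_start are hypothetical. crm_resource is used
here to set the is-managed meta attribute; crmsh or pcs offer equivalent
unmanage commands.

```shell
#!/bin/sh
RSC="${RSC:-PGSQL}"                             # assumption: resource id guessed from the logs
PGBIN="${PGBIN:-/usr/lib/postgresql/9.5/bin}"   # path taken from the pg_isready call in this thread
PGDATA="${PGDATA:-/path/to/datadir}"            # placeholder, set it to the real data directory

# Print the command in dry-run mode (the default), otherwise execute it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

manual_start() {
    # 1. Tell Pacemaker to stop managing the resource.
    run crm_resource --resource "$RSC" --meta --set-parameter is-managed --parameter-value false
    # 2. Start PostgreSQL by hand with pg_ctl, as the postgres user.
    run sudo -u postgres "$PGBIN/pg_ctl" -D "$PGDATA" start
    # 3. Hand control back to Pacemaker.
    run crm_resource --resource "$RSC" --meta --set-parameter is-managed --parameter-value true
}

# Usage: manual_start             (dry run, prints the three commands)
#        DRY_RUN=0 manual_start   (really runs them)
```

On a master/slave setup the attribute may need to be set on the multi-state
resource rather than the primitive, so check the resource ids first.
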
> > > On Tue, 9 Jul 2019 19:57:06 +0300
> > > Andrei Borzenkov <arvidjaar at gmail.com> wrote:
> > >
> > > > 09.07.2019 13:08, Danka Ivanović wrote:
> > > > > Hi I didn't manage to start master with postgres, even if I increased
> > > > > start timeout. I checked executable paths and start options.
> > >
> > > We would require many more logs from this failure...
> > >
> > > > > When cluster is running with manually started master and slave started
> > > > > over pacemaker, everything works ok.
> > >
> > > Logs from this scenario might be interesting as well to check and
> > > compare.
> > >
> > > > > Today we had failover again.
> > > > > I cannot find reason from the logs, can you help me with debugging?
> > > > > Thanks.
> > >
> > > logs logs logs please.
> > >
> > > > > Jul 09 09:16:32 [2679] postgres1       lrmd:    debug: child_kill_helper:  Kill pid 12735's group
> > > > > Jul 09 09:16:34 [2679] postgres1       lrmd:  warning: child_timeout_callback: PGSQL_monitor_15000 process (PID 12735) timed out
> > > >
> > > > You probably want to enable debug output in resource agent. As far as I
> > > > can tell, this requires HA_debug=1 in environment of resource agent, but
> > > > for the life of me I cannot find where it is possible to set it.
> > > >
> > > > Probably setting it directly in resource agent for debugging is the most
> > > > simple way.
> > >
> > > I usually set this in "/etc/sysconfig/pacemaker". Never tried to add it
> > > to pgsqlms, interesting.
> > >
> > > > P.S. crm_resource is called by resource agent (pgsqlms). And it shows
> > > > result of original resource probing which makes it confusing. At least
> > > > it explains where these logs entries come from.
> > >
> > > Not sure I understand what you mean :/
> > >
>
>
>
> --
> Jehan-Guillaume de Rorthais
> Dalibo