[ClusterLabs] Fwd: Fwd: Postgres pacemaker cluster failure
Danka Ivanovic
danka.ivanovic at sbgenomics.com
Wed Jul 10 10:34:17 EDT 2019
Hi, thank you all for responding so quickly. Part of the corosync.log file is
attached. The cluster failure occurred at 09:16 AM yesterday.
Debug mode is turned on in the corosync configuration, but I didn't turn it on
in the pacemaker config. I will test that. The Postgres log is also attached.
Several times the cluster failed because of LDAP timeouts. I tried to disable
LDAP lookups for the local postgres user, but it then failed again after an
automatic pacemaker update, so several packages are on hold now. But I
cannot figure out what caused the failure this time.
From syslog it looks like the postgres systemd process was
stopped. postgres_exporter is just a script for monitoring postgres
replication.
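The replies quoted below suggest enabling resource agent debug output with HA_debug=1, for example in /etc/sysconfig/pacemaker (on Debian-based systems the equivalent file is usually /etc/default/pacemaker, an assumption worth checking). Before retesting, a quick way to confirm the variable is actually set is a minimal Python sketch like the one below. It assumes simple shell KEY=VALUE lines and treats HA_debug as off when it is unset, empty or "0", which, as far as we can tell, mirrors the check in ocf-shellfuncs; the function names are hypothetical.

```python
# Minimal sketch (assumptions labeled): check whether HA_debug is switched on in
# a sysconfig-style file such as /etc/sysconfig/pacemaker. Assumes simple shell
# KEY=VALUE lines; treats unset, empty or "0" as "debug off".
import shlex


def parse_env_file(text: str) -> dict:
    """Parse simple KEY=VALUE lines, skipping blanks and comment lines."""
    env = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        if line.startswith("export "):
            line = line[len("export "):].strip()
        key, _, value = line.partition("=")
        try:
            # shlex removes shell quoting, e.g. HA_debug="1" -> 1
            parts = shlex.split(value)
        except ValueError:
            # unbalanced quotes: fall back to the raw text
            parts = [value.strip()]
        env[key.strip()] = parts[0] if parts else ""
    return env


def ha_debug_enabled(env: dict) -> bool:
    """Debug output is considered off when HA_debug is unset, empty or "0"."""
    return env.get("HA_debug", "") not in ("", "0")
```

Usage: open the file, pass its contents to parse_env_file() and then to ha_debug_enabled(). Remember that pacemaker has to be restarted on the node for a changed environment to reach the resource agent.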
On Tue, 9 Jul 2019 19:57:06 +0300
> Andrei Borzenkov <arvidjaar at gmail.com> wrote:
>
> > 09.07.2019 13:08, Danka Ivanović пишет:
> > > Hi, I didn't manage to start the master with postgres, even though I increased
> > > the start timeout. I checked executable paths and start options.
>
> We would require much more logs from this failure...
>
> > > When the cluster is running with a manually started master and a slave
> > > started via pacemaker, everything works OK.
>
> Logs from this scenario might be interesting as well to check and compare.
>
> > > Today we had a failover again.
> > > I cannot find the reason in the logs. Can you help me with debugging?
> > > Thanks.
>
> logs logs logs please.
>
> > > Jul 09 09:16:32 [2679] postgres1 lrmd: debug: child_kill_helper: Kill pid 12735's group
> > > Jul 09 09:16:34 [2679] postgres1 lrmd: warning: child_timeout_callback: PGSQL_monitor_15000 process (PID 12735) timed out
> >
> > You probably want to enable debug output in the resource agent. As far as I
> > can tell, this requires HA_debug=1 in the environment of the resource agent,
> > but for the life of me I cannot find where it is possible to set it.
> >
> > Setting it directly in the resource agent for debugging is probably the
> > simplest way.
>
> I usually set this in "/etc/sysconfig/pacemaker". Never tried to add it
> to pgsqlms, interesting.
>
> > P.S. crm_resource is called by the resource agent (pgsqlms), and it shows
> > the result of the original resource probe, which makes it confusing. At
> > least it explains where these log entries come from.
>
> Not sure I understand what you mean :/
>
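The lrmd excerpt quoted above shows lrmd killing the process group of PID 12735 at 09:16:32 and reporting at 09:16:34 that PGSQL_monitor_15000 timed out. Pacemaker names recurring operations <resource>_<action>_<interval in ms>, so this is the 15-second monitor of the PGSQL resource, i.e. the pgsqlms monitor action itself did not finish in time. To pull such entries out of the large corosync.log, a minimal Python sketch could look like this; the regex assumes the exact wording of the line above, and the helper names are hypothetical.

```python
# Minimal sketch: pull the timed-out operation out of an lrmd log line and split
# its operation key into resource, action and interval. Assumes the key format
# <resource>_<action>_<interval_ms> seen in "PGSQL_monitor_15000".
import re
from typing import NamedTuple, Optional, Tuple

TIMEOUT_RE = re.compile(
    r"child_timeout_callback:\s+(?P<op>\S+)\s+process \(PID (?P<pid>\d+)\) timed out"
)

# Actions whose names themselves contain an underscore.
MULTIWORD_ACTIONS = ("migrate_to", "migrate_from")


class OpKey(NamedTuple):
    resource: str
    action: str
    interval_ms: int


def parse_op_key(op: str) -> OpKey:
    """Split from the right, since resource IDs may contain underscores."""
    head, interval = op.rsplit("_", 1)
    for action in MULTIWORD_ACTIONS:
        if head.endswith("_" + action):
            return OpKey(head[: -len(action) - 1], action, int(interval))
    resource, action = head.rsplit("_", 1)
    return OpKey(resource, action, int(interval))


def find_timeout(line: str) -> Optional[Tuple[OpKey, int]]:
    """Return (operation key, PID) for a child_timeout_callback line, else None."""
    m = TIMEOUT_RE.search(line)
    if m is None:
        return None
    return parse_op_key(m.group("op")), int(m.group("pid"))
```

Running find_timeout() over every line of corosync_0907_1.log lists each operation lrmd gave up on, which makes it easy to see whether only the monitor times out or other actions do too.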
-------------- next part --------------
2019-07-09 09:16:37 UTC LOG: database system was shut down at 2019-07-09 09:16:35 UTC
2019-07-09 09:16:37 UTC LOG: entering standby mode
2019-07-09 09:16:37 UTC LOG: consistent recovery state reached at 3F/BEE5B100
2019-07-09 09:16:37 UTC LOG: invalid record length at 3F/BEE5B100
2019-07-09 09:16:37 UTC FATAL: the database system is starting up
2019-07-09 09:16:37 UTC FATAL: the database system is starting up
2019-07-09 09:16:37 UTC FATAL: could not connect to the primary server: FATAL: the database system is starting up
FATAL: the database system is starting up
2019-07-09 09:16:37 UTC FATAL: the database system is starting up
2019-07-09 09:16:37 UTC FATAL: the database system is starting up
2019-07-09 09:16:37 UTC FATAL: could not connect to the primary server: FATAL: the database system is starting up
FATAL: the database system is starting up
2019-07-09 09:16:37 UTC FATAL: the database system is starting up
2019-07-09 09:16:37 UTC FATAL: the database system is starting up
2019-07-09 09:16:37 UTC FATAL: the database system is starting up
2019-07-09 09:16:37 UTC FATAL: the database system is starting up
2019-07-09 09:16:37 UTC FATAL: the database system is starting up
2019-07-09 09:16:37 UTC FATAL: the database system is starting up
2019-07-09 09:16:37 UTC FATAL: the database system is starting up
2019-07-09 09:16:37 UTC FATAL: the database system is starting up
2019-07-09 09:16:38 UTC FATAL: the database system is starting up
2019-07-09 09:16:38 UTC FATAL: the database system is starting up
2019-07-09 09:16:38 UTC FATAL: the database system is starting up
2019-07-09 09:16:38 UTC FATAL: the database system is starting up
2019-07-09 09:16:38 UTC FATAL: the database system is starting up
2019-07-09 09:16:38 UTC FATAL: the database system is starting up
2019-07-09 09:16:38 UTC FATAL: the database system is starting up
2019-07-09 09:16:38 UTC FATAL: the database system is starting up
2019-07-09 09:16:38 UTC FATAL: the database system is starting up
2019-07-09 09:16:38 UTC FATAL: the database system is starting up
2019-07-09 09:16:38 UTC FATAL: the database system is starting up
2019-07-09 09:16:38 UTC FATAL: the database system is starting up
2019-07-09 09:16:38 UTC FATAL: the database system is starting up
2019-07-09 09:16:38 UTC FATAL: the database system is starting up
2019-07-09 09:16:38 UTC FATAL: the database system is starting up
2019-07-09 09:16:39 UTC FATAL: the database system is starting up
2019-07-09 09:16:39 UTC FATAL: the database system is starting up
2019-07-09 09:16:39 UTC FATAL: the database system is starting up
2019-07-09 09:16:39 UTC FATAL: the database system is starting up
2019-07-09 09:16:39 UTC FATAL: the database system is starting up
2019-07-09 09:16:39 UTC FATAL: the database system is starting up
2019-07-09 09:16:39 UTC FATAL: the database system is starting up
2019-07-09 09:16:39 UTC FATAL: the database system is starting up
2019-07-09 09:16:39 UTC FATAL: the database system is starting up
2019-07-09 09:16:39 UTC FATAL: the database system is starting up
2019-07-09 09:16:39 UTC FATAL: the database system is starting up
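Most of the PostgreSQL log above is the same FATAL line repeated several times per second, which hides the few informative lines (shutdown at 09:16:35, restart in standby mode at 09:16:37, "could not connect to the primary server"). A minimal Python sketch for collapsing such a log into per-message counts follows; it assumes the "YYYY-MM-DD HH:MM:SS UTC LEVEL: message" prefix used here, and the function name is hypothetical.

```python
# Minimal sketch: collapse the repetitive PostgreSQL log into per-message counts.
# Assumes the "YYYY-MM-DD HH:MM:SS UTC LEVEL: message" prefix seen in this log;
# continuation lines without a timestamp are skipped.
import re
from collections import Counter
from typing import Iterable

LINE_RE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) UTC (\w+):\s+(.*)$")


def summarize(lines: Iterable[str]) -> Counter:
    """Count (level, message) pairs, ignoring the timestamp."""
    counts = Counter()
    for line in lines:
        m = LINE_RE.match(line.strip())
        if m:
            _ts, level, message = m.groups()
            counts[(level, message)] += 1
    return counts
```

Calling most_common() on the result lists the distinct messages with their counts, so the one-off LOG lines are no longer buried under the repeated FATAL ones.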
-------------- next part --------------
A non-text attachment was scrubbed...
Name: corosync_0907_1.log
Type: application/octet-stream
Size: 2739749 bytes
Desc: not available
URL: <https://lists.clusterlabs.org/pipermail/users/attachments/20190710/3edcf914/attachment-0001.obj>
-------------- next part --------------
Jul 9 09:03:01 postgres1 bash_exporter[1392]: map[hostname:postgres1 env: verb:replication job:psql_replication]
Jul 9 09:05:02 postgres1 CRON[10949]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
Jul 9 09:08:01 postgres1 bash_exporter[1392]: map[job:psql_replication hostname:postgres1 env: verb:replication]
Jul 9 09:13:01 postgres1 bash_exporter[1392]: map[job:psql_replication hostname:postgres1 env: verb:replication]
Jul 9 09:15:01 postgres1 CRON[12543]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
Jul 9 09:16:35 postgres1 postgresql@9.5-main[12781]: Cluster is not running.
Jul 9 09:16:35 postgres1 systemd[1]: postgresql@9.5-main.service: Control process exited, code=exited status=2
Jul 9 09:16:35 postgres1 systemd[1]: postgresql@9.5-main.service: Unit entered failed state.
Jul 9 09:16:35 postgres1 systemd[1]: postgresql@9.5-main.service: Failed with result 'exit-code'.
Jul 9 09:16:46 postgres1 postgres_exporter[1433]: time="2019-07-09T09:16:46Z" level=warning msg="Proceeding with outdated query maps, as the Postgres version could not be determined: Error scanning version string: pq: the database system is starting up" source="postgres_exporter.go:966"
Jul 9 09:16:46 postgres1 postgres_exporter[1433]: time="2019-07-09T09:16:46Z" level=info msg="Error retrieving settings: Error running query on database: pg pq: the database system is starting up\n" source="postgres_exporter.go:974"
Jul 9 09:16:46 postgres1 postgres_exporter[1433]: time="2019-07-09T09:16:46Z" level=info msg="Error running query on database: pg_stat_replication pq: the database system is starting up\n" source="postgres_exporter.go:869"
Jul 9 09:16:46 postgres1 postgres_exporter[1433]: time="2019-07-09T09:16:46Z" level=info msg="Error running query on database: pg_stat_activity pq: the database system is starting up\n" source="postgres_exporter.go:869"
Jul 9 09:16:46 postgres1 postgres_exporter[1433]: time="2019-07-09T09:16:46Z" level=info msg="Error running query on database: pg_stat_bgwriter pq: the database system is starting up\n" source="postgres_exporter.go:869"
Jul 9 09:16:46 postgres1 postgres_exporter[1433]: time="2019-07-09T09:16:46Z" level=info msg="Error running query on database: pg_stat_database pq: the database system is starting up\n" source="postgres_exporter.go:869"
Jul 9 09:16:46 postgres1 postgres_exporter[1433]: time="2019-07-09T09:16:46Z" level=info msg="Error running query on database: pg_stat_database_conflicts pq: the database system is starting up\n" source="postgres_exporter.go:869"
Jul 9 09:16:46 postgres1 postgres_exporter[1433]: time="2019-07-09T09:16:46Z" level=info msg="Error running query on database: pg_locks pq: the database system is starting up\n" source="postgres_exporter.go:869"
Jul 9 09:17:00 postgres1 postgres_exporter[1433]: time="2019-07-09T09:17:00Z" level=warning msg="Proceeding with outdated query maps, as the Postgres version could not be determined: Error scanning version string: pq: the database system is starting up" source="postgres_exporter.go:966"
Jul 9 09:17:01 postgres1 postgres_exporter[1433]: time="2019-07-09T09:17:01Z" level=info msg="Error retrieving settings: Error running query on database: pg pq: the database system is starting up\n" source="postgres_exporter.go:974"
Jul 9 09:17:01 postgres1 postgres_exporter[1433]: time="2019-07-09T09:17:01Z" level=info msg="Error running query on database: pg_stat_database pq: the database system is starting up\n" source="postgres_exporter.go:869"
Jul 9 09:17:01 postgres1 postgres_exporter[1433]: time="2019-07-09T09:17:01Z" level=info msg="Error running query on database: pg_stat_database_conflicts pq: the database system is starting up\n" source="postgres_exporter.go:869"
Jul 9 09:17:01 postgres1 CRON[13656]: (root) CMD ( cd / && run-parts --report /etc/cron.hourly)
Jul 9 09:17:01 postgres1 postgres_exporter[1433]: time="2019-07-09T09:17:01Z" level=info msg="Error running query on database: pg_locks pq: the database system is starting up\n" source="postgres_exporter.go:869"
Jul 9 09:17:01 postgres1 postgres_exporter[1433]: time="2019-07-09T09:17:01Z" level=info msg="Error running query on database: pg_stat_replication pq: the database system is starting up\n" source="postgres_exporter.go:869"
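The three attachments use different timestamp formats, but the postgres_exporter entries (e.g. syslog "Jul 9 09:16:46" carrying time="2019-07-09T09:16:46Z") suggest syslog is in UTC like the PostgreSQL log. Assuming lrmd logs in UTC too, merging them gives the sequence: 09:16:32 lrmd kills the monitor, 09:16:34 monitor timeout, 09:16:35 PostgreSQL shut down and the systemd unit failed, 09:16:37 PostgreSQL back in standby mode. A minimal Python sketch for building such a merged timeline follows; the year default and function names are assumptions for illustration.

```python
# Minimal sketch: merge lines from the lrmd, syslog and PostgreSQL logs into a
# single timeline. Assumes every log is in UTC and, because syslog and lrmd
# omit the year, that the events happened in 2019 (hypothetical default).
# Month names are parsed with %b, which assumes an English/C locale.
from datetime import datetime
from typing import Iterable, List, Tuple


def parse_ts(line: str, year: int = 2019) -> datetime:
    """Parse either '2019-07-09 09:16:37 UTC ...' or 'Jul 9 09:16:35 ...'."""
    try:
        return datetime.strptime(line[:19], "%Y-%m-%d %H:%M:%S")
    except ValueError:
        pass
    month, day, clock = line.split()[:3]
    return datetime.strptime(f"{year} {month} {day} {clock}", "%Y %b %d %H:%M:%S")


def timeline(*sources: Tuple[str, Iterable[str]]) -> List[Tuple[datetime, str, str]]:
    """Return (timestamp, source name, line) tuples sorted by time only."""
    events = [(parse_ts(line), name, line) for name, lines in sources for line in lines]
    # sorted() is stable, so lines sharing a timestamp keep their input order
    return sorted(events, key=lambda event: event[0])
```

Feeding it only the relevant minute of each log (for example 09:16 to 09:17) keeps the output short enough to paste into a reply.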