[ClusterLabs] Help with PostgreSQL Automatic Failover demotion

Fri Feb 18 16:44:58 EST 2022

Hello all,

I have a two-node Pacemaker cluster configured with a high-availability PostgreSQL resource.   Occasionally, I notice that when the Master/primary DB is running under significant write load, the cluster "monitor" operation will time out, thus incrementing the failcount for the ha-db resource.  This happened again recently, and the running primary DB was demoted and then re-promoted to be the running primary.    What I'm having trouble understanding is why the running Master/primary DB was demoted.  After the monitor operation timed out, the failcount for the ha-db resource was still less than the configured "migration-threshold", which is set to 5.

Here's some information on the cluster and the scenario:

OS:  RHEL 7.9
Corosync: 2.4.5-7
Pacemaker:  1.1.23-1
PAF resource agent: 2.3.0
Master DB node:  dbsrv2
Standby DB node: dbsrv1
Other resources:  fencing and VIP
Cluster resource defaults:
  migration-threshold=5
  resource-stickiness=10

The overall scenario is this:

  *   The Master/primary DB is running on node dbsrv2
  *   A monitoring timeout occurs on node dbsrv2
  *   The master resource for node dbsrv2 is incremented from 2 to 3  (there were 2 other failures days before)
  *   The master DB is demoted to standby (i.e. stopped and restarted) on node dbsrv2
  *   Node dbsrv2 is re-promoted to Master/primary, with the failcount remaining at 3

My expectation was that no demotion should have occurred until the migration threshold was met.   Any help in explaining why the running primary was demoted would be appreciated!

I've attached the corosync log from the time period this occurred.  The scenario I described starts at 20:55 on 2/16/22 in the log.

Thanks,

Larry

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://lists.clusterlabs.org/pipermail/users/attachments/20220218/f5c9ba21/attachment-0001.htm>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: corosync.log-20220217.gz
Type: application/x-gzip
Size: 23219 bytes
Desc: corosync.log-20220217.gz
URL: <https://lists.clusterlabs.org/pipermail/users/attachments/20220218/f5c9ba21/attachment-0001.bin>