[ClusterLabs] PostgreSQL server timelines offset after promote

Wed Mar 27 04:04:39 EDT 2024

Bonjour Thierry,

On Mon, 25 Mar 2024 10:55:06 +0000
FLORAC Thierry <thierry.florac at onf.fr> wrote:

> I'm trying to create a PostgreSQL master/slave cluster using streaming
> replication and pgsqlms agent. Cluster is OK but my problem is this : the
> master node is sometimes restarted for system operations, and the slave is
> then promoted without any problem ; 

When you have to do some planed system operation, you **must** diligently
ask permission to pacemaker. Pacemaker is the real owner of your resource. It
will react to any unexpected event, even if it's a planed one. You must
consider it as a hidden colleague taking care of your resource.

There's various way to deal with Pacemaker when you need to do some system
maintenance on your primary, it depends on your constraints. Here are two
examples:

* ask Pacemaker to move the "promoted" role to another node
* then put the node in standby mode
* then do your admin tasks
* then unstandby your node: a standby should start on the original node
* optional: move back your "promoted" role to original node

Or:

* put the whole cluster in maintenance mode
* then do your admin tasks
* then check everything works as the cluster expect
* then exit the maintenance mode

The second one might be tricky if Pacemaker find some unexpected status/event
when exiting the maintenance mode.

You can find (old) example of administrative tasks in:

* with pcs: https://clusterlabs.github.io/PAF/CentOS-7-admin-cookbook.html
* with crm: https://clusterlabs.github.io/PAF/Debian-8-admin-cookbook.html
* with "low level" commands:
  https://clusterlabs.github.io/PAF/administration.html

These docs updates are long overdue, sorry about that :(

Also, here is an hidden gist (that needs some updates as well):

https://github.com/ClusterLabs/PAF/tree/workshop/docs/workshop/fr

> after reboot, the old master is re-promoted, but I often get an error in
> slave logs :
> 
>   FATAL:  la plus grande timeline 1 du serveur principal est derrière la
>           timeline de restauration 2
> 
> which can be translated in english to :
> 
>   FATAL: the highest timeline 1 of main server is behind restoration timeline
>          2

This is unexpected. I wonder how Pacemaker is being stopped. It is supposed to
stop gracefully its resource. The promotion scores should be updated to reflect
the local resource is not a primary anymore and PostgreSQL should be demoted
then stopped. It is supposed to start as a standby after a graceful shutdown.

Regards,