[ClusterLabs] Is fencing really a must for Postgres failover?

Jehan-Guillaume de Rorthais jgdr at dalibo.com
Mon Feb 11 10:00:25 EST 2019


On Mon, 11 Feb 2019 12:34:44 +0100
Maciej S <internet at swierki.com> wrote:

> I was wondering if anyone can give a plain answer as to whether fencing is
> really needed when no shared resources are being used (as far as I define
> shared resources).
> 
> We want to use PAF or another Postgres (with replicated data files on
> local drives)

I'm not sure I understand. Are you talking about PostgreSQL's internal
replication or filesystem replication? What do you mean by "on local drives"?

> failover agent together with Corosync, Pacemaker and a virtual
> IP resource, and I am wondering whether there is a need for fencing (which
> is very tightly bound to the infrastructure) if Pacemaker is already
> controlling resource state. I know that in a failover case there might be a
> need to add functionality to recover a master that entered a dirty shutdown
> state (e.g. in case of a power outage), but I can't see any case where
> fencing is really necessary. Am I wrong?
> 
> I was looking for a strict answer but I couldn't find one...

You need fencing for various reasons:

* With the default configuration, Pacemaker will refuse to promote a new
  primary if the state of the current primary is unknown. Fencing solves this
  (see the configuration sketch after this list).
* An unresponsive primary doesn't mean the primary has failed. Before
  failing over, the cluster must be able to determine the state of the
  resource/node by itself: this is fencing.
* Your service is not safe after a failover if you leave a rogue node alive
  in an inconsistent state. What if the node comes back to life after some
  freeze with all of its services still running? IP and PgSQL up? Maybe
  Pacemaker will detect it, but 1) it might take some time (the monitor
  interval) and 2) the action taken might be the opposite of what you want
  and trigger another outage if the service must move again.
* A split-brain is usually much worse than a simple service outage.
* I would recommend NOT auto-failing back a node. This is another layer of
  high complexity with additional failure scenarios. If PostgreSQL or a node
  went wrong, a human needs to understand why before getting it back on its
  feet.
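
For what it's worth, enabling fencing in Pacemaker mostly means declaring a
fence device per node and keeping stonith enabled. Here is a minimal sketch
using pcs and fence_ipmilan; the node names, IPMI addresses and credentials
are placeholders to adapt to your infrastructure, pick whatever fence agent
actually matches it, and note that fence_ipmilan parameter names vary between
fence-agents versions (older ones use ipaddr/login/passwd):

  # fencing is enabled by default, make sure nobody turned it off
  pcs property set stonith-enabled=true

  # one IPMI fence device per node (placeholder addresses and credentials)
  pcs stonith create fence-node1 fence_ipmilan \
      pcmk_host_list="node1" ip="10.0.0.101" \
      username="admin" password="secret" lanplus=1
  pcs stonith create fence-node2 fence_ipmilan \
      pcmk_host_list="node2" ip="10.0.0.102" \
      username="admin" password="secret" lanplus=1

  # prefer not running a fence device on the node it is supposed to kill
  pcs constraint location fence-node1 avoids node1
  pcs constraint location fence-node2 avoids node2

  # related to the last point above: avoid automatic fail-back, keep
  # resources where they are until a human moves them back on purpose
  pcs resource defaults resource-stickiness=INFINITY

With something like this in place, when the primary's node stops answering,
Pacemaker can fence it, know its state for sure, and promote a standby
safely instead of guessing.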

Good luck.


