[ClusterLabs] Is fencing really a must for Postgres failover?
Jehan-Guillaume de Rorthais
jgdr at dalibo.com
Mon Feb 11 10:00:25 EST 2019
On Mon, 11 Feb 2019 12:34:44 +0100
Maciej S <internet at swierki.com> wrote:
> I was wondering if anyone can give a plain answer as to whether fencing is
> really needed when there are no shared resources in use (as far as I define a
> shared resource).
>
> We want to use PAF or other Postgres (with replicated data files on the
> local drives)
I'm not sure I understand. Are you talking about PostgreSQL internal replication
or filesystem replication? What do you mean by "on local drives"?
> failover agent together with Corosync, Pacemaker and a virtual
> IP resource, and I am wondering if there is a need for fencing (which is
> very closely bound to the infrastructure) if Pacemaker is already controlling
> resource states. I know that in a failover case there might be a need to add
> functionality to recover a master that entered a dirty shutdown state (e.g. in
> case of a power outage), but I can't see any case where fencing is really
> necessary. Am I wrong?
>
> I was looking for a definitive answer but I couldn't find one...
You need fencing for several reasons (see the configuration sketches after this list):
* with the default configuration, Pacemaker will refuse to promote a new primary if
  the state of the current primary is unknown. Fencing resolves this ambiguity.
* an unresponsive primary doesn't mean the primary has failed. Before
  failing over, the cluster must be able to determine the state of the
  resource/node by itself: this is what fencing provides.
* your service is not safe after a failover if you leave a rogue node alive in an
  inconsistent state. What if the node comes back to life after some freeze,
  with all its services still running? IP and PgSQL up? Pacemaker may
  detect it, but 1) detection might take some time (the monitor interval) 2) the
  action taken might be the opposite of what you want and trigger another
  outage if the service must move again
* a split-brain is usually much worse than a simple service outage.
* I would recommend NOT auto-failing back a node. This is another layer of
  complexity with additional failure scenarios. If a PostgreSQL instance or a node
  went wrong, a human needs to understand why before bringing it back on its feet.
Good luck.