[ClusterLabs] PAF not starting resource successfully after node reboot (was: How to set up fencing/stonith)

Ken Gaillot kgaillot at redhat.com
Sun May 27 20:28:42 UTC 2018


On Wed, 2018-05-23 at 14:22 -0600, Casey & Gina wrote:
> I have pcsd set to auto-start at boot, but not pacemaker or
> corosync.  After I power off the node in vSphere, the node is fenced
> and then powered back on.  I see it show up in `pcs status` with PCSD
> Status of Online after a few seconds but shown as OFFLINE in the list
> of nodes on top since pacemaker and corosync are not running.  If I
> then do a `pcs cluster start` on the rebooted node, it is again
> restarted.  So I cannot get it to rejoin the cluster.
> 
> The corosync log from another node in the cluster (pasted below)
> indicates that PostgreSQL fails to start after pacemaker/corosync are
> restarted (on d-gp2-dbpg0-1 in this case), but it does not seem to
> give any reason as to why.  When I look on the failed node, I see
> that the PostgreSQL log is not being appended, so it doesn't seem
> it's ever actually trying to start it.  I'm not sure where else I
> could try looking.
> 
> Strangely, if prior to running `pcs cluster start` on the rebooted
> node, I sudo to postgres, copy the recovery.conf template to the data
> directory, and use pg_ctl to start the database, it comes up just
> fine in standby mode.  Then if I do `pcs cluster start`, the node
> rejoins the cluster just fine without any problem.
> 
> Can you tell me why pacemaker is failing to start PostgreSQL in
> standby mode based on the log data below, or how I can dig deeper
> into what is going on?  Is this due to some misconfiguration on my
> part?  I thought that PAF would try to do exactly what I do manually,
> but it doesn't seem this is the case...
> 
> Actually, why is Pacemaker fencing the standby node just because the
> resource fails to start there?  I thought only the master should be
> fenced if it were assumed to be broken.
> 
> Thank you for any help you can provide,

Pacemaker isn't fencing because the start failed, at least not
directly:

> May 22 23:57:24 [2196] d-gp2-dbpg0-2    pengine:     info:
> determine_op_status: Operation monitor found resource postgresql-10-
> main:2 active on d-gp2-dbpg0-2

> May 22 23:57:24 [2196] d-gp2-dbpg0-2    pengine:   notice:
> LogActions:  Demote  postgresql-10-main:1    (Master -> Slave d-gp2-
> dbpg0-1)
> May 22 23:57:24 [2196] d-gp2-dbpg0-2    pengine:   notice:
> LogActions:  Recover postgresql-10-main:1    (Master d-gp2-dbpg0-1)

From the above, we can see that the initial probe after the node
rejoined found that the resource was already running in master mode
there (at least, that's what the agent thinks). So, the cluster wants
to demote it, stop it, and start it again as a slave.
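If you want to double-check what the agent reported and what the
policy engine intends to do about it, something along these lines
(a sketch; the options are standard crm_mon/crm_simulate flags, but
check your versions) run on any cluster node should show it:

    # detailed one-shot status, including failed operations and node
    # attributes (PAF keeps its master scores in node attributes)
    crm_mon -1 -A --show-detail

    # ask the policy engine what it would do with the live CIB,
    # including the scores behind the demote/stop/start decisions
    crm_simulate -sL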

> May 22 23:57:24 [2197] d-gp2-dbpg0-2       crmd:   notice:
> abort_transition_graph:      Transition aborted by postgresql-10-
> main_demote_0 'modify' on d-gp2-dbpg0-1: Event failed
> (magic=0:1;13:27:0:0df60493-9320-463d-94ca-a9515d139f9f, cib=0.35.70,
> source=match_graph_event:381, 0)

But the demote failed.
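The pengine messages on the DC only tell you that the demote returned
an error; the reason is usually in the resource agent's own messages,
which land in the cluster log (and/or syslog) on the node that ran the
operation, not on the DC. As a rough sketch (adjust the path to
wherever your setup logs, e.g. the corosync.log you quoted):

    # on d-gp2-dbpg0-1, the node that ran the failed demote
    grep -E 'pgsqlms|postgresql-10-main' /var/log/corosync/corosync.log | tail -n 50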

> May 22 23:57:24 [2196] d-gp2-dbpg0-2    pengine:   notice:
> LogActions:  Stop    postgresql-10-main:1    (d-gp2-dbpg0-1)

So now the cluster wants to just stop it there.

> May 22 23:57:24 [2197] d-gp2-dbpg0-2       crmd:   notice:
> abort_transition_graph:      Transition aborted by postgresql-10-
> main_stop_0 'modify' on d-gp2-dbpg0-1: Event failed
> (magic=0:1;2:28:0:0df60493-9320-463d-94ca-a9515d139f9f, cib=0.35.74,
> source=match_graph_event:381, 0)

But the stop failed too,

> May 22 23:57:24 [2196] d-gp2-dbpg0-2    pengine:  warning:
> pe_fence_node:       Node d-gp2-dbpg0-1 will be fenced because of
> resource failure(s)

which is why the cluster then wants to fence the node. (If a resource
won't stop, the only way to recover it is to kill the entire node.)
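That escalation is controlled by the per-operation on-fail setting:
for a failed stop the default is to fence the node when STONITH is
enabled, and to block the resource otherwise. You could in principle
change it, roughly like this (pcs syntax varies by version, so check
pcs resource update --help; shown only to illustrate the knob, not as
a recommendation):

    # make a failed stop block the resource instead of fencing the node
    # (generally NOT advisable for a promotable database resource)
    pcs resource update postgresql-10-main op stop on-fail=block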
-- 
Ken Gaillot <kgaillot at redhat.com>

