[ClusterLabs] Antw: Re: PostgreSQL PAF failover issue

Ulrich Windl Ulrich.Windl at rz.uni-regensburg.de
Mon Jul 1 05:13:49 EDT 2019


>>> Tiemen Ruiten <t.ruiten at rdmedia.com> wrote on 14.06.2019 at 16:43 in
message
<CAAegNz2TY9Z_L6g9CNP+iXjH0bHjha0-maZhLexaJx7UAp-ZDQ at mail.gmail.com>:
> Right, so I may have been too fast to give up. I set maintenance mode back
> on and promoted ph-sql-04 manually. Unfortunately I don't have the logs of
> ph-sql-03 anymore because I reinitialized it.
> 
> You mention that demote timeout should be start timeout + stop timeout.
> Start/stop are 60s, so that would mean 120s for demote timeout? Or 30s for
> start/stop?

Timeout values always depend on your specific configuration, so no general values can be given. I suggest timing the operations once (perhaps with a very large timeout), then setting the timeout to the measured value times a safety factor such as 1.5 or even 3. Of course it all depends: if fencing and a restart including recovery is faster than waiting for an extraordinarily slow stop, you may prefer a shorter timeout value. As said before: it all depends...
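As a minimal sketch of that measure-then-multiply approach (the paths, the helper function and the 1.5 factor are illustrative assumptions, not part of any official procedure):

```shell
#!/bin/sh
# Derive a suggested timeout from a measured duration and a 1.5x
# safety factor, using integer arithmetic rounded up.
suggest_timeout() {
    measured_s=$1
    echo $(( (measured_s * 3 + 1) / 2 ))
}

# Hypothetical measurement of one full stop/start cycle; the pg_ctl
# calls are commented out because they need a real instance:
#   start=$(date +%s)
#   pg_ctl -D /var/lib/pgsql/11/data stop -m fast -w
#   pg_ctl -D /var/lib/pgsql/11/data start -w
#   end=$(date +%s)
#   measured=$(( end - start ))

# If the cycle took e.g. 80s, a 1.5x safety factor suggests 120s:
suggest_timeout 80    # prints 120
```

Use a factor of 3 instead if you want a more conservative margin.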

Sorry for the late response, BTW.

Ulrich

> 
> 
> 
> 
> On Fri, 14 Jun 2019 at 15:55, Jehan-Guillaume de Rorthais <jgdr at dalibo.com>
> wrote:
> 
>> On Fri, 14 Jun 2019 13:18:09 +0200
>> Tiemen Ruiten <t.ruiten at rdmedia.com> wrote:
>>
>> > Thank you, useful advice!
>> >
>> > Logs are attached, they cover the period between when I set
>> > maintenance-mode=false till after the node fencing.
>>
>> Switchover started @ 09:51:43
>>
>> In fact, the action that timed out was the demote action, not the stop
>> action:
>>
>>   pgsqld_demote_0:31997 - timed out after 30000ms
>>
>> As explained, the demote is doing a stop/start because PgSQL doesn't
>> support hot demotion. So your demote timeout should be stop timeout +
>> start timeout. I would recommend 60s there instead of 30s.
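As a sketch, that demote timeout could be raised with pcs; the resource name matches the one quoted later in this thread, but verify it (and your pcs version's syntax) against your own cluster first:

```shell
# Sketch only: raise the demote timeout to cover a full stop + start
# (60s each in this thread's configuration). Run on one cluster node;
# the resource name pgsqld is taken from this thread.
pcs resource update pgsqld op demote timeout=120s

# Confirm the new value:
pcs resource show pgsqld | grep demote
```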
>>
>> After Pacemaker decided what to do next, you had some more timeouts. I
>> suppose the PgSQL logs should give some more explanation of what happened
>> during these long minutes:
>>
>>   pgsqld_notify_0:37945 - timed out after 60000ms
>>   ...
>>   pgsqld_stop_0:7783 - timed out after 60000ms
>>
>> It is 09:54:16. Now the pengine becomes angry and wants to make sure
>> pgsql is stopped on node 03:
>>
>>   pengine:  warning: unpack_rsc_op_failure:   Processing failed stop of
>> pgsqld:1
>>     on ph-sql-03: unknown error | rc=1
>>   ...
>>   pengine:  warning: pe_fence_node:   Cluster node ph-sql-03 will be
>> fenced:
>>     pgsqld:1 failed there
>>   ...
>>   pengine:  warning: stage6:  Scheduling Node ph-sql-03 for STONITH
>>   ...
>>   pengine:   notice: native_stop_constraints: Stop of failed resource
>> pgsqld:1
>>     is implicit after ph-sql-03 is fenced
>>
>>
>> From there node 03 is down for 9 minutes, it comes back at 10:02:59.
>>
>> Meanwhile, @ 09:54:29, node 5 took over the DC role and decided to promote
>> pgsql
>> on node 4 as expected.
>>
>> The pre-promote notify actions are triggered, but at 09:55:24, the
>> transition is canceled because of maintenance mode:
>>
>>   Transition aborted by cib-bootstrap-options-maintenance-mode doing modify
>>    maintenance-mode=true
>>
>> Soon after, both notify actions timed out on both nodes:
>>
>>   warning: child_timeout_callback:  pgsqld_notify_0 process (PID 38838)
>> timed
>>   out
>>
>> Not sure what happened on your side that could explain these timeouts, but
>> because the cluster was in maintenance mode, there was human interaction
>> ongoing anyway.
>>
>>
>>
>>
>>
>>
>> > On Fri, 14 Jun 2019 at 12:48, Jehan-Guillaume de Rorthais <
>> jgdr at dalibo.com>
>> > wrote:
>> >
>> > > Hi,
>> > >
>> > > On Fri, 14 Jun 2019 12:27:12 +0200
>> > > Tiemen Ruiten <t.ruiten at rdmedia.com> wrote:
>> > > > I set up a new 3-node PostgreSQL cluster with HA managed by PAF.
>> Nodes are
>> > > > named ph-sql-03, ph-sql-04, ph-sql-05. Archive mode is on and writing
>> > > > archive files to an NFS share that's mounted on all nodes using
>> > > pgBackRest.
>> > > >
>> > > > What I did:
>> > > > - Create a pacemaker cluster, cib.xml is attached.
>> > > > - Set maintenance-mode=true in pacemaker
>> > >
>> > > This is not required. Just build your PgSQL replication, shut down the
>> > > instances, then add the PAF resource to the cluster.
>> > >
>> > > But it's not very important here.
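For reference, a sketch of adding the PAF resource that way (attributes and timeouts mirror the configuration quoted later in this thread; pcs syntax, especially `--master`, varies between pcs versions, so treat this as an assumption to check against the PAF documentation):

```shell
# Sketch: create the PAF multi-state resource after replication is built
# and the instances are shut down. Values come from this thread's cib.
pcs resource create pgsqld ocf:heartbeat:pgsqlms \
    bindir=/usr/pgsql-11/bin \
    pgdata=/var/lib/pgsql/11/data \
    recovery_template=/var/lib/pgsql/recovery.conf.pcmk \
    op start timeout=60s op stop timeout=60s \
    op promote timeout=30s op demote timeout=120s \
    op monitor interval=15s timeout=10s role="Master" \
    op monitor interval=16s timeout=10s role="Slave" \
    op notify timeout=60s \
    --master notify=true
```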
>> > >
>> > > > - Bring up ph-sql-03 with pg_ctl start
>> > > > - Take a pg_basebackup on ph-sql-04 and ph-sql-05
>> > > > - Create a recovery.conf on ph-sql-04 and ph-sql-05:
>> > > >
>> > > > standby_mode = 'on'
>> > > > primary_conninfo = 'user=replication password=XXXXXXXXXXXXXXXX
>> > > > application_name=ph-sql-0x host=10.100.130.20 port=5432
>> sslmode=prefer
>> > > > sslcompression=0 krbsrvname=postgres target_session_attrs=any'
>> > > > recovery_target_timeline = 'latest'
>> > > > restore_command = 'pgbackrest --stanza=pgdb2 archive-get %f "%p"'
>> > >
>> > > Sounds fine.
>> > >
>> > > > - Bring up ph-sql-04 and ph-sql-05 and let recovery finish
>> > > > - Set maintenance-mode=false in pacemaker
>> > > > - Cluster is now running with ph-sql-03 as master and ph-sql-04/5
>> as
>> > > slaves
>> > > > At this point I tried a manual failover:
>> > > > - pcs resource move --wait --master pgsql-ha ph-sql-04
>> > > > Contrary to my expectations, pacemaker attempted to stop psqld on
>> > > > ph-sql-03.
>> > >
>> > > Indeed. PostgreSQL doesn't support hot-demote. It has to be shut down
>> > > and started as a standby.
>> > >
>> > > > This took longer than the configured timeout of 60s (checkpoint
>> > > > hadn't completed yet) and the node was fenced.
>> > >
>> > > 60s of checkpoint during a maintenance window? That's important
>> > > indeed. I would recommend doing a manual checkpoint before triggering
>> > > the move/switchover.
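A sketch of such a switchover procedure (resource, node and user names are taken from or assumed for this thread; adjust to your setup):

```shell
# Sketch: force a checkpoint on the current master first, so the demote's
# internal stop/start doesn't have to wait on checkpoint flushing.
psql -h ph-sql-03 -U postgres -c 'CHECKPOINT;'

# Then trigger the switchover as in the original procedure:
pcs resource move --wait --master pgsql-ha ph-sql-04
```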
>> > >
>> > > > Then I ended up with
>> > > > ph-sql-04 and ph-sql-05 both in slave mode and ph-sql-03 rebooting.
>> > > >
>> > > >  Master: pgsql-ha
>> > > >   Meta Attrs: notify=true
>> > > >   Resource: pgsqld (class=ocf provider=heartbeat type=pgsqlms)
>> > > >    Attributes: bindir=/usr/pgsql-11/bin pgdata=/var/lib/pgsql/11/data
>> > > > recovery_template=/var/lib/pgsql/recovery.conf.pcmk
>> > > >    Operations: demote interval=0s timeout=30s
>> (pgsqld-demote-interval-0s)
>> > > >                methods interval=0s timeout=5
>> (pgsqld-methods-interval-0s)
>> > > >                monitor interval=15s role=Master timeout=10s
>> > > > (pgsqld-monitor-interval-15s)
>> > > >                monitor interval=16s role=Slave timeout=10s
>> > > > (pgsqld-monitor-interval-16s)
>> > > >                notify interval=0s timeout=60s
>> (pgsqld-notify-interval-0s)
>> > > >                promote interval=0s timeout=30s
>> > > (pgsqld-promote-interval-0s)
>> > > >                reload interval=0s timeout=20
>> (pgsqld-reload-interval-0s)
>> > > >                start interval=0s timeout=60s
>> (pgsqld-start-interval-0s)
>> > > >                stop interval=0s timeout=60s (pgsqld-stop-interval-0s)
>> > > >
>> > > > I understand I should at least increase the timeout of the stop
>> operation
>> > > > for psqld, though I'm not sure how much. Checkpoints can take up to
>> 15
>> > > > minutes to complete on this cluster. So is 20 minutes reasonable?
>> > >
>> > > 20 minutes is not reasonable for HA; 2 minutes is, for a manual
>> > > procedure. Timeouts are there so the cluster knows how to react
>> > > during an unexpected failure, not during maintenance.
>> > >
>> > > As I wrote, just add a manual checkpoint in your switchover procedure
>> > > before
>> > > the actual move.
>> > >
>> > > > Any other operations I should increase the timeouts for?
>> > > >
>> > > > Why didn't pacemaker elect and promote one of the other nodes?
>> > >
>> > > Do you have logs of all nodes during this time period?
>> > >
>> > >
>> >
>>
>>
>>
>> --
>> Jehan-Guillaume de Rorthais
>> Dalibo
>>
> 
> 
> -- 
> Tiemen Ruiten
> Systems Engineer
> R&D Media




