[ClusterLabs] Replicated PGSQL woes

Fri Oct 14 16:10:08 UTC 2016

On Oct 14, 2016, at 1:39 AM, Jehan-Guillaume de Rorthais <jgdr at dalibo.com> wrote:
> 
> On Thu, 13 Oct 2016 14:11:06 -0800
> Israel Brewster <israel at ravnalaska.net> wrote:
> 
>> On Oct 13, 2016, at 1:56 PM, Jehan-Guillaume de Rorthais <jgdr at dalibo.com>
>> wrote:
>>> 
>>> On Thu, 13 Oct 2016 10:05:33 -0800
>>> Israel Brewster <israel at ravnalaska.net> wrote:
>>> 
>>>> On Oct 13, 2016, at 9:41 AM, Ken Gaillot <kgaillot at redhat.com> wrote:  
>>>>> 
>>>>> On 10/13/2016 12:04 PM, Israel Brewster wrote:    
>>> [...]
>>> 
>>>>>> But whatever- this is a cluster, it doesn't really matter which node
>>>>>> things are running on, as long as they are running. So the cluster is
>>>>>> working - postgresql starts, the master process is on the same node as
>>>>>> the IP, you can connect, etc, everything looks good. Obviously the next
>>>>>> thing to try is failover - should the master node fail, the slave node
>>>>>> should be promoted to master. So I try testing this by shutting down the
>>>>>> cluster on the primary server: "pcs cluster stop"
>>>>>> ...and nothing happens. The master shuts down (uncleanly, I might add -
>>>>>> it leaves behind a lock file that prevents it from starting again until
>>>>>> I manually remove said lock file), but the slave is never promoted to    
>>>>> 
>>>>> This definitely needs to be corrected. What creates the lock file, and
>>>>> how is that entity managed?    
>>>> 
>>>> The lock file entity is created/managed by the postgresql process itself.
>>>> On launch, postgres creates the lock file to say it is running, and
>>>> deletes said lock file when it shuts down. To my understanding, its role
>>>> in life is to prevent a restart after an unclean shutdown so the admin is
>>>> reminded to make sure that the data is in a consistent state before
>>>> starting the server again.  
>>> 
>>> What is the name of this lock file? Where is it?
>>> 
>>> PostgreSQL does not create lock file. It creates a "postmaster.pid" file,
>>> but it does not forbid a startup if the new process doesn't find another
>>> process with the pid and shm shown in the postmaster.pid.
>>> 
>>> As far as I know, the pgsql resource agent create such a lock file on
>>> promote and delete it on graceful stop. If the PostgreSQL instance couldn't
>>> be stopped correctly, the lock files stays and the RA refuse to start it
>>> the next time.  
>> 
>> Ah, you're right. Looking auth the RA I see where it creates the file in
>> question. The delete appears to be in the pgsql_real_stop() function (which
>> makes sense), wrapped in an if block that checks for $1 being master and
>> $OCF_RESKEY_CRM_meta_notify_slave_uname being a space. Throwing a little
>> debugging code in there I see that when it hits that block on a cluster stop,
>> $OCF_RESKEY_CRM_meta_notify_slave_uname is centtest1.ravnalaska.net
>> <http://centtest1.ravnalaska.net/>, not a space, so the lock file is not
>> removed:
>> 
>>    if  [ "$1" = "master" -a "$OCF_RESKEY_CRM_meta_notify_slave_uname" = " "
>> ]; then ocf_log info "Removing $PGSQL_LOCK."
>>        rm -f $PGSQL_LOCK
>>    fi 
>> 
>> It doesn't look like there is anywhere else where the file would be removed.
> 
> This is quite wrong to me for two reasons (I'll try to be clear):
> 
> 1) the resource agent (RA) make sure the timeline (TL) will not be incremented
> during promotion.
> 
> As there is no documentation about that, I'm pretty sure this contortion comes
> from limitations in very old versions of PostgreSQL (<= 9.1):
> 
>  * a slave wasn't able to cross a timeline (TL) from streaming replication,
>    only from WAL archives. That means crossing a TL was requiring to restart
>    the slave or cutting the streaming rep temporary to force it to get back to
>    the archives
>  * moreover, it was possible a standby miss some transactions on after a clean
>    master shutdown. That means the old master couldn't get back to the
>    cluster as a slave safely, as the TL is still the same...
> 
> See slide 35->37: http://www.slideshare.net/takmatsuo/2012929-pg-study-16012253
> 
> In my understanding, that's why we make sure there's no slave around before
> shutting down the master: should the master go back later cleanly, we make sure
> no one could be promoted in the meantime.
> 
> Note that considering this issue and how the RA tries to avoid it, this test on
> slave being shutdown before master is quite weak anyway...
> 
> Last but not least, the two PostgreSQL limitations the RA is messing with have
> been fixed long time ago in 9.3:
>  * https://www.postgresql.org/docs/current/static/release-9-3.html#AEN138909
>  *
> https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=985bd7d49726c9f178558491d31a570d47340459
> 
> ...but it requires PostgreSQL 9.3+ for the timeline issue. By the way, I suspect
> this is related to the "restart_on_promote" parameter of the RA.
> 
> 2) from a recent discussion on this list (or maybe on -dev), RA devs should not
> rely on OCF_RESKEY_CRM_meta_notify_* vars outside of "notify" actions.
> 
>>> [...]  
>>>>>> What can I do to fix this? What troubleshooting steps can I follow?
>>>>>> Thanks.  
>>> 
>>> I can not find the result of the stop operation in your log files, maybe the
>>> log from CentTest2 would be more useful.  
>> 
>> Sure. I was looking at centtest1 because I was trying to figure out why it
>> wouldn't promote, but if centtest2 never really stopped (properly) that could
>> explain things. Here's the log from 2 when calling pcs cluster stop:
>> 
>> [log log log]
> 
> Well tis is a normal shutdown and the master was shutdown cleanly. AS you
> pointed out, the lock file stayed there because some slaves were still up.
> 
> I **guess** if you really want a shutdown to occurs, you need to simulate a real
> failure, not shutting down the first node cleanly. Try to kill corosync.

From an academic standpoint the result of that test (which, incidentally, were the same as the results of every other test I've done) are interesting, however from a practical standpoint I'm not sure it helps much - most of the "failures" that I experience are intentional: I want to fail over to the other machine so I can run some software updates, reboot for whatever reason, shutdown temporarily to upgrade the hardware, or whatever.  While handling "real" failures is, of course, the real purpose of HA, that type of failure should be pretty rare. I would hope :-)

> 
>>> but I can find this:
>>> 
>>> Oct 13 08:29:41 CentTest1 pengine[30095]:   notice: Scheduling Node
>>> centtest2.ravnalaska.net for shutdown
>>> ...
>>> Oct 13 08:29:41 CentTest1 pengine[30095]:   notice: Scheduling Node
>>> centtest2.ravnalaska.net for shutdown
>>> 
>>> Which means the stop operation probably raised an error, leading to a
>>> fencing of the node. In this circumstance, I bet PostgreSQL wasn't able to
>>> stop correctly and the lock file stayed in place.
>>> 
>>> Could you please show us your full cluster setup?  
>> 
>> Sure: how? pcs status shows this, but I suspect that's not what you are
>> asking about:
> 
> "pcs config" would do the trick.

Here we go:

Cluster Name: cluster_test
Corosync Nodes:
 centtest1.ravnalaska.net centtest2.ravnalaska.net
Pacemaker Nodes:
 centtest1.ravnalaska.net centtest2.ravnalaska.net

Resources:
 Resource: virtual_ip (class=ocf provider=heartbeat type=IPaddr2)
  Attributes: ip=10.211.55.200 iflabel=pg0
  Operations: start interval=0s timeout=20s (virtual_ip-start-interval-0s)
              stop interval=0s timeout=20s (virtual_ip-stop-interval-0s)
              monitor interval=30s (virtual_ip-monitor-interval-30s)
 Master: msPostgresql
  Meta Attrs: master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 notify=true 
  Resource: pgsql_96 (class=ocf provider=heartbeat type=pgsql)
   Attributes: pgctl=/usr/pgsql-9.6/bin/pg_ctl logfile=/var/log/pgsql/test2.log psql=/usr/pgsql-9.6/bin/psql pgdata=/pgsql96/data rep_mode=async repuser=postgres node_list="centtest2.ravnalaska.net centest1.ravnalaska.net" master_ip=10.211.55.200 archive_cleanup_command= restart_on_promote=true replication_slot_name=centtest_2_slot monitor_user=postgres monitor_password=SuperSecret
   Operations: start on-fail=restart interval=0s timeout=60s (pgsql_96-start-interval-0s)
               monitor on-fail=restart interval=4s timeout=60s (pgsql_96-monitor-interval-4s)
               monitor interval=3s role=Master timeout=60s on-fail=restart (pgsql_96-monitor-interval-3s)
               promote on-fail=restart interval=0s timeout=60s (pgsql_96-promote-interval-0s)
               demote on-fail=stop interval=0s timeout=60s (pgsql_96-demote-interval-0s)
               stop on-fail=block interval=0s timeout=60s (pgsql_96-stop-interval-0s)
               notify interval=0s timeout=60s (pgsql_96-notify-interval-0s)

Stonith Devices:
Fencing Levels:

Location Constraints:
  Resource: virtual_ip
    Enabled on: centtest2.ravnalaska.net (score:50) (id:location-virtual_ip-centtest2.ravnalaska.net-50)
Ordering Constraints:
  promote msPostgresql then start virtual_ip (score:INFINITY) (non-symmetrical) (id:order-msPostgresql-virtual_ip-INFINITY)
  demote msPostgresql then stop virtual_ip (score:0) (non-symmetrical) (id:order-msPostgresql-virtual_ip-0)
Colocation Constraints:
  virtual_ip with msPostgresql (score:INFINITY) (rsc-role:Started) (with-rsc-role:Master) (id:colocation-virtual_ip-msPostgresql-INFINITY)

Resources Defaults:
 No defaults set
Operations Defaults:
 No defaults set

Cluster Properties:
 cluster-infrastructure: cman
 dc-version: 1.1.14-8.el6_8.1-70404b0
 have-watchdog: false
 last-lrm-refresh: 1476461302
 maintenance-mode: false
 no-quorum-policy: ignore
 stonith-enabled: false
Node Attributes:
 centtest1.ravnalaska.net: pgsql_96-data-status=DISCONNECT
 centtest2.ravnalaska.net: pgsql_96-data-status=LATEST

I find this line particularly interesting:  centtest1.ravnalaska.net: pgsql_96-data-status=DISCONNECT, especially since it is completely wrong. centtest1 *is* connected and replicating. Does potentially explain some things though.

Thanks

-----------------------------------------------
Israel Brewster
Systems Analyst II
Ravn Alaska
5245 Airport Industrial Rd
Fairbanks, AK 99709
(907) 450-7293
-----------------------------------------------
> 
> 
> -- 
> Jehan-Guillaume de Rorthais
> Dalibo

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.clusterlabs.org/pipermail/users/attachments/20161014/7d61e084/attachment-0002.html>