[ClusterLabs] Postgres streaming VIP-REP not coming up on slave
NAKAHIRA Kazutomo
nakahira_kazutomo_b1 at lab.ntt.co.jp
Wed Mar 18 01:58:58 UTC 2015
Hi,
As Brestan pointed out, the old master not coming back up as a slave is
expected behaviour.
BTW, this behaviour is separate from the original problem.
From the logs, it seems the promote action succeeded on cl2_lb1 after
cl1_lb1 was powered off.
Was the original problem resolved?
Also, cl2_lb1's postgresql.conf has the following problem:
2015-03-17 07:34:28 SAST DETAIL: The failed archive command was: cp
pg_xlog/0000001D00000008000000C2
172.16.0.5:/pgtablespace/archive/0000001D00000008000000C2
The "172.16.0.5:" prefix must be removed from the archive_command directive in
postgresql.conf; cp only copies between local paths, so it treats the address as
part of a literal (non-existent) filename.
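A minimal sketch of why the corrected, purely local form works. The exact archive_command is not shown in the thread; the form in the comment is inferred from the restore_command in the posted configuration (`cp /pgtablespace/archive/%f %p`), and the /tmp paths below are throwaway stand-ins for the real data directory and archive mount, not paths from the thread:

```shell
# Simulate the corrected archive_command, i.e. something like
#   archive_command = 'cp %p /pgtablespace/archive/%f'
# using throwaway local paths in place of the real datadir and archive mount.
ARCHIVE=/tmp/archive_demo
mkdir -p "$ARCHIVE"
printf 'fake WAL' > /tmp/0000001D00000008000000C2     # stands in for %p
cp /tmp/0000001D00000008000000C2 "$ARCHIVE/0000001D00000008000000C2"
ls "$ARCHIVE"
```

With the `172.16.0.5:` prefix in place, cp looks for a local directory literally named `172.16.0.5:` and fails, which is exactly the "failed archive command" in the log above.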
Best regards,
Kazutomo NAKAHIRA
On 2015/03/18 5:00, Rainer Brestan wrote:
> Yes, that's the expected behaviour.
> Takatoshi Matsuo describes in his papers why a former master can't come up as a
> slave without possible data corruption.
> And you do not get an indication from Postgres that the data on disk is corrupted.
> Therefore, he created the lock-file mechanism to prevent a former master from
> starting up.
> Making the base backup from the master discards any possibly wrong data from the
> slave, and the removed lock file indicates this to the resource agent.
> To shorten the discussion about "how this can be automated within the resource
> agent": there is no clean way of handling this with very large databases, for
> which it can take hours.
> What you should do is make the base backup in a temporary directory and then
> rename it to the name the Postgres instance requires after the base backup
> finishes successfully (yes, this requires twice the hard-disk space). Otherwise you
> might lose everything if your master breaks during the base backup.
> Rainer
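A sketch of the temp-directory approach Rainer describes, so a failed backup never destroys the only remaining copy of the data. The primary's address and the paths are taken from the recovery procedure quoted later in this thread; treat the whole script as an illustration to adapt, not a tested procedure:

```
# As postgres: back up into a scratch directory, then swap it into place
# only after pg_basebackup has finished successfully.
pg_basebackup -h 192.168.2.3 -U postgres -D /var/lib/pgsql/data.new -X stream -P
rm -f /var/lib/pgsql/tmp/PGSQL.lock     # removed lock file tells the RA the data is clean
mv /var/lib/pgsql/data /var/lib/pgsql/data.old
mv /var/lib/pgsql/data.new /var/lib/pgsql/data
rm -rf /var/lib/pgsql/data.old          # reclaim the doubled disk space
# Then, as root:
pcs resource cleanup msPostgresql
```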
> *Sent:* Tuesday, 17 March 2015, 12:16
> *From:* "Wynand Jansen van Vuuren" <esawyja at gmail.com>
> *To:* "Cluster Labs - All topics related to open-source clustering welcomed"
> <users at clusterlabs.org>
> *Subject:* Re: [ClusterLabs] Postgres streaming VIP-REP not coming up on slave
> Hi
> OK, I found this particular problem: when the failed node comes up again, the
> OS starts Postgres at boot. I have disabled this, and now the VIPs and Postgres
> remain on the new MASTER, but the failed node does not come up as a slave, i.e.
> there is no sync between the new master and the slave. Is this the expected
> behaviour? The only way I can get it back into slave mode is to follow the
> procedure in the wiki:
>
> # su - postgres
> $ rm -rf /var/lib/pgsql/data/       # discard the old, possibly corrupt data directory
> $ pg_basebackup -h 192.168.2.3 -U postgres -D /var/lib/pgsql/data -X stream -P
> $ rm /var/lib/pgsql/tmp/PGSQL.lock  # remove the RA's lock file so the node may start as slave
> $ exit
> # pcs resource cleanup msPostgresql # clear the failure state in Pacemaker
>
> Looking forward to your reply please
> Regards
> On Tue, Mar 17, 2015 at 7:55 AM, Wynand Jansen van Vuuren <esawyja at gmail.com>
> wrote:
>
> Hi Nakahira,
> I finally got around testing this, below is the initial state
>
> cl1_lb1:~ # crm_mon -1 -Af
> Last updated: Tue Mar 17 07:31:58 2015
> Last change: Tue Mar 17 07:31:12 2015 by root via crm_attribute on cl1_lb1
> Stack: classic openais (with plugin)
> Current DC: cl1_lb1 - partition with quorum
> Version: 1.1.9-2db99f1
> 2 Nodes configured, 2 expected votes
> 6 Resources configured.
>
>
> Online: [ cl1_lb1 cl2_lb1 ]
>
> Resource Group: master-group
> vip-master (ocf::heartbeat:IPaddr2): Started cl1_lb1
> vip-rep (ocf::heartbeat:IPaddr2): Started cl1_lb1
> CBC_instance (ocf::heartbeat:cbc): Started cl1_lb1
> failover_MailTo (ocf::heartbeat:MailTo): Started cl1_lb1
> Master/Slave Set: msPostgresql [pgsql]
> Masters: [ cl1_lb1 ]
> Slaves: [ cl2_lb1 ]
>
> Node Attributes:
> * Node cl1_lb1:
> + master-pgsql : 1000
> + pgsql-data-status : LATEST
> + pgsql-master-baseline : 00000008BE000000
> + pgsql-status : PRI
> * Node cl2_lb1:
> + master-pgsql : 100
> + pgsql-data-status : STREAMING|SYNC
> + pgsql-status : HS:sync
>
> Migration summary:
> * Node cl2_lb1:
> * Node cl1_lb1:
> cl1_lb1:~ #
> ###### - I then did an init 0 on the master node, cl1_lb1
>
> cl1_lb1:~ # init 0
> cl1_lb1:~ #
> Connection closed by foreign host.
>
> Disconnected from remote host(cl1_lb1) at 07:36:18.
>
> ###### - This was OK, as the slave took over and became master
>
> cl2_lb1:~ # crm_mon -1 -Af
> Last updated: Tue Mar 17 07:35:04 2015
> Last change: Tue Mar 17 07:34:29 2015 by root via crm_attribute on cl2_lb1
> Stack: classic openais (with plugin)
> Current DC: cl2_lb1 - partition WITHOUT quorum
> Version: 1.1.9-2db99f1
> 2 Nodes configured, 2 expected votes
> 6 Resources configured.
>
>
> Online: [ cl2_lb1 ]
> OFFLINE: [ cl1_lb1 ]
>
> Resource Group: master-group
> vip-master (ocf::heartbeat:IPaddr2): Started cl2_lb1
> vip-rep (ocf::heartbeat:IPaddr2): Started cl2_lb1
> CBC_instance (ocf::heartbeat:cbc): Started cl2_lb1
> failover_MailTo (ocf::heartbeat:MailTo): Started cl2_lb1
> Master/Slave Set: msPostgresql [pgsql]
> Masters: [ cl2_lb1 ]
> Stopped: [ pgsql:1 ]
>
> Node Attributes:
> * Node cl2_lb1:
> + master-pgsql : 1000
> + pgsql-data-status : LATEST
> + pgsql-master-baseline : 00000008C2000090
> + pgsql-status : PRI
>
> Migration summary:
> * Node cl2_lb1:
> cl2_lb1:~ #
> And the logs from Postgres and Corosync are attached
> ###### - I then restarted the original master cl1_lb1 and started Corosync
> manually.
> Once the original master cl1_lb1 was up and Corosync running, the status
> below happened; notice no VIPs and no Postgres.
> ###### - Still working below
>
> cl2_lb1:~ # crm_mon -1 -Af
> Last updated: Tue Mar 17 07:36:55 2015
> Last change: Tue Mar 17 07:34:29 2015 by root via crm_attribute on cl2_lb1
> Stack: classic openais (with plugin)
> Current DC: cl2_lb1 - partition WITHOUT quorum
> Version: 1.1.9-2db99f1
> 2 Nodes configured, 2 expected votes
> 6 Resources configured.
>
>
> Online: [ cl2_lb1 ]
> OFFLINE: [ cl1_lb1 ]
>
> Resource Group: master-group
> vip-master (ocf::heartbeat:IPaddr2): Started cl2_lb1
> vip-rep (ocf::heartbeat:IPaddr2): Started cl2_lb1
> CBC_instance (ocf::heartbeat:cbc): Started cl2_lb1
> failover_MailTo (ocf::heartbeat:MailTo): Started cl2_lb1
> Master/Slave Set: msPostgresql [pgsql]
> Masters: [ cl2_lb1 ]
> Stopped: [ pgsql:1 ]
>
> Node Attributes:
> * Node cl2_lb1:
> + master-pgsql : 1000
> + pgsql-data-status : LATEST
> + pgsql-master-baseline : 00000008C2000090
> + pgsql-status : PRI
>
> Migration summary:
> * Node cl2_lb1:
>
> ###### - After original master is up and Corosync running on cl1_lb1
>
> cl2_lb1:~ # crm_mon -1 -Af
> Last updated: Tue Mar 17 07:37:47 2015
> Last change: Tue Mar 17 07:37:21 2015 by root via crm_attribute on cl1_lb1
> Stack: classic openais (with plugin)
> Current DC: cl2_lb1 - partition with quorum
> Version: 1.1.9-2db99f1
> 2 Nodes configured, 2 expected votes
> 6 Resources configured.
>
>
> Online: [ cl1_lb1 cl2_lb1 ]
>
>
> Node Attributes:
> * Node cl1_lb1:
> + master-pgsql : -INFINITY
> + pgsql-data-status : LATEST
> + pgsql-status : STOP
> * Node cl2_lb1:
> + master-pgsql : -INFINITY
> + pgsql-data-status : DISCONNECT
> + pgsql-status : STOP
>
> Migration summary:
> * Node cl2_lb1:
> pgsql:0: migration-threshold=1 fail-count=2 last-failure='Tue Mar 17
> 07:37:26 2015'
> * Node cl1_lb1:
> pgsql:0: migration-threshold=1 fail-count=2 last-failure='Tue Mar 17
> 07:37:26 2015'
>
> Failed actions:
> pgsql_monitor_4000 (node=cl2_lb1, call=735, rc=7, status=complete): not
> running
> pgsql_monitor_4000 (node=cl1_lb1, call=42, rc=7, status=complete): not
> running
> cl2_lb1:~ #
> ##### - No VIPs up
>
> cl2_lb1:~ # ping 172.28.200.159
> PING 172.28.200.159 (172.28.200.159) 56(84) bytes of data.
> From 172.28.200.168 icmp_seq=1 Destination Host Unreachable
> From 172.28.200.168 icmp_seq=1 Destination Host Unreachable
> From 172.28.200.168 icmp_seq=2 Destination Host Unreachable
> From 172.28.200.168 icmp_seq=3 Destination Host Unreachable
> ^C
> --- 172.28.200.159 ping statistics ---
> 5 packets transmitted, 0 received, +4 errors, 100% packet loss, time 4024ms
> , pipe 3
> cl2_lb1:~ # ping 172.16.0.5
> PING 172.16.0.5 (172.16.0.5) 56(84) bytes of data.
> From 172.16.0.3 icmp_seq=1 Destination Host Unreachable
> From 172.16.0.3 icmp_seq=1 Destination Host Unreachable
> From 172.16.0.3 icmp_seq=2 Destination Host Unreachable
> From 172.16.0.3 icmp_seq=3 Destination Host Unreachable
> From 172.16.0.3 icmp_seq=5 Destination Host Unreachable
> From 172.16.0.3 icmp_seq=6 Destination Host Unreachable
> From 172.16.0.3 icmp_seq=7 Destination Host Unreachable
> ^C
> --- 172.16.0.5 ping statistics ---
> 8 packets transmitted, 0 received, +7 errors, 100% packet loss, time 7015ms
> , pipe 3
> cl2_lb1:~ #
>
> Any ideas please, or is it a case of recovering the original master manually
> before starting Corosync etc.?
> All logs are attached
> Regards
> On Mon, Mar 16, 2015 at 11:01 AM, Wynand Jansen van Vuuren
> <esawyja at gmail.com> wrote:
>
> Thanks for the advice. I have a demo on this now, so I don't want to
> test this now; I will do so tomorrow and forward the logs, many thanks!!
> On Mon, Mar 16, 2015 at 10:54 AM, NAKAHIRA Kazutomo
> <nakahira_kazutomo_b1 at lab.ntt.co.jp> wrote:
>
> Hi,
>
> > do you suggest that I take it out? or should I look at the problem where
> > cl2_lb1 is not being promoted?
>
> You should look at the problem where cl2_lb1 is not being promoted.
> And I will look into it if you send me the ha-log and PostgreSQL's log.
>
> Best regards,
> Kazutomo NAKAHIRA
>
>
> On 2015/03/16 17:18, Wynand Jansen van Vuuren wrote:
>
> Hi Nakahira,
> Thanks so much for the info; this setting was as the wiki page suggested.
> Do you suggest that I take it out, or should I look at the problem where
> cl2_lb1 is not being promoted?
> Regards
>
> On Mon, Mar 16, 2015 at 10:15 AM, NAKAHIRA Kazutomo <
> nakahira_kazutomo_b1 at lab.ntt.co.jp> wrote:
>
> Hi,
>
> Notice there are no VIPs; it looks like the VIPs depend on some other
> resource to start first?
>
> The following constraint means that "master-group" cannot start
> without a master of the msPostgresql resource:
>
> colocation rsc_colocation-1 inf: master-group msPostgresql:Master
>
> After you power off cl1_lb1, msPostgresql on cl2_lb1 is not promoted,
> and no master exists in your cluster.
>
> It means that "master-group" cannot run anywhere.
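Alongside the colocation rule quoted above, the configuration posted later in this thread also sequences the group with promotion and demotion; together these rules are why master-group stays down whenever no promoted pgsql instance exists:

```
colocation rsc_colocation-1 inf: master-group msPostgresql:Master
order rsc_order-1 0: msPostgresql:promote master-group:start symmetrical=false
order rsc_order-2 0: msPostgresql:demote master-group:stop symmetrical=false
```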
>
> Best regards,
> Kazutomo NAKAHIRA
>
>
> On 2015/03/16 16:48, Wynand Jansen van Vuuren wrote:
>
> Hi
> When I start out, cl1_lb1 (Cluster 1 load balancer 1) is the master, as below:
> cl1_lb1:~ # crm_mon -1 -Af
> Last updated: Mon Mar 16 09:44:44 2015
> Last change: Mon Mar 16 08:06:26 2015 by root via crm_attribute on cl1_lb1
> Stack: classic openais (with plugin)
> Current DC: cl2_lb1 - partition with quorum
> Version: 1.1.9-2db99f1
> 2 Nodes configured, 2 expected votes
> 6 Resources configured.
>
>
> Online: [ cl1_lb1 cl2_lb1 ]
>
> Resource Group: master-group
> vip-master (ocf::heartbeat:IPaddr2): Started cl1_lb1
> vip-rep (ocf::heartbeat:IPaddr2): Started cl1_lb1
> CBC_instance (ocf::heartbeat:cbc): Started cl1_lb1
> failover_MailTo (ocf::heartbeat:MailTo): Started cl1_lb1
> Master/Slave Set: msPostgresql [pgsql]
> Masters: [ cl1_lb1 ]
> Slaves: [ cl2_lb1 ]
>
> Node Attributes:
> * Node cl1_lb1:
> + master-pgsql : 1000
> + pgsql-data-status : LATEST
> + pgsql-master-baseline : 00000008B90061F0
> + pgsql-status : PRI
> * Node cl2_lb1:
> + master-pgsql : 100
> + pgsql-data-status : STREAMING|SYNC
> + pgsql-status : HS:sync
>
> Migration summary:
> * Node cl2_lb1:
> * Node cl1_lb1:
> cl1_lb1:~ #
>
> If I then do a power-off on cl1_lb1 (master), Postgres moves to cl2_lb1
> (Cluster 2 load balancer 1), but the VIP-MASTER and VIP-REP are not pingable
> from the NEW master (cl2_lb1); it stays like this below:
> cl2_lb1:~ # crm_mon -1 -Af
> Last updated: Mon Mar 16 07:32:07 2015
> Last change: Mon Mar 16 07:28:53 2015 by root via crm_attribute on cl1_lb1
> Stack: classic openais (with plugin)
> Current DC: cl2_lb1 - partition WITHOUT quorum
> Version: 1.1.9-2db99f1
> 2 Nodes configured, 2 expected votes
> 6 Resources configured.
>
>
> Online: [ cl2_lb1 ]
> OFFLINE: [ cl1_lb1 ]
>
> Master/Slave Set: msPostgresql [pgsql]
> Slaves: [ cl2_lb1 ]
> Stopped: [ pgsql:1 ]
>
> Node Attributes:
> * Node cl2_lb1:
> + master-pgsql : -INFINITY
> + pgsql-data-status : DISCONNECT
> + pgsql-status : HS:alone
>
> Migration summary:
> * Node cl2_lb1:
> cl2_lb1:~ #
>
> Notice there are no VIPs; it looks like the VIPs depend on some other
> resource to start first?
> Thanks for the reply!
>
>
> On Mon, Mar 16, 2015 at 9:42 AM, NAKAHIRA Kazutomo <
> nakahira_kazutomo_b1 at lab.ntt.co.jp> wrote:
>
> Hi,
>
>
> fine, cl2_lb1 takes over and acts as a slave, but the VIPs do not come up
>
>
> cl2_lb1 acts as a slave? Is it not a master?
> The VIPs come up with the master msPostgresql resource.
>
> If the promote action failed on cl2_lb1, then
> please send the ha-log and PostgreSQL's log.
>
> Best regards,
> Kazutomo NAKAHIRA
>
>
> On 2015/03/16 16:09, Wynand Jansen van Vuuren wrote:
>
> Hi all,
>
>
> I have 2 nodes, with 2 interfaces each. ETH0 is used for an application,
> CBC, that is writing to the Postgres DB on the VIP-MASTER 172.28.200.159;
> ETH1 is used for the Corosync configuration and VIP-REP. Everything works,
> but if the master, currently on cl1_lb1, has a catastrophic failure, like
> a power-down, the VIPs do not start on the slave. The Postgres part works
> fine: cl2_lb1 takes over and acts as a slave, but the VIPs do not come up.
> If I test it manually, i.e. kill the application 3 times on the master,
> the switchover is smooth; the same if I kill Postgres on the master. But
> when there is a power failure on the master, the VIPs stay down. If I then
> delete the attributes pgsql-data-status="LATEST" and
> pgsql-data-status="STREAMING|SYNC" on the slave after powering off the
> master and restart everything, then the VIPs come up on the slave. Any
> ideas please?
> I'm using this setup
> http://clusterlabs.org/wiki/PgSQL_Replicated_Cluster
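The attribute deletion described above can be done with crm_attribute rather than by editing the CIB by hand; a sketch, using the node and resource names from this thread (this must run against the live cluster, so it is shown only as an illustration):

```
# Drop the stale replication-status attributes so the pgsql RA re-evaluates both nodes
crm_attribute -l forever -N cl1_lb1 -n pgsql-data-status -D
crm_attribute -l forever -N cl2_lb1 -n pgsql-data-status -D
pcs resource cleanup msPostgresql
```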
>
> With this configuration below
> node cl1_lb1 \
>     attributes pgsql-data-status="LATEST"
> node cl2_lb1 \
>     attributes pgsql-data-status="STREAMING|SYNC"
> primitive CBC_instance ocf:heartbeat:cbc \
>     op monitor interval="60s" timeout="60s" on-fail="restart" \
>     op start interval="0s" timeout="60s" on-fail="restart" \
>     meta target-role="Started" migration-threshold="3" failure-timeout="60s"
> primitive failover_MailTo ocf:heartbeat:MailTo \
>     params email="wynandj at rorotika.com" subject="Cluster Status change - " \
>     op monitor interval="10" timeout="10" dept="0"
> primitive pgsql ocf:heartbeat:pgsql \
>     params pgctl="/opt/app/PostgreSQL/9.3/bin/pg_ctl" \
>         psql="/opt/app/PostgreSQL/9.3/bin/psql" \
>         config="/opt/app/pgdata/9.3/postgresql.conf" \
>         pgdba="postgres" pgdata="/opt/app/pgdata/9.3/" start_opt="-p 5432" \
>         rep_mode="sync" node_list="cl1_lb1 cl2_lb1" \
>         restore_command="cp /pgtablespace/archive/%f %p" \
>         primary_conninfo_opt="keepalives_idle=60 keepalives_interval=5 keepalives_count=5" \
>         master_ip="172.16.0.5" restart_on_promote="false" \
>         logfile="/var/log/OCF.log" \
>     op start interval="0s" timeout="60s" on-fail="restart" \
>     op monitor interval="4s" timeout="60s" on-fail="restart" \
>     op monitor interval="3s" role="Master" timeout="60s" on-fail="restart" \
>     op promote interval="0s" timeout="60s" on-fail="restart" \
>     op demote interval="0s" timeout="60s" on-fail="stop" \
>     op stop interval="0s" timeout="60s" on-fail="block" \
>     op notify interval="0s" timeout="60s"
> primitive vip-master ocf:heartbeat:IPaddr2 \
>     params ip="172.28.200.159" nic="eth0" iflabel="CBC_VIP" cidr_netmask="24" \
>     op start interval="0s" timeout="60s" on-fail="restart" \
>     op monitor interval="10s" timeout="60s" on-fail="restart" \
>     op stop interval="0s" timeout="60s" on-fail="block" \
>     meta target-role="Started"
> primitive vip-rep ocf:heartbeat:IPaddr2 \
>     params ip="172.16.0.5" nic="eth1" iflabel="REP_VIP" cidr_netmask="24" \
>     meta migration-threshold="0" target-role="Started" \
>     op start interval="0s" timeout="60s" on-fail="stop" \
>     op monitor interval="10s" timeout="60s" on-fail="restart" \
>     op stop interval="0s" timeout="60s" on-fail="restart"
> group master-group vip-master vip-rep CBC_instance failover_MailTo
> ms msPostgresql pgsql \
>     meta master-max="1" master-node-max="1" clone-max="2" \
>     clone-node-max="1" notify="true"
> colocation rsc_colocation-1 inf: master-group msPostgresql:Master
> order rsc_order-1 0: msPostgresql:promote master-group:start symmetrical=false
> order rsc_order-2 0: msPostgresql:demote master-group:stop symmetrical=false
> property $id="cib-bootstrap-options" \
>     dc-version="1.1.9-2db99f1" \
>     cluster-infrastructure="classic openais (with plugin)" \
>     expected-quorum-votes="2" \
>     no-quorum-policy="ignore" \
>     stonith-enabled="false" \
>     cluster-recheck-interval="1min" \
>     crmd-transition-delay="0s" \
>     last-lrm-refresh="1426485983"
> rsc_defaults $id="rsc-options" \
>     resource-stickiness="INFINITY" \
>     migration-threshold="1"
> #vim:set syntax=pcmk
>
> Any ideas please, I'm lost......
>
>
>
> _______________________________________________
> Users mailing list: Users at clusterlabs.org
> http://clusterlabs.org/mailman/listinfo/users
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
>
>
>
> --
> NTT Open Source Software Center
> Kazutomo NAKAHIRA
> TEL: 03-5860-5135 FAX: 03-5463-6490
> Mail: nakahira_kazutomo_b1 at lab.ntt.co.jp
>