[ClusterLabs] Postgres streaming VIP-REP not coming up on slave
Wynand Jansen van Vuuren
esawyja at gmail.com
Tue Mar 17 05:55:40 UTC 2015
Hi Nakahira,
I finally got around to testing this. Below is the initial state:
cl1_lb1:~ # crm_mon -1 -Af
Last updated: Tue Mar 17 07:31:58 2015
Last change: Tue Mar 17 07:31:12 2015 by root via crm_attribute on cl1_lb1
Stack: classic openais (with plugin)
Current DC: cl1_lb1 - partition with quorum
Version: 1.1.9-2db99f1
2 Nodes configured, 2 expected votes
6 Resources configured.
Online: [ cl1_lb1 cl2_lb1 ]
Resource Group: master-group
vip-master (ocf::heartbeat:IPaddr2): Started cl1_lb1
vip-rep (ocf::heartbeat:IPaddr2): Started cl1_lb1
CBC_instance (ocf::heartbeat:cbc): Started cl1_lb1
failover_MailTo (ocf::heartbeat:MailTo): Started cl1_lb1
Master/Slave Set: msPostgresql [pgsql]
Masters: [ cl1_lb1 ]
Slaves: [ cl2_lb1 ]
Node Attributes:
* Node cl1_lb1:
+ master-pgsql : 1000
+ pgsql-data-status : LATEST
+ pgsql-master-baseline : 00000008BE000000
+ pgsql-status : PRI
* Node cl2_lb1:
+ master-pgsql : 100
+ pgsql-data-status : STREAMING|SYNC
+ pgsql-status : HS:sync
Migration summary:
* Node cl2_lb1:
* Node cl1_lb1:
cl1_lb1:~ #
###### - I then did an init 0 on the master node, cl1_lb1
cl1_lb1:~ # init 0
cl1_lb1:~ #
Connection closed by foreign host.
Disconnected from remote host(cl1_lb1) at 07:36:18.
Type `help' to learn how to use Xshell prompt.
[c:\~]$
###### - This was OK: the slave took over and became master
cl2_lb1:~ # crm_mon -1 -Af
Last updated: Tue Mar 17 07:35:04 2015
Last change: Tue Mar 17 07:34:29 2015 by root via crm_attribute on cl2_lb1
Stack: classic openais (with plugin)
Current DC: cl2_lb1 - partition WITHOUT quorum
Version: 1.1.9-2db99f1
2 Nodes configured, 2 expected votes
6 Resources configured.
Online: [ cl2_lb1 ]
OFFLINE: [ cl1_lb1 ]
Resource Group: master-group
vip-master (ocf::heartbeat:IPaddr2): Started cl2_lb1
vip-rep (ocf::heartbeat:IPaddr2): Started cl2_lb1
CBC_instance (ocf::heartbeat:cbc): Started cl2_lb1
failover_MailTo (ocf::heartbeat:MailTo): Started cl2_lb1
Master/Slave Set: msPostgresql [pgsql]
Masters: [ cl2_lb1 ]
Stopped: [ pgsql:1 ]
Node Attributes:
* Node cl2_lb1:
+ master-pgsql : 1000
+ pgsql-data-status : LATEST
+ pgsql-master-baseline : 00000008C2000090
+ pgsql-status : PRI
Migration summary:
* Node cl2_lb1:
cl2_lb1:~ #
And the logs from Postgres and Corosync are attached
###### - I then restarted the original master cl1_lb1 and started Corosync
manually
Once the original master cl1_lb1 was up and Corosync was running, the status
below resulted; notice there are no VIPs and no Postgres
###### - Still working below
cl2_lb1:~ # crm_mon -1 -Af
Last updated: Tue Mar 17 07:36:55 2015
Last change: Tue Mar 17 07:34:29 2015 by root via crm_attribute on cl2_lb1
Stack: classic openais (with plugin)
Current DC: cl2_lb1 - partition WITHOUT quorum
Version: 1.1.9-2db99f1
2 Nodes configured, 2 expected votes
6 Resources configured.
Online: [ cl2_lb1 ]
OFFLINE: [ cl1_lb1 ]
Resource Group: master-group
vip-master (ocf::heartbeat:IPaddr2): Started cl2_lb1
vip-rep (ocf::heartbeat:IPaddr2): Started cl2_lb1
CBC_instance (ocf::heartbeat:cbc): Started cl2_lb1
failover_MailTo (ocf::heartbeat:MailTo): Started cl2_lb1
Master/Slave Set: msPostgresql [pgsql]
Masters: [ cl2_lb1 ]
Stopped: [ pgsql:1 ]
Node Attributes:
* Node cl2_lb1:
+ master-pgsql : 1000
+ pgsql-data-status : LATEST
+ pgsql-master-baseline : 00000008C2000090
+ pgsql-status : PRI
Migration summary:
* Node cl2_lb1:
###### - After original master is up and Corosync running on cl1_lb1
cl2_lb1:~ # crm_mon -1 -Af
Last updated: Tue Mar 17 07:37:47 2015
Last change: Tue Mar 17 07:37:21 2015 by root via crm_attribute on cl1_lb1
Stack: classic openais (with plugin)
Current DC: cl2_lb1 - partition with quorum
Version: 1.1.9-2db99f1
2 Nodes configured, 2 expected votes
6 Resources configured.
Online: [ cl1_lb1 cl2_lb1 ]
Node Attributes:
* Node cl1_lb1:
+ master-pgsql : -INFINITY
+ pgsql-data-status : LATEST
+ pgsql-status : STOP
* Node cl2_lb1:
+ master-pgsql : -INFINITY
+ pgsql-data-status : DISCONNECT
+ pgsql-status : STOP
Migration summary:
* Node cl2_lb1:
pgsql:0: migration-threshold=1 fail-count=2 last-failure='Tue Mar 17 07:37:26 2015'
* Node cl1_lb1:
pgsql:0: migration-threshold=1 fail-count=2 last-failure='Tue Mar 17 07:37:26 2015'
Failed actions:
pgsql_monitor_4000 (node=cl2_lb1, call=735, rc=7, status=complete): not running
pgsql_monitor_4000 (node=cl1_lb1, call=42, rc=7, status=complete): not running
cl2_lb1:~ #
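To compare the node attributes between runs without reading the whole status by eye, a small filter over the crm_mon output can help. This is only a sketch: node_attrs is a hypothetical helper name, and it assumes the attribute lines keep the "* Node <name>:" and "+ <attr> : <value>" layout shown in the output above.

```shell
# Hypothetical helper: print "node attribute value" for every node
# attribute found in `crm_mon -1 -Af` output read from stdin.
# Assumes the "* Node <name>:" / "+ <attr> : <value>" layout shown above.
node_attrs() {
    awk '
        # "* Node cl1_lb1:" starts a node section; strip the trailing colon.
        /^[ \t]*[*] Node / { node = $3; sub(/:$/, "", node); next }
        # "+ master-pgsql : -INFINITY" has the name in $2 and the value in $4.
        /^[ \t]*[+] /      { print node, $2, $4 }
    '
}

# Example: list nodes whose promotion score is -INFINITY.
# crm_mon -1 -Af | node_attrs | awk '$2 == "master-pgsql" && $3 == "-INFINITY"'
```

Run against the status above, the commented example would list both cl1_lb1 and cl2_lb1, which matches the fact that neither node can be promoted.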
##### - No VIPs up
cl2_lb1:~ # ping 172.28.200.159
PING 172.28.200.159 (172.28.200.159) 56(84) bytes of data.
From 172.28.200.168: icmp_seq=1 Destination Host Unreachable
From 172.28.200.168 icmp_seq=1 Destination Host Unreachable
From 172.28.200.168 icmp_seq=2 Destination Host Unreachable
From 172.28.200.168 icmp_seq=3 Destination Host Unreachable
^C
--- 172.28.200.159 ping statistics ---
5 packets transmitted, 0 received, +4 errors, 100% packet loss, time 4024ms, pipe 3
cl2_lb1:~ # ping 172.16.0.5
PING 172.16.0.5 (172.16.0.5) 56(84) bytes of data.
From 172.16.0.3: icmp_seq=1 Destination Host Unreachable
From 172.16.0.3 icmp_seq=1 Destination Host Unreachable
From 172.16.0.3 icmp_seq=2 Destination Host Unreachable
From 172.16.0.3 icmp_seq=3 Destination Host Unreachable
From 172.16.0.3 icmp_seq=5 Destination Host Unreachable
From 172.16.0.3 icmp_seq=6 Destination Host Unreachable
From 172.16.0.3 icmp_seq=7 Destination Host Unreachable
^C
--- 172.16.0.5 ping statistics ---
8 packets transmitted, 0 received, +7 errors, 100% packet loss, time 7015ms, pipe 3
cl2_lb1:~ #
Any ideas please, or is it a case of recovering the original master
manually before starting Corosync etc?
All logs are attached
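If it does turn out that the old master has to be recovered by hand, the sequence would probably look something like the sketch below. It is untested here and rests on several assumptions that this thread does not confirm: that the pgsql resource agent left its lock file in its default tmpdir (/var/lib/pgsql/tmp/PGSQL.lock, since tmpdir is not set in my configuration), that the new master is running and holding vip-rep (172.16.0.5) at the time, and that the postgres role is allowed to take a base backup. The names run and rejoin_as_standby are made up for illustration.

```shell
# Sketch of a manual rejoin of the old master as a hot standby.
# All paths except LOCKFILE come from the pgsql primitive in this thread;
# LOCKFILE is the resource agent default and is an assumption.
PGDATA=/opt/app/pgdata/9.3
PGBIN=/opt/app/PostgreSQL/9.3/bin
LOCKFILE=/var/lib/pgsql/tmp/PGSQL.lock
MASTER_IP=172.16.0.5

# Print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        printf '+ %s\n' "$*"
    else
        "$@"
    fi
}

rejoin_as_standby() {
    # Keep the stale data directory instead of deleting it.
    run mv "$PGDATA" "$PGDATA.old"
    # Copy a fresh base backup from the current master.
    run su - postgres -c "$PGBIN/pg_basebackup -h $MASTER_IP -U postgres -D $PGDATA -X stream -P"
    # Remove the lock file left behind by the crashed master.
    run rm -f "$LOCKFILE"
    # Clear the fail counts so Pacemaker retries the resource.
    run crm resource cleanup msPostgresql
}

# On the old master (cl1_lb1), print the steps before running them:
# DRY_RUN=1 rejoin_as_standby
```

The dry-run switch is there so the steps can be reviewed first, since moving the data directory aside is not something to do by accident.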
Regards
On Mon, Mar 16, 2015 at 11:01 AM, Wynand Jansen van Vuuren <
esawyja at gmail.com> wrote:
> Thanks for the advice. I have a demo on this now, so I don't want to test
> this now; I will do so tomorrow and forward the logs. Many thanks!!
>
> On Mon, Mar 16, 2015 at 10:54 AM, NAKAHIRA Kazutomo <
> nakahira_kazutomo_b1 at lab.ntt.co.jp> wrote:
>
>> Hi,
>>
>> > do you suggest that I take it out? or should I look at the problem where
>> > cl2_lb1 is not being promoted?
>>
>> You should look at the problem where cl2_lb1 is not being promoted.
>> I will look into it if you send me the ha-log and PostgreSQL's log.
>>
>> Best regards,
>> Kazutomo NAKAHIRA
>>
>>
>> On 2015/03/16 17:18, Wynand Jansen van Vuuren wrote:
>>
>>> Hi Nakahira,
>>> Thanks so much for the info, this setting was as the wiki page suggested,
>>> do you suggest that I take it out? or should I look at the problem where
>>> cl2_lb1 is not being promoted?
>>> Regards
>>>
>>> On Mon, Mar 16, 2015 at 10:15 AM, NAKAHIRA Kazutomo <
>>> nakahira_kazutomo_b1 at lab.ntt.co.jp> wrote:
>>>
>>>> Hi,
>>>>
>>>>> Notice there are no VIPs; it looks like the VIPs depend on some other
>>>>> resource to start first?
>>>>
>>>> The following constraint means that "master-group" cannot start
>>>> without a master of the msPostgresql resource.
>>>>
>>>> colocation rsc_colocation-1 inf: master-group msPostgresql:Master
>>>>
>>>> After you powered off cl1_lb1, msPostgresql on cl2_lb1 was not promoted,
>>>> so no master exists in your cluster.
>>>>
>>>> It means that "master-group" cannot run anywhere.
>>>>
>>>> Best regards,
>>>> Kazutomo NAKAHIRA
>>>>
>>>>
>>>> On 2015/03/16 16:48, Wynand Jansen van Vuuren wrote:
>>>>
>>>>> Hi
>>>>> When I start out, cl1_lb1 (Cluster 1 load balancer 1) is the master, as
>>>>> below
>>>>> cl1_lb1:~ # crm_mon -1 -Af
>>>>> Last updated: Mon Mar 16 09:44:44 2015
>>>>> Last change: Mon Mar 16 08:06:26 2015 by root via crm_attribute on cl1_lb1
>>>>> Stack: classic openais (with plugin)
>>>>> Current DC: cl2_lb1 - partition with quorum
>>>>> Version: 1.1.9-2db99f1
>>>>> 2 Nodes configured, 2 expected votes
>>>>> 6 Resources configured.
>>>>>
>>>>>
>>>>> Online: [ cl1_lb1 cl2_lb1 ]
>>>>>
>>>>> Resource Group: master-group
>>>>> vip-master (ocf::heartbeat:IPaddr2): Started cl1_lb1
>>>>> vip-rep (ocf::heartbeat:IPaddr2): Started cl1_lb1
>>>>> CBC_instance (ocf::heartbeat:cbc): Started cl1_lb1
>>>>> failover_MailTo (ocf::heartbeat:MailTo): Started cl1_lb1
>>>>> Master/Slave Set: msPostgresql [pgsql]
>>>>> Masters: [ cl1_lb1 ]
>>>>> Slaves: [ cl2_lb1 ]
>>>>>
>>>>> Node Attributes:
>>>>> * Node cl1_lb1:
>>>>> + master-pgsql : 1000
>>>>> + pgsql-data-status : LATEST
>>>>> + pgsql-master-baseline : 00000008B90061F0
>>>>> + pgsql-status : PRI
>>>>> * Node cl2_lb1:
>>>>> + master-pgsql : 100
>>>>> + pgsql-data-status : STREAMING|SYNC
>>>>> + pgsql-status : HS:sync
>>>>>
>>>>> Migration summary:
>>>>> * Node cl2_lb1:
>>>>> * Node cl1_lb1:
>>>>> cl1_lb1:~ #
>>>>>
>>>>> If I then do a power off on cl1_lb1 (master), Postgres moves to cl2_lb1
>>>>> (Cluster 2 load balancer 1), but the VIP-MASTER and VIP-REP are not
>>>>> pingable from the NEW master (cl2_lb1); it stays like this, below
>>>>> cl2_lb1:~ # crm_mon -1 -Af
>>>>> Last updated: Mon Mar 16 07:32:07 2015
>>>>> Last change: Mon Mar 16 07:28:53 2015 by root via crm_attribute on cl1_lb1
>>>>> Stack: classic openais (with plugin)
>>>>> Current DC: cl2_lb1 - partition WITHOUT quorum
>>>>> Version: 1.1.9-2db99f1
>>>>> 2 Nodes configured, 2 expected votes
>>>>> 6 Resources configured.
>>>>>
>>>>>
>>>>> Online: [ cl2_lb1 ]
>>>>> OFFLINE: [ cl1_lb1 ]
>>>>>
>>>>> Master/Slave Set: msPostgresql [pgsql]
>>>>> Slaves: [ cl2_lb1 ]
>>>>> Stopped: [ pgsql:1 ]
>>>>>
>>>>> Node Attributes:
>>>>> * Node cl2_lb1:
>>>>> + master-pgsql : -INFINITY
>>>>> + pgsql-data-status : DISCONNECT
>>>>> + pgsql-status : HS:alone
>>>>>
>>>>> Migration summary:
>>>>> * Node cl2_lb1:
>>>>> cl2_lb1:~ #
>>>>>
>>>>> Notice there are no VIPs; it looks like the VIPs depend on some other
>>>>> resource to start first?
>>>>> Thanks for the reply!
>>>>>
>>>>>
>>>>> On Mon, Mar 16, 2015 at 9:42 AM, NAKAHIRA Kazutomo <
>>>>> nakahira_kazutomo_b1 at lab.ntt.co.jp> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>>> fine, cl2_lb1 takes over and acts as a slave, but the VIPs does not
>>>>>>> come
>>>>>>
>>>>>> cl2_lb1 acts as a slave? It is not a master?
>>>>>> The VIPs come up with the master of the msPostgresql resource.
>>>>>>
>>>>>> If the promote action failed on cl2_lb1, then
>>>>>> please send the ha-log and PostgreSQL's log.
>>>>>>
>>>>>> Best regards,
>>>>>> Kazutomo NAKAHIRA
>>>>>>
>>>>>>
>>>>>> On 2015/03/16 16:09, Wynand Jansen van Vuuren wrote:
>>>>>>
>>>>>>> Hi all,
>>>>>>>
>>>>>>> I have 2 nodes with 2 interfaces each. ETH0 is used for an
>>>>>>> application, CBC, that writes to the Postgres DB on the VIP-MASTER
>>>>>>> 172.28.200.159; ETH1 is used for the Corosync configuration and
>>>>>>> VIP-REP. Everything works, but if the master, currently on cl1_lb1,
>>>>>>> has a catastrophic failure such as a power down, the VIPs do not
>>>>>>> start on the slave. The Postgres part works fine: cl2_lb1 takes over
>>>>>>> and acts as a slave, but the VIPs do not come up. If I test it
>>>>>>> manually, i.e. kill the application 3 times on the master, the
>>>>>>> switchover is smooth; the same if I kill Postgres on the master. But
>>>>>>> when there is a power failure on the master, the VIPs stay down. If I
>>>>>>> then delete the attributes pgsql-data-status="LATEST" and attributes
>>>>>>> pgsql-data-status="STREAMING|SYNC" on the slave after power off on
>>>>>>> the master and restart everything, then the VIPs come up on the
>>>>>>> slave. Any ideas please?
>>>>>>> I'm using this setup
>>>>>>> http://clusterlabs.org/wiki/PgSQL_Replicated_Cluster
>>>>>>>
>>>>>>> With this configuration below
>>>>>>> node cl1_lb1 \
>>>>>>>     attributes pgsql-data-status="LATEST"
>>>>>>> node cl2_lb1 \
>>>>>>>     attributes pgsql-data-status="STREAMING|SYNC"
>>>>>>> primitive CBC_instance ocf:heartbeat:cbc \
>>>>>>>     op monitor interval="60s" timeout="60s" on-fail="restart" \
>>>>>>>     op start interval="0s" timeout="60s" on-fail="restart" \
>>>>>>>     meta target-role="Started" migration-threshold="3" failure-timeout="60s"
>>>>>>> primitive failover_MailTo ocf:heartbeat:MailTo \
>>>>>>>     params email="wynandj at rorotika.com" subject="Cluster Status change - " \
>>>>>>>     op monitor interval="10" timeout="10" dept="0"
>>>>>>> primitive pgsql ocf:heartbeat:pgsql \
>>>>>>>     params pgctl="/opt/app/PostgreSQL/9.3/bin/pg_ctl" \
>>>>>>>         psql="/opt/app/PostgreSQL/9.3/bin/psql" \
>>>>>>>         config="/opt/app/pgdata/9.3/postgresql.conf" pgdba="postgres" \
>>>>>>>         pgdata="/opt/app/pgdata/9.3/" start_opt="-p 5432" rep_mode="sync" \
>>>>>>>         node_list="cl1_lb1 cl2_lb1" \
>>>>>>>         restore_command="cp /pgtablespace/archive/%f %p" \
>>>>>>>         primary_conninfo_opt="keepalives_idle=60 keepalives_interval=5 keepalives_count=5" \
>>>>>>>         master_ip="172.16.0.5" restart_on_promote="false" \
>>>>>>>         logfile="/var/log/OCF.log" \
>>>>>>>     op start interval="0s" timeout="60s" on-fail="restart" \
>>>>>>>     op monitor interval="4s" timeout="60s" on-fail="restart" \
>>>>>>>     op monitor interval="3s" role="Master" timeout="60s" on-fail="restart" \
>>>>>>>     op promote interval="0s" timeout="60s" on-fail="restart" \
>>>>>>>     op demote interval="0s" timeout="60s" on-fail="stop" \
>>>>>>>     op stop interval="0s" timeout="60s" on-fail="block" \
>>>>>>>     op notify interval="0s" timeout="60s"
>>>>>>> primitive vip-master ocf:heartbeat:IPaddr2 \
>>>>>>>     params ip="172.28.200.159" nic="eth0" iflabel="CBC_VIP" cidr_netmask="24" \
>>>>>>>     op start interval="0s" timeout="60s" on-fail="restart" \
>>>>>>>     op monitor interval="10s" timeout="60s" on-fail="restart" \
>>>>>>>     op stop interval="0s" timeout="60s" on-fail="block" \
>>>>>>>     meta target-role="Started"
>>>>>>> primitive vip-rep ocf:heartbeat:IPaddr2 \
>>>>>>>     params ip="172.16.0.5" nic="eth1" iflabel="REP_VIP" cidr_netmask="24" \
>>>>>>>     meta migration-threshold="0" target-role="Started" \
>>>>>>>     op start interval="0s" timeout="60s" on-fail="stop" \
>>>>>>>     op monitor interval="10s" timeout="60s" on-fail="restart" \
>>>>>>>     op stop interval="0s" timeout="60s" on-fail="restart"
>>>>>>> group master-group vip-master vip-rep CBC_instance failover_MailTo
>>>>>>> ms msPostgresql pgsql \
>>>>>>>     meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1" notify="true"
>>>>>>> colocation rsc_colocation-1 inf: master-group msPostgresql:Master
>>>>>>> order rsc_order-1 0: msPostgresql:promote master-group:start symmetrical=false
>>>>>>> order rsc_order-2 0: msPostgresql:demote master-group:stop symmetrical=false
>>>>>>> property $id="cib-bootstrap-options" \
>>>>>>>     dc-version="1.1.9-2db99f1" \
>>>>>>>     cluster-infrastructure="classic openais (with plugin)" \
>>>>>>>     expected-quorum-votes="2" \
>>>>>>>     no-quorum-policy="ignore" \
>>>>>>>     stonith-enabled="false" \
>>>>>>>     cluster-recheck-interval="1min" \
>>>>>>>     crmd-transition-delay="0s" \
>>>>>>>     last-lrm-refresh="1426485983"
>>>>>>> rsc_defaults $id="rsc-options" \
>>>>>>>     resource-stickiness="INFINITY" \
>>>>>>>     migration-threshold="1"
>>>>>>> #vim:set syntax=pcmk
>>>>>>>
>>>>>>> Any ideas please, I'm lost......
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> Users mailing list: Users at clusterlabs.org
>>>>>>> http://clusterlabs.org/mailman/listinfo/users
>>>>>>>
>>>>>>> Project Home: http://www.clusterlabs.org
>>>>>>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>>>>>>> Bugs: http://bugs.clusterlabs.org
>>>> --
>>>> NTT Open Source Software Center
>>>> Kazutomo NAKAHIRA
>>>> TEL: 03-5860-5135 FAX: 03-5463-6490
>>>> Mail: nakahira_kazutomo_b1 at lab.ntt.co.jp
>> --
>> NTT Open Source Software Center
>> Kazutomo NAKAHIRA
>> TEL: 03-5860-5135 FAX: 03-5463-6490
>> Mail: nakahira_kazutomo_b1 at lab.ntt.co.jp
-------------- next part --------------
A non-text attachment was scrubbed...
Name: corosync.log
Type: application/octet-stream
Size: 536076 bytes
Desc: not available
URL: <https://lists.clusterlabs.org/pipermail/users/attachments/20150317/f6fe4701/attachment-0008.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: postgres_log_on_slave.log
Type: application/octet-stream
Size: 6915 bytes
Desc: not available
URL: <https://lists.clusterlabs.org/pipermail/users/attachments/20150317/f6fe4701/attachment-0009.obj>