[Pacemaker] Postgresql streaming replication failover - RA needed

Mon Dec 12 07:32:20 EST 2011

Hello

2011/12/12 Serge Dubrouski <sergeyfd at gmail.com>:
>
>
> On Thu, Dec 8, 2011 at 10:34 PM, Takatoshi MATSUO <matsuo.tak at gmail.com>
> wrote:
>>
>> Hi Attila
>>
>> 2011/12/8 Attila Megyeri <amegyeri at minerva-soft.com>:
>> > Hi Takatoshi,
>> >
>> > I noticed one strange thing that could probably be improved.
>> > When there is data inconsistency, I have the following node properties:
>> >
>> > * Node psql2:
>> >    + default_ping_set                  : 100
>> >    + master-postgresql:1               : -INFINITY
>> >    + pgsql-data-status                 : DISCONNECT
>> >    + pgsql-status                      : HS:alone
>> > * Node psql1:
>> >    + default_ping_set                  : 100
>> >    + master-postgresql:0               : 1000
>> >    + master-postgresql:1               : -INFINITY
>> >    + pgsql-data-status                 : LATEST
>> >    + pgsql-master-baseline             : 58:000000004B000020
>> >    + pgsql-status                      : PRI
>> >
>> > This is fine, and understandable - but I can see this only if I do a
>> > crm_mon -A.
>> >
>> > My problem is that CRM shows the following:
>> >
>> > Master/Slave Set: db-ms-psql [postgresql]
>> >     Masters: [ psql1 ]
>> >     Slaves: [ psql2 ]
>> >
>> > So if I monitor the system from crm_mon, HAWK or other tools - I have no
>> > indication at all that the slave is running in an inconsistent mode.
>> >
>> > I would expect the RA to stop the psql2 node in such cases, because:
>> > - It is running, but has non-up-to-date data, therefore no one will use
>> > it (the slave IP points to the master as well, which is good)
>> > - In CRM status everything looks perfect, even though it is NOT perfect
>> > and admin intervention is required.
>> >
>> >
>> > Shouldn't the disconnected PSQL server be stopped instead?
>>
>> Hmm...
>> I don't think it is better to stop the PGSQL server.
>> The RA cannot know whether PGSQL is disconnected because of
>> data inconsistency, a network outage,
>> a start-up still in progress, and so on.
>
>
> Why does it matter? If the state is degraded and inconsistent and there is
> no way to fix it from inside of the RA, RA should probably stop it.

In this case, the HS's data may be consistent, but the Primary doesn't have
enough WALs or the HS doesn't have enough WAL archives to enter replication mode.
Unfortunately, this RA doesn't calculate the number of WALs.

> Let's say that there is pgpool running in front of the cluster. Keeping an
> inconsistent node up would lead to SQL queries being routed to it and
> possibly returning wrong results.
>

That doesn't happen in my sample configuration.
vip-slave runs on the master when the slave is not "HS:sync".

>>
>>
>>
>> How about using a dummy RA, in the same way as vip-slave?
>> -------------------------------------------
>> primitive runningSlaveOK ocf:heartbeat:Dummy
>> .....(snip)
>>
>> location rsc_location-dummy runningSlaveOK \
>>     rule  200: pgsql-status eq "HS:sync"
>> -------------------------------------------

>
> That probably fixes the visibility issue. What about notifications on the
> DISCONNECT state? How would the administrator know that the cluster is
> inconsistent? Maybe the better option in this case would be colocating a
> MailTo resource with "HS:alone"?

Yes, it's a good idea if you want to receive notifications.
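
A rough, untested sketch of that idea might look like the following. The
resource name, constraint id, e-mail address and subject are placeholders
I made up for illustration. It assumes the MailTo RA sends a mail when the
resource starts or stops, and that pgsql-status is set as a node attribute
as in the crm_mon -A output above.
-------------------------------------------
# hypothetical names: mailSlaveAlone, rsc_location-mail, admin@example.com
primitive mailSlaveAlone ocf:heartbeat:MailTo \
    params email="admin@example.com" subject="pgsql is HS:alone" \
    op monitor interval="10s"

# keep the resource off every node whose pgsql-status is not "HS:alone"
location rsc_location-mail mailSlaveAlone \
    rule -inf: not_defined pgsql-status or pgsql-status ne "HS:alone"
-------------------------------------------
With a rule like this the resource should only start on a node that is
"HS:alone", so a start mail would indicate that the slave has lost sync.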

Regards,
Takatoshi MATSUO