[ClusterLabs] Postgresql HA with Corosync/Pacemaker

Thu Mar 12 02:54:28 EDT 2015

Andrew et al,

I'm impressed by the amount of work that has gone into this project. Normally
I have nothing but praise. Today I'm at my wits' end. After several weeks of
unsuccessfully wrestling with the unnavigable mixture of software versions,
cluster shells, and OS-vendor-specific issues, I'm turning to you for help.

Goal: an HA Postgresql cluster not unlike the one described at
http://clusterlabs.org/wiki/PgSQL_Replicated_Cluster, but with quorum
(a third, voting-only entity that arbitrates split-brain issues).

OS/Vendor: Linux 2.6.32, RHEL 6.5  (The voting-only quorum-maker is
actually CentOS 6.6. It was not my choice.)
Postgresql: 9.1.8  (currently stuck with that for compatibility reasons and
experience base, but I could upgrade)
Transport Protocol: udpu (required for network reasons)

-- Rounds 1 through 5  --
  Corosync-1.4.7
  Pacemaker 1.1.12  (libs, cli)
  clusterlib-3.0.12
  cluster-glue-libs-1.0.5
  resource-agents-3.9.5

  Note: Between my 1st & 2nd tries on this one, 5 weeks had passed due to
an illness.

  First, the documentation (see link) is in error: it suggests a
corosync configuration with the pacemaker service at version 0 (ver: 0) and
then describes launching pacemaker after corosync. Specifying version 0
causes corosync to start pacemaker itself, as a plugin. The two approaches
conflict and nothing really works -- corosync is left with zombie child
processes.  I also hit another error that caused unexpected results in the
TOTEM protocol -- I think it was fixed by removing mcastaddr, which
apparently doesn't mix well with the udpu transport.
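
  For reference, here is a minimal sketch of a corosync.conf that avoids both
problems, assuming corosync 1.4.x with pacemaker started separately from its
own init script. The addresses are placeholders from the 192.0.2.0/24
documentation range and the three-member layout (two database nodes plus the
voting-only node) is an assumption, not the configuration actually in use:

```
# Sketch of /etc/corosync/corosync.conf for corosync 1.4.x (assumed layout).
# Every address below is a hypothetical placeholder.
totem {
    version: 2
    secauth: off
    # udpu is UDP unicast: peers are listed explicitly as members,
    # so no mcastaddr appears anywhere in this file.
    transport: udpu
    interface {
        ringnumber: 0
        # Network address of the LAN the ring binds to (placeholder).
        bindnetaddr: 192.0.2.0
        mcastport: 5405
        # Hypothetical first database node.
        member {
            memberaddr: 192.0.2.11
        }
        # Hypothetical second database node.
        member {
            memberaddr: 192.0.2.12
        }
        # Hypothetical voting-only node.
        member {
            memberaddr: 192.0.2.13
        }
    }
}

service {
    name: pacemaker
    # ver: 1 means corosync does not spawn the pacemaker daemons;
    # they are started afterwards by the pacemaker init script.
    ver: 1
}
```

  With ver: 0 in the service block, corosync would launch the pacemaker
daemons itself, which conflicts with starting pacemaker separately afterwards.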

  Eventually, corosync ran correctly and commands such as "crm_mon -Afr
-1" showed expected results.

  Also got postgresql to work as master/slave (this is where my experience
base is, so no problem here).

  I used a pcs configuration fairly close to the one described in the
documentation. I'm not using 3 different subnets as that's really
unnecessary and quite impossible for me. There are two "physical" IPs and 2
virtual/service IPs. They are all within the same /22 CIDR LAN, but I don't
see how that's a problem.

  After running the pcs script, pacemaker starts both the PRI and HS in
recovery mode. I cannot see a reason for this.
  I start over with the configuration (clearing it, restarting pacemaker
everywhere) and this time leave pacemaker stopped on the secondary.
The primary is not started. No reason is given, but the
crm_mon output indicates something odd:
    + master-pgsql                      : -INFINITY
    + pgsql-data-status                 : LATEST
The logs show no attempt to start the database. But again,
if I have the HS in the cluster, *it* gets started, but in recovery mode.

  But perhaps I should be using corosync 2.3.3, which is available for the
above platforms.

-- Round 6 --
  Installed corosync-2.3.3. It won't upgrade in place due to conflicts;
pacemaker must be uninstalled first. Afterwards, clusterlib refuses to install
due to dependency conflicts. Another user posted this problem on this forum 2
years ago, so I'm surprised this is still an issue. By the way, the repo URL
for this is:
http://download.opensuse.org/repositories/network:/ha-clustering:/Stable/CentOS_CentOS-6/
  The unresolvable errors include:
           Requires: libcoroipcc.so.4()(64bit)
Error: Package: pacemaker-libs-1.1.12+git20140723.483f48a-1.1.x86_64
(network_ha-clustering_Stable)
           Requires: libconfdb.so.4()(64bit)
Error: Package: pacemaker-cli-1.1.12+git20140723.483f48a-1.1.x86_64
(network_ha-clustering_Stable)

What's the best route for me to go here? Find the right set of RPMs? Build
from source? If so, which versions? Throw it all away and try CMAN
(shiver)? Go back to Corosync 1.4.7 and try again?