[Pacemaker] PostgreSQL failed to stop after streaming replication established

Tue Aug 27 08:15:06 EDT 2013

Hi all.

I reformatted the DRBD disk to get rid of the corrupted postmaster.pid file.
After that everything works fine, and I can no longer reproduce the issue.

Best regards,

Michal Mistina

From: Mistina Michal [mailto:Michal.Mistina at virte.sk] 
Sent: Monday, August 19, 2013 9:39 AM
To: The Pacemaker cluster resource manager
Subject: [Pacemaker] PostgreSQL failed to stop after streaming replication
established

Dear community.

The redundant environment is sketched in the "graphic" representation below:

             +------------------------------------+
             |                WAN                 |
             +                                    v
+------------+------------+          +------------+------------+
|pgsql       |pgsql       |          |pgsql       |pgsql       |
+------------+------------+          +------------+------------+
|drbd-pri    |drbd-sec    |          |drbd-pri    |drbd-sec    |
+------------+------------+          +------------+------------+
|        pacemaker        |          |        pacemaker        |
+-------------------------+          +-------------------------+
|        corosync         |          |        corosync         |
+------------+------------+          +------------+------------+
|node1       |node2       |          |node1       |node2       |
+------------+------------+          +------------+------------+
            TC1                                  TC2

Within each technical center everything worked fine when migrating resources
between nodes. 

Then I set up streaming replication from TC1 to TC2.

Now migration from one node to another fails. The Pacemaker operation to
stop the postgres resource FAILED. PostgreSQL itself was stopped, but
postmaster.pid was left behind in a corrupted state.

This is where I have ended up.

I am unable to stop the postgresql service correctly on TC1 (the streaming
replication master). After issuing /etc/init.d/postgresql-9.2 stop, the
postmaster.pid file remains on the filesystem and, moreover, it is
corrupted. I am unable to delete it with the rm command.

It looks like this:

[root@pcmk1 ~]# ll /var/lib/pgsql/9.2/data/
ls: cannot access /var/lib/pgsql/9.2/data/postmaster.pid: No such file or directory
total 56
drwx------ 7 postgres postgres    62 Jun 26 17:13 base
drwx------ 2 postgres postgres  4096 Aug 18 00:25 global
drwx------ 2 postgres postgres    17 Jun 26 09:54 pg_clog
-rw------- 1 postgres postgres  5127 Aug 17 16:24 pg_hba.conf
-rw------- 1 postgres postgres  1636 Jun 26 09:54 pg_ident.conf
drwx------ 2 postgres postgres  4096 Jul  2 00:00 pg_log
drwx------ 4 postgres postgres    34 Jun 26 09:53 pg_multixact
drwx------ 2 postgres postgres    17 Aug 18 00:23 pg_notify
drwx------ 2 postgres postgres     6 Jun 26 09:53 pg_serial
drwx------ 2 postgres postgres     6 Jun 26 09:53 pg_snapshots
drwx------ 2 postgres postgres     6 Aug 18 00:25 pg_stat_tmp
drwx------ 2 postgres postgres    17 Jun 26 09:54 pg_subtrans
drwx------ 2 postgres postgres     6 Jun 26 09:53 pg_tblspc
drwx------ 2 postgres postgres     6 Jun 26 09:53 pg_twophase
-rw------- 1 postgres postgres     4 Jun 26 09:53 PG_VERSION
drwx------ 3 postgres postgres  4096 Aug 18 00:25 pg_xlog
-rw------- 1 postgres postgres 19884 Aug 17 22:54 postgresql.conf
-rw------- 1 postgres postgres    71 Aug 18 00:23 postmaster.opts
?????????? ? ?        ?            ?            ? postmaster.pid
-rw-r--r-- 1 postgres postgres   491 Aug 17 16:33 recovery.done
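
The row of question marks in the listing above means that ls could read the
name postmaster.pid from the directory but could not stat the file behind
it. A minimal shell sketch for finding such entries in a directory follows.
find_unstatable_entries is a hypothetical helper name, and the sketch
assumes that the shell expands * from the directory listing without
stat'ing each match. Hidden files are not checked.

```shell
#!/bin/sh
# Hypothetical helper (not part of any package): print every entry of a
# directory whose name is listed but which can be neither stat'ed nor
# lstat'ed.
find_unstatable_entries() {
    dir=$1
    for entry in "$dir"/*; do
        # An empty directory leaves the pattern unexpanded; skip it.
        case $entry in
            "$dir/*") continue ;;
        esac
        # -e follows symlinks (stat), -L checks the link itself (lstat).
        # A dangling symlink fails -e but passes -L, so it is not reported.
        if [ ! -e "$entry" ] && [ ! -L "$entry" ]; then
            printf '%s\n' "$entry"
        fi
    done
}
```

Called as find_unstatable_entries /var/lib/pgsql/9.2/data, it would be
expected to print the postmaster.pid path in the state shown above; on a
healthy directory it prints nothing.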

I don't know whether the resource agent did something wrong while Pacemaker
was trying to stop postgres, or whether postgres itself is the component
that failed to stop correctly. What do you think? Has anybody experienced a
problem like this?
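
One way to narrow this down is to check the exit codes of the init script
by hand, since Pacemaker's lsb resource class relies on them: per the LSB
specification, stop must return 0 (also when the service is already
stopped) and status must return 3 for a stopped service. A minimal sketch
follows. check_lsb_stop is a hypothetical helper name, and it really stops
the service, so it should only be run while the cluster is not managing the
resource.

```shell
#!/bin/sh
# Hypothetical helper: run "stop" and then "status" on an init script and
# compare the exit codes with what the LSB specification requires.
check_lsb_stop() {
    script=$1
    "$script" stop >/dev/null 2>&1
    stop_rc=$?
    "$script" status >/dev/null 2>&1
    status_rc=$?
    echo "stop rc=$stop_rc, status rc=$status_rc"
    # LSB: stop returns 0 on success; status returns 3 when not running.
    [ "$stop_rc" -eq 0 ] && [ "$status_rc" -eq 3 ]
}
```

For the setup described here the call would be
check_lsb_stop /etc/init.d/postgresql-9.2. A non-zero return from the helper
points at the init script or PostgreSQL rather than at Pacemaker.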

I am using:

- pacemaker-1.1.7-6
- corosync-1.4.1-7
- resource-agents-3.9.2-12
- drbd-8.4.3-2

CONFIGURATION

[root@pcmk2 9.2]# crm configure show
node pcmk1 \
        attributes standby="off"
node pcmk2 \
        attributes standby="off"
primitive drbd_pg ocf:linbit:drbd \
        params drbd_resource="postgres" \
        op monitor interval="15" role="Master" \
        op monitor interval="16" role="Slave" \
        op start interval="0" timeout="240" \
        op stop interval="0" timeout="120"
primitive pg_fs ocf:heartbeat:Filesystem \
        params device="/dev/vg_local-lv_pgsql/lv_pgsql" directory="/var/lib/pgsql/9.2/data" options="noatime,nodiratime" fstype="xfs" \
        op start interval="0" timeout="60" \
        op stop interval="0" timeout="120"
primitive pg_lsb lsb:postgresql-9.2 \
        op monitor interval="30" timeout="60" \
        op start interval="0" timeout="60" \
        op stop interval="0" timeout="60"
primitive pg_lvm ocf:heartbeat:LVM \
        params volgrpname="vg_local-lv_pgsql" \
        op start interval="0" timeout="30" \
        op stop interval="0" timeout="30"
primitive pg_vip ocf:heartbeat:IPaddr2 \
        params ip="x.x.x.x" iflabel="pcmkvip" \
        op monitor interval="5"
group PGServer pg_lvm pg_fs pg_lsb pg_vip \
        meta target-role="Started"
ms ms_drbd_pg drbd_pg \
        meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1" notify="true" target-role="Started"
location master-prefer-node1 pg_vip 50: pcmk1
colocation col_pg_drbd inf: PGServer ms_drbd_pg:Master
order ord_pg inf: ms_drbd_pg:promote PGServer:start
property $id="cib-bootstrap-options" \
        dc-version="1.1.7-6.el6-148fccfd5985c5590cc601123c6c16e966b85d14" \
        cluster-infrastructure="openais" \
        expected-quorum-votes="4" \
        stonith-enabled="false" \
        no-quorum-policy="ignore" \
        maintenance-mode="true" \
        last-lrm-refresh="1376753310"
rsc_defaults $id="rsc-options" \
        resource-stickiness="100"

Best regards,

Michal Mistina
