[ClusterLabs] Re: ocf:heartbeat:pgsql not starting

Fri Aug 12 07:19:51 UTC 2016

Two tips:

1) Did you stop the configured postgres in the cluster and put it into maintenance mode while trying ocf-tester?
2) When testing my RAs I temporarily replace "#!/bin/sh" with "#!/bin/sh -x". It produces a lot of output, but sometimes that is how you find the problem.
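
A minimal sketch of tip 2, assuming you would rather trace a copy than edit the installed agent in place. The demo_ra script and the temporary directory are hypothetical stand-ins, not part of the pgsql agent; a real RA would also need its OCF_* environment variables set before it does anything useful.

```shell
#!/bin/sh
# Sketch: turn on shell tracing for a COPY of a resource agent.
# demo_ra is a hypothetical stand-in for the real agent script.
set -eu

workdir=$(mktemp -d)
ra="$workdir/demo_ra"

# Stand-in agent with a plain "#!/bin/sh" shebang.
cat > "$ra" <<'EOF'
#!/bin/sh
echo "monitor called"
EOF
chmod +x "$ra"

# Rewrite only the first line of the copy; the original stays untouched.
traced="$workdir/demo_ra.traced"
sed '1s|^#!/bin/sh$|#!/bin/sh -x|' "$ra" > "$traced"
chmod +x "$traced"

# Normal output goes to stdout, the -x trace goes to stderr.
"$traced" > "$workdir/out.log" 2> "$workdir/trace.log"

head -n 1 "$traced"
cat "$workdir/trace.log"
```

Keeping the trace in a separate file makes it easier to search the (often long) output for the first command that fails.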

Regards,
Ulrich

>>> Darren Kinley <dkinley at mdacorporation.com> wrote on 11.08.2016 at 23:44 in
message <0C9F39FD10C20E49BDFE9C5B09C5E7D83F955EDE at exbermd01.ds.mda.ca>:
> Hi,
> 
> I have PostgreSQL 9.3 replicated and I'm trying to put it under Pacemaker control
> using ocf:heartbeat:pgsql provided by SLES12SP1.
> 
> This is the crmsh script that I used to configure Pacemaker.
> 
>         configure cib new pgsql_cfg --force
>         configure primitive res-ars-pgsql ocf:heartbeat:pgsql \
>            pgctl="/usr/lib/postgresql93/bin/pg_ctl" \
>            psql="/usr/lib/postgresql93/bin/psql" \
>            pgdata="/var/lib/pgsql/data/" \
>            rep_mode="sync" \
>            node_list="ars1 ars2" \
>            restore_command="cp /var/lib/pgsql/pg_archive/%f %p" \
>            primary_conninfo_opt="keepalives_idle=60 keepalives_interval=5 keepalives_count=5" \
>            master_ip="192.168.244.223" \
>            restart_on_promote='true' \
>            pghost="191.168.244.223" \
>            repuser="postgres" \
>            check_wal_receiver='true' \
>            monitor_user='postgres' \
>            monitor_password='xxx' \
>            op start   timeout="120s" interval="0s"  on-fail="restart" \
>            op monitor timeout="120s" interval="4s" on-fail="restart" \
>            op monitor timeout="120s" interval="3s"  on-fail="restart" role="Master" \
>            op promote timeout="120s" interval="0s"  on-fail="restart" \
>            op demote  timeout="120s" interval="0s"  on-fail="stop" \
>            op stop    timeout="120s" interval="0s"  on-fail="block" \
>            op notify  timeout="90s" interval="0s"
>         configure ms ms-ars-pgsql res-ars-pgsql \
>            meta master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 notify=true
>         configure colocation col-ars-pgsql-with-drbd inf: ms-ars-pgsql:Master ms-ars-drbd:Master
>         configure cib commit pgsql_cfg
> 
> I have a ~postgres/.pgpass
> 
> 
> My nodes remain stopped. Only once during the 12 hours I've been working on this
> did both nodes try to bring up PG (both in recovery mode), before both were
> shut down again.
> 
> When running ocf-tester, I believe I'm supposed to name the master/slave resource.
> 
>         ars2:/usr/lib/ocf/resource.d/heartbeat # ocf-tester -v -n ms-ars-pgsql `pwd`/pgsql
>         Beginning tests for /usr/lib/ocf/resource.d/heartbeat/pgsql...
>         Testing permissions with uid nobody
>         Testing: meta-data
>         Testing: meta-data
>         ...
>         <XML removed/>
>         ...
>         Testing: validate-all
>         Checking current state
>         Testing: stop
>         INFO: waiting for server to shut down.... done server stopped
>         INFO: PostgreSQL is down
>         Testing: monitor
>         INFO: PostgreSQL is down
>         Testing: monitor
>         ocf-exit-reason:Setup problem: couldn't find command: /usr/bin/pg_ctl
>         Testing: start
>         INFO: server starting
>         INFO: PostgreSQL start command sent.
>         INFO: PostgreSQL is started.
>         Testing: monitor
>         Testing: monitor
>         INFO: Don't check /var/lib/pgsql/data during probe
>         Testing: notify
>         Checking for demote action
>         ocf-exit-reason:Not in a replication mode.
>         Checking for promote action
>         ocf-exit-reason:Not in a replication mode.
>         Testing: demotion of started resource
>         ocf-exit-reason:Not in a replication mode.
>         * rc=6: Demoting a start resource should not fail
>         Testing: promote
>         ocf-exit-reason:Not in a replication mode.
>         * rc=6: Promote failed
>         Testing: demote
>         ocf-exit-reason:Not in a replication mode.
>         * rc=6: Demote failed
>         Aborting tests
> 
> 
> 'Not in a replication mode' contradicts the res-ars-pgsql configuration above, which sets rep_mode="sync".
> I'm not sure that the pacemaker.log for CIB changes is needed.
> 
>         Aug 11 09:19:53 [2757] ars2    pengine:     info: clone_print: Master/Slave Set: ms-ars-pgsql [res-ars-pgsql]
>         Aug 11 09:19:53 [2757] ars2    pengine:     info: short_print: Stopped: [ ars1 ars2 ]
>         Aug 11 09:19:53 [2757] ars2    pengine:     info: get_failcount_full: res-ars-pgsql:0 has failed INFINITY times on ars1
>         Aug 11 09:19:53 [2757] ars2    pengine:  warning: common_apply_stickiness: Forcing ms-ars-pgsql away from ars1 after 1000000 failures (max=1000000)
>         Aug 11 09:19:53 [2757] ars2    pengine:     info: get_failcount_full: ms-ars-pgsql has failed INFINITY times on ars1
>         Aug 11 09:19:53 [2757] ars2    pengine:  warning: common_apply_stickiness: Forcing ms-ars-pgsql away from ars1 after 1000000 failures (max=1000000)
>         Aug 11 09:19:53 [2757] ars2    pengine:     info: get_failcount_full: res-ars-pgsql:0 has failed INFINITY times on ars2
>         Aug 11 09:19:53 [2757] ars2    pengine:  warning: common_apply_stickiness: Forcing ms-ars-pgsql away from ars2 after 1000000 failures (max=1000000)
>         Aug 11 09:19:53 [2757] ars2    pengine:     info: get_failcount_full: ms-ars-pgsql has failed INFINITY times on ars2
>         Aug 11 09:19:53 [2757] ars2    pengine:  warning: common_apply_stickiness: Forcing ms-ars-pgsql away from ars2 after 1000000 failures (max=1000000)
>         Aug 11 09:19:53 [2757] ars2    pengine:     info: rsc_merge_weights: ms-ars-drbd: Rolling back scores from ms-ars-pgsql
>         Aug 11 09:19:53 [2757] ars2    pengine:     info: master_color: Promoting res-ars-drbd:1 (Master ars2)
>         Aug 11 09:19:53 [2757] ars2    pengine:     info: master_color: ms-ars-drbd: Promoted 1 instances of a possible 1 to master
>         Aug 11 09:19:53 [2757] ars2    pengine:     info: native_color: res-ars-pgsql:0: Rolling back scores from ms-ars-drbd
>         Aug 11 09:19:53 [2757] ars2    pengine:     info: native_color: Resource res-ars-pgsql:0 cannot run anywhere
>         Aug 11 09:19:53 [2757] ars2    pengine:     info: native_color: res-ars-pgsql:1: Rolling back scores from ms-ars-drbd
>         Aug 11 09:19:53 [2757] ars2    pengine:     info: native_color: Resource res-ars-pgsql:1 cannot run anywhere
>         Aug 11 09:19:53 [2757] ars2    pengine:     info: master_color: ms-ars-pgsql: Promoted 0 instances of a possible 1 to master
>         Aug 11 09:19:53 [2757] ars2    pengine:     info: LogActions: Leave   res-mgmt-vip    (Started ars2)
>         Aug 11 09:19:53 [2757] ars2    pengine:     info: LogActions: Leave   res-mgmt-app    (Started ars2)
>         Aug 11 09:19:53 [2757] ars2    pengine:     info: LogActions: Leave   res-ars-vip     (Started ars2)
>         Aug 11 09:19:53 [2757] ars2    pengine:     info: LogActions: Leave   res-ars-drbd:0  (Slave ars1)
>         Aug 11 09:19:53 [2757] ars2    pengine:     info: LogActions: Leave   res-ars-drbd:1  (Master ars2)
>         Aug 11 09:19:53 [2757] ars2    pengine:     info: LogActions: Leave   res-ars-lvm     (Started ars2)
>         Aug 11 09:19:53 [2757] ars2    pengine:     info: LogActions: Leave   res-ars-fs_dropbox      (Started ars2)
>         Aug 11 09:19:53 [2757] ars2    pengine:     info: LogActions: Leave   res-ars-fs_svndata      (Started ars2)
>         Aug 11 09:19:53 [2757] ars2    pengine:     info: LogActions: Leave   res-ars-pgsql:0 (Stopped)
>         Aug 11 09:19:53 [2757] ars2    pengine:     info: LogActions: Leave   res-ars-pgsql:1 (Stopped)
>         Aug 11 09:19:53 [2758] ars2       crmd:     info: do_state_transition: State transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE origin=handle_response ]
>         Aug 11 09:19:53 [2758] ars2       crmd:   notice: do_te_invoke: Processing graph 222 (ref=pe_calc-dc-1470932393-1349) derived from /var/lib/pacemaker/pengine/pe-input-625.bz2
> 
> and /var/log/messages
> 
>         2016-08-11T09:19:53.146603-07:00 ars-2 crmd[2758]:   notice: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_TIMER_POPPED origin=crm_timer_popped ]
>         2016-08-11T09:19:53.152322-07:00 ars-2 pengine[2757]:   notice: On loss of CCM Quorum: Ignore
>         2016-08-11T09:19:53.153078-07:00 ars-2 pengine[2757]:  warning: Forcing ms-ars-pgsql away from ars1 after 1000000 failures (max=1000000)
>         2016-08-11T09:19:53.153266-07:00 ars-2 pengine[2757]:  warning: Forcing ms-ars-pgsql away from ars1 after 1000000 failures (max=1000000)
>         2016-08-11T09:19:53.153395-07:00 ars-2 pengine[2757]:  warning: Forcing ms-ars-pgsql away from ars2 after 1000000 failures (max=1000000)
>         2016-08-11T09:19:53.153547-07:00 ars-2 pengine[2757]:  warning: Forcing ms-ars-pgsql away from ars2 after 1000000 failures (max=1000000)
>         2016-08-11T09:19:53.155568-07:00 ars-2 crmd[2758]:   notice: Processing graph 222 (ref=pe_calc-dc-1470932393-1349) derived from /var/lib/pacemaker/pengine/pe-input-625.bz2
>         2016-08-11T09:19:53.155768-07:00 ars-2 pengine[2757]:   notice: Calculated Transition 222: /var/lib/pacemaker/pengine/pe-input-625.bz2
>         2016-08-11T09:19:53.155927-07:00 ars-2 crmd[2758]:   notice: Transition 222 (Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-625.bz2): Complete
>         2016-08-11T09:19:53.156085-07:00 ars-2 crmd[2758]:   notice: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]
> 
> 
> Can anyone provide thoughts on how to debug this?
> Should I give up on the SLES-provided RA and use PAF instead?
> 
> Thanks,
> Darren