[ClusterLabs] PAF not starting resource successfully after node reboot (was: How to set up fencing/stonith)

Casey & Gina caseyandgina at icloud.com
Tue May 29 21:56:42 UTC 2018


> On May 27, 2018, at 2:28 PM, Ken Gaillot <kgaillot at redhat.com> wrote:
> 
>> May 22 23:57:24 [2196] d-gp2-dbpg0-2    pengine:     info:
>> determine_op_status: Operation monitor found resource postgresql-10-
>> main:2 active on d-gp2-dbpg0-2
> 
>> May 22 23:57:24 [2196] d-gp2-dbpg0-2    pengine:   notice:
>> LogActions:  Demote  postgresql-10-main:1    (Master -> Slave d-gp2-
>> dbpg0-1)
>> May 22 23:57:24 [2196] d-gp2-dbpg0-2    pengine:   notice:
>> LogActions:  Recover postgresql-10-main:1    (Master d-gp2-dbpg0-1)
> 
> From the above, we can see that the initial probe after the node
> rejoined found that the resource was already running in master mode
> there (at least, that's what the agent thinks). So, the cluster wants
> to demote it, stop it, and start it again as a slave.

Are you sure you're reading the above correctly?  The first line you quoted says the resource is already active on node 2, which is not the node that was restarted; it's the node that took over as master after I powered node 1 off.
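
For what it's worth, current placement can also be confirmed directly from the cluster rather than from the pengine logs (crm_resource --locate is a stock Pacemaker command; postgresql-ha is the master resource from my configuration below):

  crm_resource --resource postgresql-ha --locate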

Anyway, I enabled debug logging in corosync.conf.
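
For reference, this is roughly the stanza I mean (a minimal sketch of the corosync.conf logging section; the exact set of to_* options doesn't matter here, only debug: on):

  logging {
      to_syslog: yes
      debug: on
  }

With that in place, I now see the following information: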

May 29 20:59:28 [10583] d-gp2-dbpg0-2 crm_resource:    debug: determine_op_status:      postgresql-10-main_monitor_0 on d-gp2-dbpg0-1 returned 'master (failed)' (9) instead of the expected value: 'not running' (7)
May 29 20:59:28 [10583] d-gp2-dbpg0-2 crm_resource:  warning: unpack_rsc_op_failure:    Processing failed op monitor for postgresql-10-main:1 on d-gp2-dbpg0-1: master (failed) (9)
May 29 20:59:28 [10583] d-gp2-dbpg0-2 crm_resource:    debug: determine_op_status:      postgresql-10-main_monitor_0 on d-gp2-dbpg0-1 returned 'master (failed)' (9) instead of the expected value: 'not running' (7)
May 29 20:59:28 [10583] d-gp2-dbpg0-2 crm_resource:  warning: unpack_rsc_op_failure:    Processing failed op monitor for postgresql-10-main:1 on d-gp2-dbpg0-1: master (failed) (9)

I'm not sure why these lines appear twice (the same question I've had in the past about other log messages), but whatever it does to check the status of the resource, it correctly determines that PostgreSQL failed while in the master state rather than being shut down cleanly.  Why that results in the node being fenced is beyond me.
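
If it helps, my understanding from the PAF docs is that pgsqlms derives this from the "Database cluster state" line of pg_controldata, which is easy to check by hand.  The paths below are the bindir/pgdata values from my configuration; the mapping in the comments is my assumption about how the agent interprets each state:

  # run as the postgres user against the data directory
  sudo -u postgres /usr/lib/postgresql/10/bin/pg_controldata \
      /var/lib/postgresql/10/main | grep 'Database cluster state'

  # 'in production'         = primary that was not shut down cleanly,
  #                           presumably what maps to 'master (failed)' (9)
  # 'shut down'             = cleanly stopped primary
  # 'shut down in recovery' = cleanly stopped standby, i.e. 'not running' (7)

And to see what the cluster plans to do about a given state, crm_simulate against the live CIB (a stock Pacemaker tool) prints the pending transition, including any fencing it intends:

  crm_simulate -sL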

I don't feel that I'm trying to do anything complex - just a simple cluster that handles PostgreSQL failover.  I'm not trying to do anything fancy and am pretty much following the PAF docs, plus the addition of the fencing resource (which the documentation says PAF requires to work properly - if this is "properly", I don't understand what goal it's trying to achieve...).  I'm getting really frustrated with Pacemaker, as I've been fighting to get it working for two months now and still feel in the dark about why it behaves the way it does.  I'm sorry if I seem like an idiot... this definitely makes me feel like one...


Here is my configuration again, in case it helps:

Cluster Name: d-gp2-dbpg0
Corosync Nodes:
 d-gp2-dbpg0-1 d-gp2-dbpg0-2 d-gp2-dbpg0-3
Pacemaker Nodes:
 d-gp2-dbpg0-1 d-gp2-dbpg0-2 d-gp2-dbpg0-3

Resources:
 Resource: postgresql-master-vip (class=ocf provider=heartbeat type=IPaddr2)
  Attributes: ip=10.124.164.250 cidr_netmask=22
  Operations: start interval=0s timeout=20s (postgresql-master-vip-start-interval-0s)
              stop interval=0s timeout=20s (postgresql-master-vip-stop-interval-0s)
              monitor interval=10s (postgresql-master-vip-monitor-interval-10s)
 Master: postgresql-ha
  Meta Attrs: notify=true 
  Resource: postgresql-10-main (class=ocf provider=heartbeat type=pgsqlms)
   Attributes: bindir=/usr/lib/postgresql/10/bin pgdata=/var/lib/postgresql/10/main pghost=/var/run/postgresql pgport=5432 recovery_template=/etc/postgresql/10/main/recovery.conf start_opts="-c config_file=/etc/postgresql/10/main/postgresql.conf"
   Operations: start interval=0s timeout=60s (postgresql-10-main-start-interval-0s)
               stop interval=0s timeout=60s (postgresql-10-main-stop-interval-0s)
               promote interval=0s timeout=30s (postgresql-10-main-promote-interval-0s)
               demote interval=0s timeout=120s (postgresql-10-main-demote-interval-0s)
               monitor interval=15s role=Master timeout=10s (postgresql-10-main-monitor-interval-15s)
               monitor interval=16s role=Slave timeout=10s (postgresql-10-main-monitor-interval-16s)
               notify interval=0s timeout=60s (postgresql-10-main-notify-interval-0s)

Stonith Devices:
 Resource: vfencing (class=stonith type=external/vcenter)
  Attributes: VI_SERVER=10.124.137.100 VI_CREDSTORE=/etc/pacemaker/vicredentials.xml HOSTLIST=d-gp2-dbpg0-1;d-gp2-dbpg0-2;d-gp2-dbpg0-3 RESETPOWERON=1
  Operations: monitor interval=60s (vfencing-monitor-60s)
Fencing Levels:

Location Constraints:
Ordering Constraints:
  promote postgresql-ha then start postgresql-master-vip (kind:Mandatory) (non-symmetrical) (id:order-postgresql-ha-postgresql-master-vip-Mandatory)
  demote postgresql-ha then stop postgresql-master-vip (kind:Mandatory) (non-symmetrical) (id:order-postgresql-ha-postgresql-master-vip-Mandatory-1)
Colocation Constraints:
  postgresql-master-vip with postgresql-ha (score:INFINITY) (rsc-role:Started) (with-rsc-role:Master) (id:colocation-postgresql-master-vip-postgresql-ha-INFINITY)

Resources Defaults:
 migration-threshold: 5
 resource-stickiness: 10
Operations Defaults:
 No defaults set

Cluster Properties:
 cluster-infrastructure: corosync
 cluster-name: d-gp2-dbpg0
 dc-version: 1.1.14-70404b0
 have-watchdog: false
 stonith-enabled: true
Node Attributes:
 d-gp2-dbpg0-1: master-postgresql-10-main=-1000
 d-gp2-dbpg0-2: master-postgresql-10-main=1001
 d-gp2-dbpg0-3: master-postgresql-10-main=1000
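
In case it's relevant, these are the stock pcs commands I know of for inspecting and clearing the failure state between tests (resource names are the ones above; migration-threshold is 5 per the defaults):

  # show accumulated failures for the pgsqlms clone
  pcs resource failcount show postgresql-10-main

  # forget the failed monitor result so the old master can be restarted
  pcs resource cleanup postgresql-ha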

Thanks,
-- 
Casey

