[ClusterLabs] PAF not starting resource successfully after node reboot (was: How to set up fencing/stonith)

Wed May 30 22:30:54 UTC 2018

> In this case, the agent is returning "master (failed)", which does not
> mean that it previously failed when it was master -- it means it is
> currently running as master, in a failed condition.

Well, it surely is NOT running.  So is the likely problem the way it's doing this check?  I see a lot of people here using PAF - I'd be surprised if such a bug hadn't been discovered already...

> Stopping an already stopped service does not return an error -- here,
> the agent is saying it was unable to demote or stop a running instance.

I still don't understand.  There are *NO* postgres processes running on the node, no additions to the log file.  Nothing whatsoever that supports the notion that it's a running instance.

> Unfortunately clustering has some inherent complexity that gives it a
> steep learning curve. On top of that, logging/troubleshooting
> improvements are definitely an area of ongoing need in pacemaker. The
> good news is that once a cluster is running successfully, it's usually
> smooth sailing after that.

I hope so...  I just don't see what I'm doing that's outside of the standard box.  I've set up PAF following its instructions.  I see that others here are using it.  Hasn't anybody else gotten such a setup working already?  I would think this is a pretty standard failure case that anybody would test if they've set up a cluster...  In any case, I'll keep at it as long as I can...on to debugging...

> You can debug like this:
> 
> 1. Unmanage the resource in pacemaker, so you can mess with it
> manually.
> 
> 2. Cause the desired failure for testing. Pacemaker should detect the
> failure, but not do anything about it.

I executed `pcs resource unmanage postgresql-ha`, and then powered off the master node.  The fencing kicked in and restarted the node.  After the node rebooted, I issued a `pcs cluster start` on it, because the crm_resource command complained about the CIB until I did.

I then ended up seeing this:

------
 vfencing       (stonith:external/vcenter):     Started d-gp2-dbpg0-1
 postgresql-master-vip  (ocf::heartbeat:IPaddr2):       Started d-gp2-dbpg0-2
 Master/Slave Set: postgresql-ha [postgresql-10-main] (unmanaged)
     postgresql-10-main (ocf::heartbeat:pgsqlms):       Started d-gp2-dbpg0-3 (unmanaged)
     postgresql-10-main (ocf::heartbeat:pgsqlms):       Slave d-gp2-dbpg0-1 (unmanaged)
     postgresql-10-main (ocf::heartbeat:pgsqlms):       FAILED Master d-gp2-dbpg0-2 (unmanaged)

Failed Actions:
* postgresql-10-main_monitor_0 on d-gp2-dbpg0-2 'master (failed)' (9): call=14, status=complete, exitreason='Instance "postgresql-10-main" controldata indicates a running primary instance, the instance has probably crashed',
    last-rc-change='Wed May 30 22:18:16 2018', queued=0ms, exec=190ms
* postgresql-10-main_monitor_15000 on d-gp2-dbpg0-2 'master (failed)' (9): call=16, status=complete, exitreason='Instance "postgresql-10-main" controldata indicates a running primary instance, the instance has probably crashed',
    last-rc-change='Wed May 30 22:18:16 2018', queued=0ms, exec=138ms
------
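
For reference, the exit reason above points at the PostgreSQL control file rather than at a live process, so the state it refers to can be read directly with pg_controldata.  Below is a minimal sketch of how one might pull out that field; the helper name `cluster_state` is hypothetical, and the data directory in the usage comment is an assumption based on the resource name, not something confirmed in this thread.

```shell
#!/usr/bin/env bash
# Print the "Database cluster state" value from pg_controldata output read on stdin.
# Returns 1 if the field is not present in the input.
cluster_state() {
    local line
    while IFS= read -r line; do
        case "$line" in
            "Database cluster state:"*)
                # Drop the label, then trim the padding that follows it.
                line="${line#Database cluster state:}"
                line="${line#"${line%%[![:space:]]*}"}"
                printf '%s\n' "$line"
                return 0
                ;;
        esac
    done
    return 1
}

# Hypothetical usage on the failed node (the data directory is an assumption):
#   pg_controldata /var/lib/postgresql/10/main | cluster_state
```

A value of "in production" with no postgres processes present would be consistent with the "has probably crashed" wording in the exit reason, since a cleanly stopped primary normally records "shut down" instead.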

> 3. Run crm_resource with the -VV option and --force-* with whatever
> action you want to attempt (in this case, demote or stop). The -VV (aka
> --verbose --verbose) will turn on OCF_TRACE_RA. The --force-* command
> will read the resource configuration and do the same thing pacemaker
> would do to execute the command.

Since you're telling me the agent thinks the service is running when it definitely is not, I wanted to see what the check itself is doing.  I tried the following command, which didn't work (am I doing something wrong?):

------
root@d-gp2-dbpg0-2:/var# crm_resource -r postgresql-ha -VV --force-check
 warning: unpack_rsc_op_failure:        Processing failed op monitor for postgresql-10-main:2 on d-gp2-dbpg0-2: master (failed) (9)
 warning: unpack_rsc_op_failure:        Processing failed op monitor for postgresql-10-main:2 on d-gp2-dbpg0-2: master (failed) (9)
Operation monitor for postgresql-10-main:0 (ocf:heartbeat:pgsqlms) returned 5
 >  stderr: Use of uninitialized value $OCF_Functions::ARG[0] in pattern match (m//) at /usr/lib/ocf/resource.d/heartbeat/../../lib/heartbeat/OCF_Functions.pm line 392.
 >  stderr: ocf-exit-reason:PAF v2.2.0 is compatible with Pacemaker 1.1.13 and greater
Error performing operation: Input/output error
------
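
As an aid to reading the numeric codes in this output, here is a small lookup of the standard OCF resource agent return codes.  It is only a sketch: the function name `ocf_rc_name` is hypothetical, and the table covers just the codes 0 through 9 defined for OCF agents.

```shell
#!/usr/bin/env bash
# Map a numeric OCF resource agent return code to its symbolic name.
ocf_rc_name() {
    case "$1" in
        0) echo OCF_SUCCESS ;;
        1) echo OCF_ERR_GENERIC ;;
        2) echo OCF_ERR_ARGS ;;
        3) echo OCF_ERR_UNIMPLEMENTED ;;
        4) echo OCF_ERR_PERM ;;
        5) echo OCF_ERR_INSTALLED ;;
        6) echo OCF_ERR_CONFIGURED ;;
        7) echo OCF_NOT_RUNNING ;;
        8) echo OCF_RUNNING_MASTER ;;
        9) echo OCF_FAILED_MASTER ;;
        *) echo "unknown ($1)"; return 1 ;;
    esac
}

ocf_rc_name 5   # prints OCF_ERR_INSTALLED
ocf_rc_name 9   # prints OCF_FAILED_MASTER
```

So the "returned 5" from the forced operation is a different code from the "master (failed) (9)" recorded by the cluster's own monitor, and the two should not be read as the same failure.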

Attempting to force-demote didn't work either:

------
root@d-gp2-dbpg0-2:/var# crm_resource -r postgresql-ha -VV --force-demote
 warning: unpack_rsc_op_failure:        Processing failed op monitor for postgresql-10-main:2 on d-gp2-dbpg0-2: master (failed) (9)
 warning: unpack_rsc_op_failure:        Processing failed op monitor for postgresql-10-main:2 on d-gp2-dbpg0-2: master (failed) (9)
resource postgresql-ha is running on: d-gp2-dbpg0-3 
resource postgresql-ha is running on: d-gp2-dbpg0-1 
resource postgresql-ha is running on: d-gp2-dbpg0-2 Master
It is not safe to demote postgresql-ha here: the cluster claims it is already active
Try setting target-role=stopped first or specifying --force

root@d-gp2-dbpg0-2:/var# crm_resource -r postgresql-ha -VV --force-demote --force
 warning: unpack_rsc_op_failure:        Processing failed op monitor for postgresql-10-main:2 on d-gp2-dbpg0-2: master (failed) (9)
 warning: unpack_rsc_op_failure:        Processing failed op monitor for postgresql-10-main:2 on d-gp2-dbpg0-2: master (failed) (9)
resource postgresql-ha is running on: d-gp2-dbpg0-3 
resource postgresql-ha is running on: d-gp2-dbpg0-1 
resource postgresql-ha is running on: d-gp2-dbpg0-2 Master
Operation demote for postgresql-10-main:0 (ocf:heartbeat:pgsqlms) returned 5
 >  stderr: Use of uninitialized value $OCF_Functions::ARG[0] in pattern match (m//) at /usr/lib/ocf/resource.d/heartbeat/../../lib/heartbeat/OCF_Functions.pm line 392.
 >  stderr: ocf-exit-reason:PAF v2.2.0 is compatible with Pacemaker 1.1.13 and greater
Error performing operation: Input/output error
------

Neither did force-stop:

------
root@d-gp2-dbpg0-2:/var# crm_resource -r postgresql-ha -VV --force-stop
 warning: unpack_rsc_op_failure:        Processing failed op monitor for postgresql-10-main:2 on d-gp2-dbpg0-2: master (failed) (9)
 warning: unpack_rsc_op_failure:        Processing failed op monitor for postgresql-10-main:2 on d-gp2-dbpg0-2: master (failed) (9)
Operation stop for postgresql-10-main:0 (ocf:heartbeat:pgsqlms) returned 5
 >  stderr: Use of uninitialized value $OCF_Functions::ARG[0] in pattern match (m//) at /usr/lib/ocf/resource.d/heartbeat/../../lib/heartbeat/OCF_Functions.pm line 392.
 >  stderr: ocf-exit-reason:PAF v2.2.0 is compatible with Pacemaker 1.1.13 and greater
Error performing operation: Input/output error
------

Thanks,
-- 
Casey