[ClusterLabs] PAF not starting resource successfully after node reboot (was: How to set up fencing/stonith)
Casey & Gina
caseyandgina at icloud.com
Wed May 30 18:30:54 EDT 2018
> In this case, the agent is returning "master (failed)", which does not
> mean that it previously failed when it was master -- it means it is
> currently running as master, in a failed condition.
Well, it surely is NOT running. So the likely problem is the way it's doing this check? I see a lot of people here using PAF - I'd be surprised if such a bug weren't discovered already...
> Stopping an already stopped service does not return an error -- here,
> the agent is saying it was unable to demote or stop a running instance.
I still don't understand. There are *NO* postgres processes running on the node, no additions to the log file. Nothing whatsoever that supports the notion that it's a running instance.
> Unfortunately clustering has some inherent complexity that gives it a
> steep learning curve. On top of that, logging/troubleshooting
> improvements are definitely an area of ongoing need in pacemaker. The
> good news is that once a cluster is running successfully, it's usually
> smooth sailing after that.
I hope so... I just don't see what I'm doing that's outside of the standard box. I've set up PAF following its instructions. I see that others here are using it. Hasn't anybody else gotten such a setup working already? I would think this is a pretty standard failure case that anybody would test if they've set up a cluster... In any case, I'll keep at it as long as I can... on to debugging...
> You can debug like this:
>
> 1. Unmanage the resource in pacemaker, so you can mess with it
> manually.
>
> 2. Cause the desired failure for testing. Pacemaker should detect the
> failure, but not do anything about it.
I executed `pcs resource unmanage postgresql-ha` and then powered off the master node. Fencing kicked in and restarted the node. After the node rebooted, I ran `pcs cluster start` on it, because the crm_resource command complained about the CIB until I did.
I then ended up seeing this:
------
vfencing (stonith:external/vcenter): Started d-gp2-dbpg0-1
postgresql-master-vip (ocf::heartbeat:IPaddr2): Started d-gp2-dbpg0-2
Master/Slave Set: postgresql-ha [postgresql-10-main] (unmanaged)
postgresql-10-main (ocf::heartbeat:pgsqlms): Started d-gp2-dbpg0-3 (unmanaged)
postgresql-10-main (ocf::heartbeat:pgsqlms): Slave d-gp2-dbpg0-1 (unmanaged)
postgresql-10-main (ocf::heartbeat:pgsqlms): FAILED Master d-gp2-dbpg0-2 (unmanaged)
Failed Actions:
* postgresql-10-main_monitor_0 on d-gp2-dbpg0-2 'master (failed)' (9): call=14, status=complete, exitreason='Instance "postgresql-10-main" controldata indicates a running primary instance, the instance has probably crashed',
last-rc-change='Wed May 30 22:18:16 2018', queued=0ms, exec=190ms
* postgresql-10-main_monitor_15000 on d-gp2-dbpg0-2 'master (failed)' (9): call=16, status=complete, exitreason='Instance "postgresql-10-main" controldata indicates a running primary instance, the instance has probably crashed',
last-rc-change='Wed May 30 22:18:16 2018', queued=0ms, exec=138ms
------
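For reference, the exitreason above mentions "controldata", which suggests the agent is looking at what pg_controldata reports rather than at running processes. pg_controldata reads the pg_control file on disk, and after a hard power-off that file still holds the last recorded state, so it can say "in production" even though no postgres process exists. A minimal sketch for pulling out that state follows. The function name `cluster_state` and the sample lines are illustrative assumptions, not output captured from this node.

```shell
#!/bin/sh
# Print the value of the "Database cluster state" line from
# pg_controldata output supplied on stdin.
# (cluster_state is a hypothetical helper name, not part of PAF.)
cluster_state() {
    sed -n 's/^Database cluster state:[[:space:]]*//p'
}

# Illustrative sample lines only, NOT captured from the node in this thread.
crashed='Database cluster state:               in production'
clean='Database cluster state:               shut down'

printf '%s\n' "$crashed" | cluster_state   # prints: in production
printf '%s\n' "$clean" | cluster_state     # prints: shut down
```

On a real node this would be fed from something like `pg_controldata "$PGDATA" | cluster_state`, where the location of the pg_controldata binary and the data directory depend on the distribution and are assumptions here.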
> 3. Run crm_resource with the -VV option and --force-* with whatever
> action you want to attempt (in this case, demote or stop). The -VV (aka
> --verbose --verbose) will turn on OCF_TRACE_RA. The --force-* command
> will read the resource configuration and do the same thing pacemaker
> would do to execute the command.
I wanted to see what the "check" actually does, since you're telling me the agent thinks the service is running when it's definitely not. I tried the following command, which didn't work (am I doing something wrong?):
------
root@d-gp2-dbpg0-2:/var# crm_resource -r postgresql-ha -VV --force-check
warning: unpack_rsc_op_failure: Processing failed op monitor for postgresql-10-main:2 on d-gp2-dbpg0-2: master (failed) (9)
warning: unpack_rsc_op_failure: Processing failed op monitor for postgresql-10-main:2 on d-gp2-dbpg0-2: master (failed) (9)
Operation monitor for postgresql-10-main:0 (ocf:heartbeat:pgsqlms) returned 5
> stderr: Use of uninitialized value $OCF_Functions::ARG[0] in pattern match (m//) at /usr/lib/ocf/resource.d/heartbeat/../../lib/heartbeat/OCF_Functions.pm line 392.
> stderr: ocf-exit-reason:PAF v2.2.0 is compatible with Pacemaker 1.1.13 and greater
Error performing operation: Input/output error
------
Attempting to force-demote didn't work either:
------
root@d-gp2-dbpg0-2:/var# crm_resource -r postgresql-ha -VV --force-demote
warning: unpack_rsc_op_failure: Processing failed op monitor for postgresql-10-main:2 on d-gp2-dbpg0-2: master (failed) (9)
warning: unpack_rsc_op_failure: Processing failed op monitor for postgresql-10-main:2 on d-gp2-dbpg0-2: master (failed) (9)
resource postgresql-ha is running on: d-gp2-dbpg0-3
resource postgresql-ha is running on: d-gp2-dbpg0-1
resource postgresql-ha is running on: d-gp2-dbpg0-2 Master
It is not safe to demote postgresql-ha here: the cluster claims it is already active
Try setting target-role=stopped first or specifying --force
root@d-gp2-dbpg0-2:/var# crm_resource -r postgresql-ha -VV --force-demote --force
warning: unpack_rsc_op_failure: Processing failed op monitor for postgresql-10-main:2 on d-gp2-dbpg0-2: master (failed) (9)
warning: unpack_rsc_op_failure: Processing failed op monitor for postgresql-10-main:2 on d-gp2-dbpg0-2: master (failed) (9)
resource postgresql-ha is running on: d-gp2-dbpg0-3
resource postgresql-ha is running on: d-gp2-dbpg0-1
resource postgresql-ha is running on: d-gp2-dbpg0-2 Master
Operation demote for postgresql-10-main:0 (ocf:heartbeat:pgsqlms) returned 5
> stderr: Use of uninitialized value $OCF_Functions::ARG[0] in pattern match (m//) at /usr/lib/ocf/resource.d/heartbeat/../../lib/heartbeat/OCF_Functions.pm line 392.
> stderr: ocf-exit-reason:PAF v2.2.0 is compatible with Pacemaker 1.1.13 and greater
Error performing operation: Input/output error
------
Neither did force-stop:
------
root@d-gp2-dbpg0-2:/var# crm_resource -r postgresql-ha -VV --force-stop
warning: unpack_rsc_op_failure: Processing failed op monitor for postgresql-10-main:2 on d-gp2-dbpg0-2: master (failed) (9)
warning: unpack_rsc_op_failure: Processing failed op monitor for postgresql-10-main:2 on d-gp2-dbpg0-2: master (failed) (9)
Operation stop for postgresql-10-main:0 (ocf:heartbeat:pgsqlms) returned 5
> stderr: Use of uninitialized value $OCF_Functions::ARG[0] in pattern match (m//) at /usr/lib/ocf/resource.d/heartbeat/../../lib/heartbeat/OCF_Functions.pm line 392.
> stderr: ocf-exit-reason:PAF v2.2.0 is compatible with Pacemaker 1.1.13 and greater
Error performing operation: Input/output error
------
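For anyone reading the numbers in the output above: the "(9)" on the failed monitor and the "returned 5" from the --force-* attempts are OCF resource agent return codes. The sketch below maps the standard codes to their names; the helper name `ocf_rc_name` is a hypothetical one chosen for illustration. Note that 5 corresponds to "not installed", which is a different code from the "master (failed)" result the cluster recorded.

```shell
#!/bin/sh
# Map a standard OCF resource agent return code to its symbolic name.
# (ocf_rc_name is a hypothetical helper, not a pacemaker command.)
ocf_rc_name() {
    case "$1" in
        0) echo OCF_SUCCESS ;;
        1) echo OCF_ERR_GENERIC ;;
        2) echo OCF_ERR_ARGS ;;
        3) echo OCF_ERR_UNIMPLEMENTED ;;
        4) echo OCF_ERR_PERM ;;
        5) echo OCF_ERR_INSTALLED ;;
        6) echo OCF_ERR_CONFIGURED ;;
        7) echo OCF_NOT_RUNNING ;;
        8) echo OCF_RUNNING_MASTER ;;
        9) echo OCF_FAILED_MASTER ;;
        *) echo UNKNOWN ;;
    esac
}

ocf_rc_name 9   # prints: OCF_FAILED_MASTER
ocf_rc_name 5   # prints: OCF_ERR_INSTALLED
```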
Thanks,
--
Casey