[ClusterLabs] PAF not starting resource successfully after node reboot (was: How to set up fencing/stonith)

Thu May 31 04:28:27 UTC 2018

On 31.05.2018 01:30, Casey & Gina wrote:
>> In this case, the agent is returning "master (failed)", which does not
>> mean that it previously failed when it was master -- it means it is
>> currently running as master, in a failed condition.
> 
> Well, it surely is NOT running.  So the likely problem is the way it's doing this check?  I see a lot of people here using PAF - I'd be surprised if such a bug weren't discovered already...
> 
>> Stopping an already stopped service does not return an error -- here,
>> the agent is saying it was unable to demote or stop a running instance.
> 
> I still don't understand.  There are *NO* postgres processes running on the node, no additions to the log file.  Nothing whatsoever that supports the notion that it's a running instance.
> 
>> Unfortunately clustering has some inherent complexity that gives it a
>> steep learning curve. On top of that, logging/troubleshooting
>> improvements are definitely an area of ongoing need in pacemaker. The
>> good news is that once a cluster is running successfully, it's usually
>> smooth sailing after that.
> 
> I hope so...  I just don't see what I'm doing that's outside of the standard box.  I've set up PAF following its instructions.  I see that others here are using it.  Hasn't anybody else gotten such a setup working already?  I would think this is a pretty standard failure case that anybody would test if they've set up a cluster...  In any case, I'll keep persisting as long as I can...on to debugging...
> 
>> You can debug like this:
>>
>> 1. Unmanage the resource in pacemaker, so you can mess with it
>> manually.
>>
>> 2. Cause the desired failure for testing. Pacemaker should detect the
>> failure, but not do anything about it.
> 
> I executed `pcs resource unmanage postgresql-ha`, and then powered off the master node.

There is no "master node" in pacemaker. There is a master/slave resource,
so at best it is "the node on which a specific resource has the master
role". And we have no way to know on which node your resource had the
master role when you did it. Please be more specific; otherwise it is
hard or impossible to follow.
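
One way to be specific is to paste what the cluster itself reports before the test. A minimal sketch, assuming `crm_resource --locate` prints lines in the same shape as the "resource ... is running on: ... Master" lines quoted further down in this message; the helper name `master_node` is made up for illustration:

```shell
#!/bin/sh
# Hypothetical helper: read "crm_resource --locate" style output on stdin
# and print the node(s) reported with the Master role.
master_node() {
    # Expected line shape (taken from the output quoted in this thread):
    #   resource postgresql-ha is running on: d-gp2-dbpg0-2 Master
    awk '/is running on:/ && $NF == "Master" { print $(NF-1) }'
}

# On a live cluster node (assumption: run as root):
#   crm_resource --resource postgresql-ha --locate | master_node
```

Recording that node name together with the name of the node that was powered off removes the ambiguity.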

>  The fencing kicked in and restarted the node.  After the node rebooted, I issued a `pcs cluster start` on it as the crm_resource command complained about the CIB without doing that.
> 
> I then ended up seeing this:
> 
> ------
>  vfencing       (stonith:external/vcenter):     Started d-gp2-dbpg0-1
>  postgresql-master-vip  (ocf::heartbeat:IPaddr2):       Started d-gp2-dbpg0-2
>  Master/Slave Set: postgresql-ha [postgresql-10-main] (unmanaged)
>      postgresql-10-main (ocf::heartbeat:pgsqlms):       Started d-gp2-dbpg0-3 (unmanaged)
>      postgresql-10-main (ocf::heartbeat:pgsqlms):       Slave d-gp2-dbpg0-1 (unmanaged)

Not specifically related to your problem, but I wonder what the
difference is. As far as I know, for master/slave "Started" == "Slave",
so I'm surprised to see two different states listed here.

>      postgresql-10-main (ocf::heartbeat:pgsqlms):       FAILED Master d-gp2-dbpg0-2 (unmanaged)
> 
> Failed Actions:
> * postgresql-10-main_monitor_0 on d-gp2-dbpg0-2 'master (failed)' (9): call=14, status=complete, exitreason='Instance "postgresql-10-main" controldata indicates a running primary instance, the instance has probably crashed',

Well, apparently the resource agent does not like a crashed instance.
That is quite possible: I have worked with another replicated database
where the configuration had to be fixed manually after failover,
*outside* of pacemaker. Pacemaker simply failed to start a resource that
was in an unexpected state.

This needs someone familiar with this RA and application to answer.
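
In the meantime, the exitreason suggests the agent is reading the PostgreSQL control file rather than looking only at processes. A rough sketch for checking the same thing by hand, assuming the agent compares the "Database cluster state" line of `pg_controldata` with whether a postgres process exists (that is a guess at its logic, not a copy of it). The helper names and the paths in the comments are assumptions:

```shell
#!/bin/sh
# Hypothetical helper: extract the "Database cluster state" value from
# pg_controldata output supplied on stdin.
cluster_state() {
    sed -n 's/^Database cluster state:[[:space:]]*//p'
}

# Returns 0 (true) when the control file claims a running primary while no
# postgres process was found, i.e. the "probably crashed" situation.
looks_crashed() {
    state=$1      # value printed by cluster_state
    running=$2    # "yes" if a postgres process was found, else "no"
    [ "$state" = "in production" ] && [ "$running" = "no" ]
}

# On the affected node (paths assume a Debian-style PostgreSQL 10 install):
#   state=$(/usr/lib/postgresql/10/bin/pg_controldata /var/lib/postgresql/10/main | cluster_state)
#   if pgrep -x postgres >/dev/null; then running=yes; else running=no; fi
#   looks_crashed "$state" "$running" && echo "control file says primary, nothing running"
```

If this prints the message, then "no postgres processes" and "master (failed)" are not contradictory: the control file still records a primary that was never shut down cleanly.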

Note that this is not quite a normal use case. You explicitly disabled
any handling by the RA, so you are effectively not using pacemaker high
availability at all. Does the master fail over if you do not unmanage
the resource and kill the node where the resource has the master role?
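
A possible sequence for that managed test, written as a helper that only prints the commands so they can be reviewed before anything is run. The function name `failover_test_plan` is hypothetical, and the resource id is the one from your status output:

```shell
#!/bin/sh
# Print (do not execute) the commands for a failover test with the
# resource under pacemaker's control. Hypothetical helper; $1 is the
# master/slave resource id.
failover_test_plan() {
    rsc=$1
    echo "pcs resource manage $rsc"     # hand control back to pacemaker
    echo "pcs resource cleanup $rsc"    # clear the recorded failed actions
    echo "crm_mon -1"                   # note which node holds the Master role
    echo "# now power off that node, then run on a surviving node:"
    echo "crm_mon -1"
}

failover_test_plan postgresql-ha
```

Posting the two `crm_mon -1` outputs, before and after, would show whether the master role moved.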

>     last-rc-change='Wed May 30 22:18:16 2018', queued=0ms, exec=190ms
> * postgresql-10-main_monitor_15000 on d-gp2-dbpg0-2 'master (failed)' (9): call=16, status=complete, exitreason='Instance "postgresql-10-main" controldata indicates a running primary instance, the instance has probably crashed',
>     last-rc-change='Wed May 30 22:18:16 2018', queued=0ms, exec=138ms
> ------
> 
>> 3. Run crm_resource with the -VV option and --force-* with whatever
>> action you want to attempt (in this case, demote or stop). The -VV (aka
>> --verbose --verbose) will turn on OCF_TRACE_RA. The --force-* command
>> will read the resource configuration and do the same thing pacemaker
>> would do to execute the command.
> 
> I thought that I would want to see what the "check" is doing to do the check, since you're telling me that it thinks the service is running when it's definitely not.  I tried the following command which didn't work (am I doing something wrong?):
> 
> ------
> root@d-gp2-dbpg0-2:/var# crm_resource -r postgresql-ha -VV --force-check
>  warning: unpack_rsc_op_failure:        Processing failed op monitor for postgresql-10-main:2 on d-gp2-dbpg0-2: master (failed) (9)
>  warning: unpack_rsc_op_failure:        Processing failed op monitor for postgresql-10-main:2 on d-gp2-dbpg0-2: master (failed) (9)
> Operation monitor for postgresql-10-main:0 (ocf:heartbeat:pgsqlms) returned 5
>  >  stderr: Use of uninitialized value $OCF_Functions::ARG[0] in pattern match (m//) at /usr/lib/ocf/resource.d/heartbeat/../../lib/heartbeat/OCF_Functions.pm line 392.
>  >  stderr: ocf-exit-reason:PAF v2.2.0 is compatible with Pacemaker 1.1.13 and greater
> Error performing operation: Input/output error
> ------

This looks like a bug in your version.

> 
> Attempting to force-demote didn't work either:
> 
> ------
> root@d-gp2-dbpg0-2:/var# crm_resource -r postgresql-ha -VV --force-demote
>  warning: unpack_rsc_op_failure:        Processing failed op monitor for postgresql-10-main:2 on d-gp2-dbpg0-2: master (failed) (9)
>  warning: unpack_rsc_op_failure:        Processing failed op monitor for postgresql-10-main:2 on d-gp2-dbpg0-2: master (failed) (9)
> resource postgresql-ha is running on: d-gp2-dbpg0-3 
> resource postgresql-ha is running on: d-gp2-dbpg0-1 
> resource postgresql-ha is running on: d-gp2-dbpg0-2 Master
> It is not safe to demote postgresql-ha here: the cluster claims it is already active
> Try setting target-role=stopped first or specifying --force
> 
> root@d-gp2-dbpg0-2:/var# crm_resource -r postgresql-ha -VV --force-demote --force
>  warning: unpack_rsc_op_failure:        Processing failed op monitor for postgresql-10-main:2 on d-gp2-dbpg0-2: master (failed) (9)
>  warning: unpack_rsc_op_failure:        Processing failed op monitor for postgresql-10-main:2 on d-gp2-dbpg0-2: master (failed) (9)
> resource postgresql-ha is running on: d-gp2-dbpg0-3 
> resource postgresql-ha is running on: d-gp2-dbpg0-1 
> resource postgresql-ha is running on: d-gp2-dbpg0-2 Master
> Operation demote for postgresql-10-main:0 (ocf:heartbeat:pgsqlms) returned 5
>  >  stderr: Use of uninitialized value $OCF_Functions::ARG[0] in pattern match (m//) at /usr/lib/ocf/resource.d/heartbeat/../../lib/heartbeat/OCF_Functions.pm line 392.
>  >  stderr: ocf-exit-reason:PAF v2.2.0 is compatible with Pacemaker 1.1.13 and greater
> Error performing operation: Input/output error
> ------
> 
> Neither did force-stop:
> 
> ------
> root@d-gp2-dbpg0-2:/var# crm_resource -r postgresql-ha -VV --force-stop
>  warning: unpack_rsc_op_failure:        Processing failed op monitor for postgresql-10-main:2 on d-gp2-dbpg0-2: master (failed) (9)
>  warning: unpack_rsc_op_failure:        Processing failed op monitor for postgresql-10-main:2 on d-gp2-dbpg0-2: master (failed) (9)
> Operation stop for postgresql-10-main:0 (ocf:heartbeat:pgsqlms) returned 5
>  >  stderr: Use of uninitialized value $OCF_Functions::ARG[0] in pattern match (m//) at /usr/lib/ocf/resource.d/heartbeat/../../lib/heartbeat/OCF_Functions.pm line 392.
>  >  stderr: ocf-exit-reason:PAF v2.2.0 is compatible with Pacemaker 1.1.13 and greater
> Error performing operation: Input/output error
> ------
> 
> Thanks,
>