[ClusterLabs] PAF not starting resource successfully after node reboot (was: How to set up fencing/stonith)

Thu May 31 16:20:45 UTC 2018

> There is no "master node" in pacemaker. There is master/slave resource
> so at the best it is "node on which specific resource has master role".
> And we have no way to know on which node your resource had master
> role when you did it. Please be more specific, otherwise it is hard to
> impossible to follow.

Well, my limited understanding is that there should be one node that's the master at any point in time.  I don't see how it makes sense to have resources with masters on different nodes in the same cluster.  I'm being as specific as I can given my limited knowledge.  I'm not a developer, just an admin trying to get a simple cluster up and running.  Years ago, I did this same thing with two nodes and heartbeat, and it was very easy.  Anyway, what I mean is that I powered off the node that held the master role for all resources at the time.

> Not specifically related to your problem but I wonder what is the
> difference. For all I know for master/slave "Started" == "Slave" so I'm
> surprised to see two different states listed here.

I also wondered about that, since from PostgreSQL's point of view there is one master and two standbys, which are no different from one another.  But like you said, it didn't seem relevant to my problem.

> Well, apparently resource agent does not like crashed instance. It is
> quite possible, I have been working with another replicated database
> where it was necessary to manually fix configuration after failover,
> *outside* of pacemaker. Pacemaker simply failed to start resource which
> had unexpected state.

I can manually start up the database in standby mode, without any errors or special intervention/fixing whatsoever, as long as the replication logs have not gotten too far ahead on the new master.  If they have, I would need to rebuild the standby.

> This needs someone familiar with this RA and application to answer.

The resource agent is PAF and I've seen a lot of others discussing this on this list, so I hope that I am asking in the right place.

> Note that it is not quite normal use case. You explicitly disabled any
> handling by RA, thus effectively not using pacemaker high availability
> at all. Does it fail over master if you do not unmanage resource and
> kill node where resource has master role?

I was following the specific instructions in the E-mail I was replying to, which asked me to unmanage the resource and try manual debugging steps.  As I've discussed earlier in this thread (please review the previous E-mails for further information), pacemaker does fail over the master.  But when the former master node comes back online, if I run `pcs cluster start` on it without first starting the database by hand, the PAF resource fails to start and pacemaker ends up fencing the node again.

I've been told that what PAF does on resource startup is exactly the same as the manual commands that I can run to make it work.  In the prior E-mails on this thread, I was told that the resource startup fails because the resource agent incorrectly determines that the resource is already running when it's not, so it never even tries to start the resource.  The debug instructions I'm following are meant to figure out what command it runs to determine this state.  Failing over to another node is only half the battle: the failed node should be able to rejoin the cluster without the cluster immediately fencing it, shouldn't it?

>> ------
>> root@d-gp2-dbpg0-2:/var# crm_resource -r postgresql-ha -VV --force-check
>> warning: unpack_rsc_op_failure:        Processing failed op monitor for postgresql-10-main:2 on d-gp2-dbpg0-2: master (failed) (9)
>> warning: unpack_rsc_op_failure:        Processing failed op monitor for postgresql-10-main:2 on d-gp2-dbpg0-2: master (failed) (9)
>> Operation monitor for postgresql-10-main:0 (ocf:heartbeat:pgsqlms) returned 5
>>> stderr: Use of uninitialized value $OCF_Functions::ARG[0] in pattern match (m//) at /usr/lib/ocf/resource.d/heartbeat/../../lib/heartbeat/OCF_Functions.pm line 392.
>>> stderr: ocf-exit-reason:PAF v2.2.0 is compatible with Pacemaker 1.1.13 and greater
>> Error performing operation: Input/output error
>> ------
> 
> This looks like a bug in your version.

Version of what?  I'm using the corosync, pacemaker, and pcs versions as provided by Ubuntu (for version 16.04), and resource-agents-paf as provided by the PGDG repository.

These versions are as follows:
* corosync - 2.3.5-3ubuntu2
* pacemaker - 1.1.14-2ubuntu1.3
* pcs - 0.9.149-1ubuntu1.1
* resource-agents-paf - 2.2.0-2.pgdg16.04+1

These are the latest packaged versions available for my platform, as far as I'm aware, and presumably the same ones other Ubuntu 16.04 users on this list are running.
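
For what it's worth, the exit reason in the output above names Pacemaker 1.1.13 as the minimum, and the packaged 1.1.14 should satisfy that.  Here is a minimal sketch of that comparison, assuming GNU `sort -V` is available (it is in coreutils on Ubuntu 16.04).  The `version_ge` helper is a name I made up for illustration, and the version strings are hard-coded from the list above rather than queried from the system:

```sh
#!/bin/sh
# version_ge A B: succeed if version A >= version B.
# Hypothetical helper; relies on GNU sort's -V (version sort) option.
version_ge() {
    # After a version sort, the smaller version comes first.
    # A >= B holds exactly when that first line is B.
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

installed="1.1.14"   # upstream part of pacemaker 1.1.14-2ubuntu1.3
required="1.1.13"    # minimum named in the PAF exit reason

if version_ge "$installed" "$required"; then
    echo "pacemaker $installed satisfies >= $required"
else
    echo "pacemaker $installed is older than $required"
fi
```

Run as-is, this takes the first branch, so on the face of it the installed Pacemaker is not too old for PAF 2.2.0.  This only compares the package version strings; it does not show what the resource agent itself looks at when it makes that check.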

Regards,
-- 
Casey