[ClusterLabs] PAF not starting resource successfully after node reboot (was: How to set up fencing/stonith)

Thu May 31 15:26:07 EDT 2018

31.05.2018 19:20, Casey & Gina wrote:
>> There is no "master node" in pacemaker. There is master/slave
>> resource so at the best it is "node on which specific resource has
>> master role". And we have no way to know on which node your
>> resource had master role when you did it. Please be more specific,
>> otherwise it is hard to impossible to follow.
> 
> Well my limited understanding is that there should be one node that's
> the master at any point in time.  I don't see how it makes sense to
> have resources with masters on different nodes in the same clusters.

It is entirely possible and useful for different resources to have
master role on different nodes at the same time. "Master" simply denotes
one of two possible states; it does not convey any additional semantics.

> I'm being as specific as I can given my limited knowledge.  I'm not a
> developer; just an admin trying to get a simple cluster up and
> running.  Years ago, I did this same thing with two nodes and
> heartbeat, and it was very easy.  Anyways, I guess I mean that I
> powered off the node that was the master for all resources at the
> time.
> 
>> Not specifically related to your problem but I wonder what is the 
>> difference. For all I know for master/slave "Started" == "Slave" so
>> I'm surprised to see two different states listed here.
> 
> I also wondered about that, since from PostgreSQL's point of view there is one
> master and two standbys which are no different from one another.  But
> like you said, it didn't seem relevant to my problem.
> 
>> Well, apparently resource agent does not like crashed instance. It
>> is quite possible, I have been working with another replicated
>> database where it was necessary to manually fix configuration after
>> failover, *outside* of pacemaker. Pacemaker simply failed to start
>> resource which had unexpected state.
> 
> I can manually start up the database in standby mode, without any
> errors or special intervention/fixing whatsoever, as long as the
> replication logs have not gotten too far ahead on the new master.  In
> that case I would need to rebuild the standby.
> 
>> This needs someone familiar with this RA and application to
>> answer.
> 
> The resource agent is PAF and I've seen a lot of others discussing
> this on this list, so I hope that I am asking in the right place.
> 

Sure, hopefully the right person chimes in.

>> Note that it is not quite normal use case. You explicitly disabled
>> any handling by RA, thus effectively not using pacemaker high
>> availability at all. Does it fail over master if you do not
>> unmanage resource and kill node where resource has master role?
> 
> I was following the specific instructions in the E-mail I was
> replying to, which asked me to unmanage the resource and try manual
> debugging steps.  As I've discussed in this thread (please review the
> previous E-mails on this thread for further information), pacemaker
> does fail over the master, but then when the former master node comes
> back online, if I do a `pcs cluster start` on it without manually
> starting up the database by hand, it fails to start the PAF resource
> and pacemaker ends up fencing the node again.
> 

Well, it means you now have a new primary database instance (which was
failed over by pacemaker) on node A and the old primary database instance on
node B, which you now start. On node B it remains primary because that
was the state in which the node was killed. It is quite logical that the attempt
to start the resource (and hence the database instance) fails.
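
To see which state the old instance on node B was left in, one can look at
the "Database cluster state" line printed by pg_controldata. Below is a
minimal sketch of such a check; the sample text is an abbreviated,
illustrative excerpt (not output from this cluster), the function names are
made up for this example, it assumes English (C locale) output, and it is not
meant to reproduce what the PAF agent does internally.

```python
import re

# Abbreviated, illustrative pg_controldata excerpt (assumption, not real output
# from the cluster discussed in this thread).
SAMPLE_CONTROLDATA = """\
Database cluster state:               in production
"""


def cluster_state(controldata_output: str) -> str:
    """Return the value of the 'Database cluster state' line."""
    match = re.search(
        r"^Database cluster state:\s*(.+?)\s*$",
        controldata_output,
        re.MULTILINE,
    )
    if match is None:
        raise ValueError("no 'Database cluster state' line found")
    return match.group(1)


def was_killed_as_primary(state: str) -> bool:
    """True if the instance was running as a primary and never shut down cleanly."""
    # A cleanly stopped primary reports "shut down"; a standby reports
    # "in archive recovery" or "shut down in recovery".
    return state == "in production"


state = cluster_state(SAMPLE_CONTROLDATA)
print(state)                         # in production
print(was_killed_as_primary(state))  # True
```

On a node that was powered off while holding the master role, I would expect
this to report "in production", which matches the situation described above.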

A quick look at the PAF manual gives:

  "you need to rebuild the PostgreSQL instance on the failed node"

Did you do it? I am not intimately familiar with Postgres, but in this
case I expect that you need to make the database on node B secondary (slave,
whatever it is called) to the new master on node A. That is exactly what I
described as "manually fixing configuration outside of pacemaker".
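
As a rough illustration of what such a rebuild could look like, here is a
dry-run sketch that only builds and prints the commands instead of running
them. Everything passed to the function is an assumption for the example: the
host name "node-a", the replication user "replicator" and the data directory
(the usual Debian/Ubuntu location for a PostgreSQL 10 "main" cluster) are
placeholders, and the function name is hypothetical. The pg_basebackup step
would normally be run as the postgres user, and any PAF-specific configuration
is outside this sketch.

```python
import shlex


def rebuild_standby_plan(primary_host: str, data_dir: str, replication_user: str):
    """Return the commands (as argument lists) to rebuild a failed node as a standby."""
    return [
        # Keep the old primary's data aside; pg_basebackup needs an empty target.
        ["mv", data_dir, data_dir + ".old"],
        # Take a fresh base backup from the new primary, streaming WAL alongside.
        ["pg_basebackup", "-h", primary_host, "-U", replication_user,
         "-D", data_dir, "-X", "stream", "-P"],
        # Let the node rejoin the cluster so the agent can start the instance.
        ["pcs", "cluster", "start"],
    ]


# All values below are placeholders, not taken from the cluster in this thread.
plan = rebuild_standby_plan("node-a", "/var/lib/postgresql/10/main", "replicator")
for command in plan:
    print(shlex.join(command))
```

Printing the plan rather than executing it is deliberate: the exact steps
depend on the local setup and should be reviewed before anything is run on
node B.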

> I've been told that what PAF does on resource startup is exactly the
> same as the manual commands that I can do to make it work.  In the
> prior E-mails on this thread, I was told that the reason the resource
> startup fails is because the resource agent is incorrectly
> determining that the resource is already running when it's not - so
> it's never even trying to start the resource at all.  The debug
> instructions I'm attempting to follow are in an attempt to figure out
> what command it is running to determine this state.  Fail over to
> another node is only half the battle - the failed node should be able
> to rejoin the cluster without the cluster immediately fencing it when
> I try, shouldn't it?
> 
>>> ------
>>> root@d-gp2-dbpg0-2:/var# crm_resource -r postgresql-ha -VV --force-check
>>> warning: unpack_rsc_op_failure: Processing failed op monitor for postgresql-10-main:2 on d-gp2-dbpg0-2: master (failed) (9)
>>> warning: unpack_rsc_op_failure: Processing failed op monitor for postgresql-10-main:2 on d-gp2-dbpg0-2: master (failed) (9)
>>> Operation monitor for postgresql-10-main:0 (ocf:heartbeat:pgsqlms) returned 5
>>>> stderr: Use of uninitialized value $OCF_Functions::ARG[0] in pattern match (m//) at /usr/lib/ocf/resource.d/heartbeat/../../lib/heartbeat/OCF_Functions.pm line 392.
>>>> stderr: ocf-exit-reason:PAF v2.2.0 is compatible with Pacemaker 1.1.13 and greater
>>> Error performing operation: Input/output error
>>> ------
>> 
>> This looks like a bug in your version.
> 
> Version of what?  I'm using the corosync, pacemaker, and pcs versions
> as provided by Ubuntu (for version 16.04), and resource-agents-paf as
> provided by the PGDG repository.
> 
> These versions are as follows:
> * corosync - 2.3.5-3ubuntu2
> * pacemaker - 1.1.14-2ubuntu1.3
> * pcs - 0.9.149-1ubuntu1.1
> * resource-agents-paf - 2.2.0-2.pgdg16.04+1
> 
> These are the latest packaged versions available for my platform, as
> far as I'm aware, and the same as I presume other Ubuntu users on
> this list are running.
> 

Pacemaker is too old. The error most likely comes from the missing
OCF_RESKEY_crm_feature_set variable, which crm_resource exports only
starting with 1.1.17. I am not that familiar with Debian packaging, but I'd
expect resource-agents-paf to require a suitable Pacemaker version. Of course,
the Ubuntu package may be patched to include the necessary code ...
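
To make the version comparison concrete, here is a small sketch that checks a
Debian-style package version against the 1.1.17 threshold mentioned above.
The function and constant names are made up for this example, and the check
looks only at the upstream version number, so it cannot tell whether a
distribution package carries a backported patch.

```python
import re

# Threshold taken from the message above: crm_resource is said to export
# OCF_RESKEY_crm_feature_set starting with Pacemaker 1.1.17.
FIRST_VERSION_EXPORTING_FEATURE_SET = (1, 1, 17)


def upstream_version(package_version: str) -> tuple:
    """Extract the numeric upstream version from a Debian-style package version."""
    # Drop an optional epoch ("1:...") and the packaging revision after the last hyphen.
    without_epoch = package_version.split(":", 1)[-1]
    upstream = without_epoch.rsplit("-", 1)[0]
    return tuple(int(part) for part in re.findall(r"\d+", upstream))


def exports_feature_set(package_version: str) -> bool:
    """True if the upstream version is at or above the 1.1.17 threshold."""
    return upstream_version(package_version) >= FIRST_VERSION_EXPORTING_FEATURE_SET


print(upstream_version("1.1.14-2ubuntu1.3"))     # (1, 1, 14)
print(exports_feature_set("1.1.14-2ubuntu1.3"))  # False
```

For the pacemaker package listed earlier in this thread the check comes out
False, which is consistent with the error seen when running crm_resource
with --force-check.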