[ClusterLabs] How to check if a resource on a cluster node is really back on after a crash
Ludovic Vaugeois-Pepin
ludovicvp at gmail.com
Thu May 11 16:00:12 EDT 2017
Hi
I translated a PostgreSQL multi-state RA (https://github.com/dalibo/PAF)
into Python (https://github.com/ulodciv/deploy_cluster), and I have been
editing it heavily.
In parallel I am writing unit tests and functional tests.
I am having an issue with a functional test that abruptly powers off a
slave, say "test3" (a hot standby PG instance). Later on I power the
slave back on. Once it is up, I run "pcs cluster start test3". And this
is where I start having a problem.
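For context, a rough outline of that test sequence might look like the
following sketch. The VM power helpers are placeholders I made up for
whatever the test harness actually does; only the pcs call comes from the
description above:

    import subprocess

    def power_off_vm(node):
        """Placeholder: abruptly power off the machine hosting `node` (harness-specific)."""
        raise NotImplementedError

    def power_on_vm(node):
        """Placeholder: power the machine hosting `node` back on (harness-specific)."""
        raise NotImplementedError

    def crash_and_recover_slave(node="test3"):
        power_off_vm(node)   # abrupt power-off of the hot standby
        power_on_vm(node)    # later on, bring the machine back up
        # once the OS is up again, rejoin the node to the cluster
        subprocess.check_call(["pcs", "cluster", "start", node])
        # then poll "pcs status xml" until the node is a slave again
        # (see the polling sketch further below)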
Every second I check the output of "pcs status xml" until test3 is reported
to be ready as a slave again (a sketch of this check follows the XML extract
below). In the following extract, I take it that test3 is ready as a slave:
<nodes>
  <node name="test1" id="1" online="true" standby="false" standby_onfail="false"
        maintenance="false" pending="false" unclean="false" shutdown="false"
        expected_up="true" is_dc="false" resources_running="2" type="member" />
  <node name="test2" id="2" online="true" standby="false" standby_onfail="false"
        maintenance="false" pending="false" unclean="false" shutdown="false"
        expected_up="true" is_dc="true" resources_running="1" type="member" />
  <node name="test3" id="3" online="true" standby="false" standby_onfail="false"
        maintenance="false" pending="false" unclean="false" shutdown="false"
        expected_up="true" is_dc="false" resources_running="1" type="member" />
</nodes>
<resources>
  <clone id="pgsql-ha" multi_state="true" unique="false" managed="true"
         failed="false" failure_ignored="false" >
    <resource id="pgsqld" resource_agent="ocf::heartbeat:pgha" role="Slave"
              active="true" orphaned="false" managed="true" failed="false"
              failure_ignored="false" nodes_running_on="1" >
      <node name="test3" id="3" cached="false"/>
    </resource>
    <resource id="pgsqld" resource_agent="ocf::heartbeat:pgha" role="Master"
              active="true" orphaned="false" managed="true" failed="false"
              failure_ignored="false" nodes_running_on="1" >
      <node name="test1" id="1" cached="false"/>
    </resource>
    <resource id="pgsqld" resource_agent="ocf::heartbeat:pgha" role="Slave"
              active="true" orphaned="false" managed="true" failed="false"
              failure_ignored="false" nodes_running_on="1" >
      <node name="test2" id="2" cached="false"/>
    </resource>
  </clone>
</resources>
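This is roughly how the per-second check on "pcs status xml" could look. The
function names and the exact readiness criteria (node online plus an active
Slave instance of pgsqld on that node) are my assumptions, not something
prescribed by Pacemaker:

    import subprocess
    import time
    import xml.etree.ElementTree as ET

    def is_slave_ready(node_name="test3", resource_id="pgsqld"):
        """True if node_name is online and resource_id runs as Slave on it."""
        xml_out = subprocess.check_output(["pcs", "status", "xml"]).decode()
        root = ET.fromstring(xml_out)
        # the node itself must be online and clean
        node = root.find(".//nodes/node[@name='%s']" % node_name)
        if node is None or node.get("online") != "true" or node.get("unclean") == "true":
            return False
        # the multi-state clone must have an active Slave instance on that node
        for res in root.findall(".//clone/resource[@id='%s']" % resource_id):
            if (res.get("role") == "Slave" and res.get("active") == "true"
                    and res.find("node[@name='%s']" % node_name) is not None):
                return True
        return False

    def wait_until_slave(node_name="test3", timeout=120):
        """Poll once per second until the node looks like a ready slave again."""
        deadline = time.time() + timeout
        while time.time() < deadline:
            if is_slave_ready(node_name):
                return True
            time.sleep(1)
        return False

As described below, a check along these lines apparently returns true too
early: the XML already shows the node online and the clone instance as Slave
before the RA's monitor/stop/start sequence has actually run.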
By "ready" I mean that, after running "pcs cluster start test3", the
following sequence of RA calls should complete before test3 is reported as
ready in the XML:
pcs cluster start test3
monitor -> RA returns unknown error (1)
notify/pre-stop -> RA returns ok (0)
stop -> RA returns ok (0)
start -> RA returns ok (0)
The problem I have is that between "pcs cluster start test3" and the
"monitor" call, the XML returned by "pcs status xml" already reports test3
as ready (the extract above is what I get at that moment). Once "monitor"
runs, the returned XML shows test3 as offline, and only when the start has
finished is test3 reported as ready again.
Am I getting anything wrong? Is there a simpler or better way to check
whether test3 is fully functional again, i.e. that the OCF start was
successful?
Thanks
Ludovic