[ClusterLabs] How to check if a resource on a cluster node is really back on after a crash

Ken Gaillot kgaillot at redhat.com
Thu May 11 22:45:06 UTC 2017


On 05/11/2017 03:00 PM, Ludovic Vaugeois-Pepin wrote:
> Hi
> I translated the PostgreSQL multi-state RA
> (https://github.com/dalibo/PAF) into Python
> (https://github.com/ulodciv/deploy_cluster), and I have been editing
> it heavily.
> 
> In parallel I am writing unit tests and functional tests.
> 
> I am having an issue with a functional test that abruptly powers off
> a slave named, say, "host3" (a hot standby PG instance). Later on I
> power the slave back on. Once it is up, I run "pcs cluster start
> host3". And this is where the problem starts.
> 
> I check the output of "pcs status xml" every second until host3 is
> reported as ready as a slave again. In the following I assume that
> test3 is ready as a slave:
> 
>     <nodes>
>         <node name="test1" id="1" online="true" standby="false"
> standby_onfail="false" maintenance="false" pending="false"
> unclean="false" shutdown="false" expected_up="true" is_dc="false"
> resources_running="2" type="member" />
>         <node name="test2" id="2" online="true" standby="false"
> standby_onfail="false" maintenance="false" pending="false"
> unclean="false" shutdown="false" expected_up="true" is_dc="true"
> resources_running="1" type="member" />
>         <node name="test3" id="3" online="true" standby="false"
> standby_onfail="false" maintenance="false" pending="false"
> unclean="false" shutdown="false" expected_up="true" is_dc="false"
> resources_running="1" type="member" />
>     </nodes>

The <nodes> section says nothing about the current state of the nodes;
look at the <node_state> entries for that. The in_ccm attribute reflects
membership at the cluster stack (corosync) level, and crmd reflects the
Pacemaker level -- both need to be up.
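For example, dumping the status section with "cibadmin -Q -o status"
should show entries along these lines (the attribute values below are
illustrative, using the Pacemaker 1.1 names, not taken from your
cluster):

    <node_state id="3" uname="test3" in_ccm="true" crmd="online"
        join="member" expected="member" ...>

The node has fully rejoined only once in_ccm="true", crmd="online" and
join="member".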

>     <resources>
>         <clone id="pgsql-ha" multi_state="true" unique="false"
> managed="true" failed="false" failure_ignored="false" >
>             <resource id="pgsqld" resource_agent="ocf::heartbeat:pgha"
> role="Slave" active="true" orphaned="false" managed="true"
> failed="false" failure_ignored="false" nodes_running_on="1" >
>                 <node name="test3" id="3" cached="false"/>
>             </resource>
>             <resource id="pgsqld" resource_agent="ocf::heartbeat:pgha"
> role="Master" active="true" orphaned="false" managed="true"
> failed="false" failure_ignored="false" nodes_running_on="1" >
>                 <node name="test1" id="1" cached="false"/>
>             </resource>
>             <resource id="pgsqld" resource_agent="ocf::heartbeat:pgha"
> role="Slave" active="true" orphaned="false" managed="true"
> failed="false" failure_ignored="false" nodes_running_on="1" >
>                 <node name="test2" id="2" cached="false"/>
>             </resource>
>         </clone>
>     </resources>
> 
> By "ready" I mean that after running "pcs cluster start test3", the
> following occurs before test3 appears ready in the XML:
> 
> pcs cluster start test3
> monitor          -> RA returns unknown error (1)
> notify/pre-stop  -> RA returns ok (0)
> stop             -> RA returns ok (0)
> start            -> RA returns ok (0)
> 
> The problem I have is that between "pcs cluster start test3" and
> "monitor", the XML returned by "pcs status xml" says test3 is ready
> (the XML extract above is what I get at that moment). Once "monitor"
> occurs, the returned XML shows test3 as offline, and it is not until
> the start finishes that test3 is shown as ready again.
> 
> Am I getting anything wrong? Is there a simpler or better way to
> check whether test3 is fully functional again, i.e. that the OCF
> start was successful?
> 
> Thanks
> 
> Ludovic

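To answer the last question: one approach is to poll the cluster until
the node's <node_state> shows it joined at both levels and crm_mon
(essentially what "pcs status xml" wraps) reports the pgsqld instance
in the Slave role on that node. Below is a minimal sketch in Python,
assuming Pacemaker 1.1-style attribute names and the node/resource
names from your example (test3, pgsqld); treat it as an illustration
rather than a definitive implementation:

    import subprocess
    import time
    import xml.etree.ElementTree as ET

    def node_joined(status, node_name):
        # <node_state> attributes as in Pacemaker 1.1: in_ccm (cluster
        # stack membership), crmd (Pacemaker level), join (join state).
        for ns in status.iter("node_state"):
            if ns.get("uname") == node_name:
                return (ns.get("in_ccm") == "true"
                        and ns.get("crmd") == "online"
                        and ns.get("join") == "member")
        return False

    def slave_on_node(mon, rsc_id, node_name):
        # crm_mon XML: clone instances appear as <resource id="pgsqld"
        # role="Slave"> with a <node name="..."/> child.
        for rsc in mon.iter("resource"):
            if rsc.get("id") == rsc_id and rsc.get("role") == "Slave":
                for node in rsc.iter("node"):
                    if node.get("name") == node_name:
                        return True
        return False

    def wait_for_standby(node_name, rsc_id="pgsqld", timeout=300):
        deadline = time.time() + timeout
        while time.time() < deadline:
            status = ET.fromstring(subprocess.check_output(
                ["cibadmin", "-Q", "-o", "status"]))
            mon = ET.fromstring(subprocess.check_output(
                ["pcs", "status", "xml"]))
            if (node_joined(status, node_name)
                    and slave_on_node(mon, rsc_id, node_name)):
                return True
            time.sleep(1)
        return False

    # e.g. after "pcs cluster start test3":
    # wait_for_standby("test3")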