[ClusterLabs] How to check if a resource on a cluster node is really back on after a crash

Fri May 12 07:17:03 EDT 2017

I checked the node_state of the node that was killed and brought back
(test3). It shows in_ccm == true and crmd == online during the second or
two between "pcs cluster start test3" and the first "monitor":

    <node_state id="3" uname="test3" in_ccm="true" crmd="online"
crm-debug-origin="peer_update_callback" join="member" expected="member">
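
A minimal Python sketch of that check, assuming the CIB has been saved with
"pcs cluster cib foo.xml" and read into a string. The function name
node_is_up and the sample XML are hypothetical, for illustration only. Given
the observation above, a True result only says that the node has rejoined
the cluster, not that its resources have been started.

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed-down CIB containing only the node_state shown above.
SAMPLE_CIB = """<cib>
  <status>
    <node_state id="3" uname="test3" in_ccm="true" crmd="online"
                crm-debug-origin="peer_update_callback"
                join="member" expected="member"/>
  </status>
</cib>"""


def node_is_up(cib_xml: str, uname: str) -> bool:
    """Return True if the node_state for uname has in_ccm="true" and crmd="online"."""
    root = ET.fromstring(cib_xml)
    for state in root.iter("node_state"):
        if state.get("uname") == uname:
            # Both the cluster stack level and the pacemaker level must be up.
            return state.get("in_ccm") == "true" and state.get("crmd") == "online"
    # No node_state entry was found for this node name.
    return False


if __name__ == "__main__":
    print(node_is_up(SAMPLE_CIB, "test3"))
```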

On Fri, May 12, 2017 at 11:27 AM, Ludovic Vaugeois-Pepin <
ludovicvp at gmail.com> wrote:

> Yes, I haven't been using the "nodes" element in the XML, only the
> "resources" element. I couldn't find "node_state" elements or attributes
> in the XML, so after some searching I found that they are in the CIB, which
> can be obtained with "pcs cluster cib foo.xml". I will start exploring this
> as an alternative to crm_mon/"pcs status".
>
>
> However, I still find what happens confusing, so below I try to explain
> more clearly what I see:
>
>
> Before "pcs cluster start test3", at 10:45:36.362 (test3 was powered off
> at the hardware level about a minute earlier):
>
> crm_mon -1:
>
>     Stack: corosync
>     Current DC: test1 (version 1.1.15-11.el7_3.4-e174ec8) - partition with
> quorum
>     Last updated: Fri May 12 10:45:36 2017          Last change: Fri May
> 12 09:18:13 2017 by root via crm_attribute on test1
>
>     3 nodes and 4 resources configured
>
>     Online: [ test1 test2 ]
>     OFFLINE: [ test3 ]
>
>     Active resources:
>
>      Master/Slave Set: pgsql-ha [pgsqld]
>          Masters: [ test1 ]
>          Slaves: [ test2 ]
>      pgsql-master-ip        (ocf::heartbeat:IPaddr2):       Started test1
>
>
> crm_mon -X:
>
>     <resources>
>     <clone id="pgsql-ha" multi_state="true" unique="false" managed="true"
> failed="false" failure_ignored="false" >
>         <resource id="pgsqld" resource_agent="ocf::heartbeat:pgha"
> role="Master" active="true" orphaned="false" managed="true" failed="false"
> failure_ignored="false" nodes_running_on="1" >
>             <node name="test1" id="1" cached="false"/>
>         </resource>
>         <resource id="pgsqld" resource_agent="ocf::heartbeat:pgha"
> role="Slave" active="true" orphaned="false" managed="true" failed="false"
> failure_ignored="false" nodes_running_on="1" >
>             <node name="test2" id="2" cached="false"/>
>         </resource>
>         <resource id="pgsqld" resource_agent="ocf::heartbeat:pgha"
> role="Stopped" active="false" orphaned="false" managed="true"
> failed="false" failure_ignored="false" nodes_running_on="0" />
>     </clone>
>     <resource id="pgsql-master-ip" resource_agent="ocf::heartbeat:IPaddr2"
> role="Started" active="true" orphaned="false" managed="true" failed="false"
> failure_ignored="false" nodes_running_on="1" >
>         <node name="test1" id="1" cached="false"/>
>     </resource>
>     </resources>
>
>
>
> At 10:45:39.440, after "pcs cluster start test3" and before the first
> "monitor" on test3 (this is where I cannot tell that the resources on test3
> are down):
>
> crm_mon -1:
>
>     Stack: corosync
>     Current DC: test1 (version 1.1.15-11.el7_3.4-e174ec8) - partition with
> quorum
>     Last updated: Fri May 12 10:45:39 2017          Last change: Fri May
> 12 10:45:39 2017 by root via crm_attribute on test1
>
>     3 nodes and 4 resources configured
>
>     Online: [ test1 test2 test3 ]
>
>     Active resources:
>
>      Master/Slave Set: pgsql-ha [pgsqld]
>          Masters: [ test1 ]
>          Slaves: [ test2 test3 ]
>      pgsql-master-ip        (ocf::heartbeat:IPaddr2):       Started test1
>
>
> crm_mon -X:
>
>     <resources>
>     <clone id="pgsql-ha" multi_state="true" unique="false" managed="true"
> failed="false" failure_ignored="false" >
>         <resource id="pgsqld" resource_agent="ocf::heartbeat:pgha"
> role="Master" active="true" orphaned="false" managed="true" failed="false"
> failure_ignored="false" nodes_running_on="1" >
>             <node name="test1" id="1" cached="false"/>
>         </resource>
>         <resource id="pgsqld" resource_agent="ocf::heartbeat:pgha"
> role="Slave" active="true" orphaned="false" managed="true" failed="false"
> failure_ignored="false" nodes_running_on="1" >
>             <node name="test2" id="2" cached="false"/>
>         </resource>
>         <resource id="pgsqld" resource_agent="ocf::heartbeat:pgha"
> role="Slave" active="true" orphaned="false" managed="true" failed="false"
> failure_ignored="false" nodes_running_on="1" >
>             <node name="test3" id="3" cached="false"/>
>         </resource>
>     </clone>
>     <resource id="pgsql-master-ip" resource_agent="ocf::heartbeat:IPaddr2"
> role="Started" active="true" orphaned="false" managed="true" failed="false"
> failure_ignored="false" nodes_running_on="1" >
>         <node name="test1" id="1" cached="false"/>
>     </resource>
>     </resources>
>
>
>
> At 10:45:41.606, after the first "monitor" on test3 (I can now tell that
> the resources on test3 are not ready):
>
> crm_mon -1:
>
>     Stack: corosync
>     Current DC: test1 (version 1.1.15-11.el7_3.4-e174ec8) - partition with
> quorum
>     Last updated: Fri May 12 10:45:41 2017          Last change: Fri May
> 12 10:45:39 2017 by root via crm_attribute on test1
>
>     3 nodes and 4 resources configured
>
>     Online: [ test1 test2 test3 ]
>
>     Active resources:
>
>      Master/Slave Set: pgsql-ha [pgsqld]
>          Masters: [ test1 ]
>          Slaves: [ test2 ]
>      pgsql-master-ip        (ocf::heartbeat:IPaddr2):       Started test1
>
>
> crm_mon -X:
>
>     <resources>
>     <clone id="pgsql-ha" multi_state="true" unique="false" managed="true"
> failed="false" failure_ignored="false" >
>         <resource id="pgsqld" resource_agent="ocf::heartbeat:pgha"
> role="Master" active="true" orphaned="false" managed="true" failed="false"
> failure_ignored="false" nodes_running_on="1" >
>             <node name="test1" id="1" cached="false"/>
>         </resource>
>         <resource id="pgsqld" resource_agent="ocf::heartbeat:pgha"
> role="Slave" active="true" orphaned="false" managed="true" failed="false"
> failure_ignored="false" nodes_running_on="1" >
>             <node name="test2" id="2" cached="false"/>
>         </resource>
>         <resource id="pgsqld" resource_agent="ocf::heartbeat:pgha"
> role="Stopped" active="false" orphaned="false" managed="true"
> failed="false" failure_ignored="false" nodes_running_on="0" />
>     </clone>
>     <resource id="pgsql-master-ip" resource_agent="ocf::heartbeat:IPaddr2"
> role="Started" active="true" orphaned="false" managed="true" failed="false"
> failure_ignored="false" nodes_running_on="1" >
>         <node name="test1" id="1" cached="false"/>
>     </resource>
>     </resources>
>
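
The manual comparison of the three "crm_mon -X" snapshots above could be
scripted roughly as follows. This is only a sketch: the function name
roles_on_node and the trimmed sample XML are hypothetical, and the function
merely reports what crm_mon says, so in the window before the first
"monitor" it would still return Slave for test3.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample, trimmed from the "after first monitor" snapshot above.
SAMPLE = """<resources>
  <clone id="pgsql-ha" multi_state="true">
    <resource id="pgsqld" role="Master" active="true" nodes_running_on="1">
      <node name="test1" id="1" cached="false"/>
    </resource>
    <resource id="pgsqld" role="Slave" active="true" nodes_running_on="1">
      <node name="test2" id="2" cached="false"/>
    </resource>
    <resource id="pgsqld" role="Stopped" active="false" nodes_running_on="0"/>
  </clone>
</resources>"""


def roles_on_node(crm_mon_xml: str, resource_id: str, node_name: str) -> list:
    """List the roles of active instances of resource_id reported on node_name."""
    root = ET.fromstring(crm_mon_xml)
    roles = []
    for res in root.iter("resource"):
        if res.get("id") != resource_id or res.get("active") != "true":
            continue
        # An instance counts for the node only if it has a matching <node> child.
        if any(n.get("name") == node_name for n in res.findall("node")):
            roles.append(res.get("role"))
    return roles


if __name__ == "__main__":
    for name in ("test1", "test2", "test3"):
        print(name, roles_on_node(SAMPLE, "pgsqld", name))
```
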
> On Fri, May 12, 2017 at 12:45 AM, Ken Gaillot <kgaillot at redhat.com> wrote:
>
>> On 05/11/2017 03:00 PM, Ludovic Vaugeois-Pepin wrote:
>> > Hi
>> > I translated a PostgreSQL multi-state RA
>> > (https://github.com/dalibo/PAF) into Python
>> > (https://github.com/ulodciv/deploy_cluster), and I have been editing it
>> > heavily.
>> >
>> > In parallel I am writing unit tests and functional tests.
>> >
>> > I am having an issue with a functional test that abruptly powers off a
>> > slave named, say, "host3" (a hot standby PG instance). Later on I start
>> > the slave back up. Once it is started, I run "pcs cluster start host3".
>> > And this is where I start having a problem.
>> >
>> > I check the output of "pcs status xml" every second until host3 is
>> > reported to be ready as a slave again. In the following I assume that
>> > test3 is ready as a slave:
>> >
>> >     <nodes>
>> >         <node name="test1" id="1" online="true" standby="false"
>> > standby_onfail="false" maintenance="false" pending="false"
>> > unclean="false" shutdown="false" expected_up="true" is_dc="false"
>> > resources_running="2" type="member" />
>> >         <node name="test2" id="2" online="true" standby="false"
>> > standby_onfail="false" maintenance="false" pending="false"
>> > unclean="false" shutdown="false" expected_up="true" is_dc="true"
>> > resources_running="1" type="member" />
>> >         <node name="test3" id="3" online="true" standby="false"
>> > standby_onfail="false" maintenance="false" pending="false"
>> > unclean="false" shutdown="false" expected_up="true" is_dc="false"
>> > resources_running="1" type="member" />
>> >     </nodes>
>>
>> The <nodes> section says nothing about the current state of the nodes.
>> Look at the <node_state> entries for that. in_ccm means the cluster
>> stack level, and crmd means the pacemaker level -- both need to be up.
>>
>> >     <resources>
>> >         <clone id="pgsql-ha" multi_state="true" unique="false"
>> > managed="true" failed="false" failure_ignored="false" >
>> >             <resource id="pgsqld" resource_agent="ocf::heartbeat:pgha"
>> > role="Slave" active="true" orphaned="false" managed="true"
>> > failed="false" failure_ignored="false" nodes_running_on="1" >
>> >                 <node name="test3" id="3" cached="false"/>
>> >             </resource>
>> >             <resource id="pgsqld" resource_agent="ocf::heartbeat:pgha"
>> > role="Master" active="true" orphaned="false" managed="true"
>> > failed="false" failure_ignored="false" nodes_running_on="1" >
>> >                 <node name="test1" id="1" cached="false"/>
>> >             </resource>
>> >             <resource id="pgsqld" resource_agent="ocf::heartbeat:pgha"
>> > role="Slave" active="true" orphaned="false" managed="true"
>> > failed="false" failure_ignored="false" nodes_running_on="1" >
>> >                 <node name="test2" id="2" cached="false"/>
>> >             </resource>
>> >         </clone>
>> > By ready to go I mean that upon running "pcs cluster start test3", the
>> > following occurs before test3 appears ready in the XML:
>> >
>> > pcs cluster start test3
>> > monitor          -> RA returns unknown error (1)
>> > notify/pre-stop  -> RA returns ok (0)
>> > stop             -> RA returns ok (0)
>> > start            -> RA returns ok (0)
>> >
>> > The problem I have is that between "pcs cluster start test3" and
>> > "monitor", it seems that the XML returned by "pcs status xml" says test3
>> > is ready (the XML extract above is what I get at that moment). Once
>> > "monitor" occurs, the returned XML shows test3 to be offline, and not
>> > until the start is finished do I once again have test3 shown as ready.
>> >
>> > Am I getting anything wrong? Is there a simpler or better way to check
>> > if test3 is fully functional again, i.e. that the OCF start was successful?
>> >
>> > Thanks
>> >
>> > Ludovic
>>
>
>
>
> --
> Ludovic Vaugeois-Pepin
>

-- 
Ludovic Vaugeois-Pepin