[ClusterLabs] How to check if a resource on a cluster node is really back on after a crash

Ken Gaillot kgaillot at redhat.com
Fri May 12 19:22:22 UTC 2017


Another possibility you might want to look into is alerts. Pacemaker can
call a script of your choosing whenever a resource is started or
stopped. See:

http://clusterlabs.org/doc/en-US/Pacemaker/1.1-pcs/html-single/Pacemaker_Explained/index.html#idm139683940283296

for the concepts, and the pcs man page for the "pcs alert" interface.
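
As a concrete illustration, a minimal alert handler could look like the
sketch below. Treat it as a sketch only: the script path and the choice
of Python are assumptions (the RA discussed below happens to be Python),
while the CRM_alert_* environment variables are what Pacemaker exports
to alert agents.

    #!/usr/bin/env python3
    # Hypothetical alert handler: log resource start/stop results to syslog.
    # Pacemaker passes the event details in CRM_alert_* environment variables.
    import os
    import syslog

    kind = os.environ.get("CRM_alert_kind", "")   # "resource", "node", "fencing", ...
    task = os.environ.get("CRM_alert_task", "")   # operation, e.g. "start" or "stop"
    rsc  = os.environ.get("CRM_alert_rsc", "")    # resource id, e.g. "pgsqld"
    node = os.environ.get("CRM_alert_node", "")   # node the operation ran on
    rc   = os.environ.get("CRM_alert_rc", "")     # OCF return code, as a string

    if kind == "resource" and task in ("start", "stop"):
        syslog.syslog(syslog.LOG_INFO,
                      "%s of %s on %s finished with rc=%s" % (task, rsc, node, rc))

The script would then be registered with "pcs alert create path=..."; see
the pcs man page for the exact syntax and for adding recipients.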

On 05/12/2017 06:17 AM, Ludovic Vaugeois-Pepin wrote:
> I checked the node_state of the node that is killed and brought back
> (test3): in_ccm == true and crmd == online for a second or two between
> "pcs cluster start test3" and the first "monitor":
> 
>     <node_state id="3" uname="test3" in_ccm="true" crmd="online"
> crm-debug-origin="peer_update_callback" join="member" expected="member">
> 
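> For reference, a minimal sketch of scripting that check (the helper name
> is made up, and it assumes "pcs cluster cib" with no filename prints the
> CIB XML on stdout):
>
>     import subprocess
>     import xml.etree.ElementTree as ET
>
>     def node_state(uname):
>         # Dump the live CIB and pick out the node_state element for one node.
>         out = subprocess.check_output(["pcs", "cluster", "cib"])
>         for ns in ET.fromstring(out).iter("node_state"):
>             if ns.get("uname") == uname:
>                 return {k: ns.get(k)
>                         for k in ("in_ccm", "crmd", "join", "expected")}
>         return None
>
>     print(node_state("test3"))
>
> As noted above, these attributes already read as up for a second or two
> before the first "monitor", so by themselves they are not enough to
> conclude that the resource on test3 is running again.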
> 
> 
> On Fri, May 12, 2017 at 11:27 AM, Ludovic Vaugeois-Pepin
> <ludovicvp at gmail.com> wrote:
> 
>     Yes, I haven't been using the "nodes" element in the XML, only the
>     "resources" element. I couldn't find "node_state" elements or
>     attributes in that XML, but after some searching I found that they
>     are in the CIB, which can be dumped with "pcs cluster cib foo.xml".
>     I will start exploring this as an alternative to crm_mon/"pcs status".
> 
> 
>     However, I still find the behavior confusing, so below I try to
>     explain in more detail what I see:
> 
> 
>     Before "pcs cluster start test3" at 10:45:36.362 (test3 has been HW
>     shutdown a minute ago):
> 
>     crm_mon -1:
> 
>         Stack: corosync
>         Current DC: test1 (version 1.1.15-11.el7_3.4-e174ec8) -
>     partition with quorum
>         Last updated: Fri May 12 10:45:36 2017          Last change: Fri
>     May 12 09:18:13 2017 by root via crm_attribute on test1
> 
>         3 nodes and 4 resources configured
> 
>         Online: [ test1 test2 ]
>         OFFLINE: [ test3 ]
> 
>         Active resources:
> 
>          Master/Slave Set: pgsql-ha [pgsqld]
>              Masters: [ test1 ]
>              Slaves: [ test2 ]
>          pgsql-master-ip        (ocf::heartbeat:IPaddr2):       Started
>     test1
> 
>          
>     crm_mon -X:
> 
>         <resources>
>         <clone id="pgsql-ha" multi_state="true" unique="false"
>     managed="true" failed="false" failure_ignored="false" >
>             <resource id="pgsqld" resource_agent="ocf::heartbeat:pgha"
>     role="Master" active="true" orphaned="false" managed="true"
>     failed="false" failure_ignored="false" nodes_running_on="1" >
>                 <node name="test1" id="1" cached="false"/>
>             </resource>
>             <resource id="pgsqld" resource_agent="ocf::heartbeat:pgha"
>     role="Slave" active="true" orphaned="false" managed="true"
>     failed="false" failure_ignored="false" nodes_running_on="1" >
>                 <node name="test2" id="2" cached="false"/>
>             </resource>
>             <resource id="pgsqld" resource_agent="ocf::heartbeat:pgha"
>     role="Stopped" active="false" orphaned="false" managed="true"
>     failed="false" failure_ignored="false" nodes_running_on="0" />
>         </clone>
>         <resource id="pgsql-master-ip"
>     resource_agent="ocf::heartbeat:IPaddr2" role="Started" active="true"
>     orphaned="false" managed="true" failed="false"
>     failure_ignored="false" nodes_running_on="1" >
>             <node name="test1" id="1" cached="false"/>
>         </resource>
>         </resources>
> 
> 
> 
>     At 10:45:39.440, after "pcs cluster start test3" and before the first
>     "monitor" on test3 (this is where I can't tell that the resources on
>     test3 are actually down):
> 
>     crm_mon -1:
> 
>         Stack: corosync
>         Current DC: test1 (version 1.1.15-11.el7_3.4-e174ec8) -
>     partition with quorum
>         Last updated: Fri May 12 10:45:39 2017          Last change: Fri
>     May 12 10:45:39 2017 by root via crm_attribute on test1
> 
>         3 nodes and 4 resources configured
> 
>         Online: [ test1 test2 test3 ]
> 
>         Active resources:
> 
>          Master/Slave Set: pgsql-ha [pgsqld]
>              Masters: [ test1 ]
>              Slaves: [ test2 test3 ]
>          pgsql-master-ip        (ocf::heartbeat:IPaddr2):       Started
>     test1
> 
> 
>     crm_mon -X:
> 
>         <resources>
>         <clone id="pgsql-ha" multi_state="true" unique="false"
>     managed="true" failed="false" failure_ignored="false" >
>             <resource id="pgsqld" resource_agent="ocf::heartbeat:pgha"
>     role="Master" active="true" orphaned="false" managed="true"
>     failed="false" failure_ignored="false" nodes_running_on="1" >
>                 <node name="test1" id="1" cached="false"/>
>             </resource>
>             <resource id="pgsqld" resource_agent="ocf::heartbeat:pgha"
>     role="Slave" active="true" orphaned="false" managed="true"
>     failed="false" failure_ignored="false" nodes_running_on="1" >
>                 <node name="test2" id="2" cached="false"/>
>             </resource>
>             <resource id="pgsqld" resource_agent="ocf::heartbeat:pgha"
>     role="Slave" active="true" orphaned="false" managed="true"
>     failed="false" failure_ignored="false" nodes_running_on="1" >
>                 <node name="test3" id="3" cached="false"/>
>             </resource>
>         </clone>
>         <resource id="pgsql-master-ip"
>     resource_agent="ocf::heartbeat:IPaddr2" role="Started" active="true"
>     orphaned="false" managed="true" failed="false"
>     failure_ignored="false" nodes_running_on="1" >
>             <node name="test1" id="1" cached="false"/>
>         </resource>
>         </resources>
> 
> 
>         
>     At 10:45:41.606, after the first "monitor" on test3 (I can now tell
>     that the resources on test3 are not ready):
> 
>     crm_mon -1:
> 
>         Stack: corosync
>         Current DC: test1 (version 1.1.15-11.el7_3.4-e174ec8) -
>     partition with quorum
>         Last updated: Fri May 12 10:45:41 2017          Last change: Fri
>     May 12 10:45:39 2017 by root via crm_attribute on test1
> 
>         3 nodes and 4 resources configured
> 
>         Online: [ test1 test2 test3 ]
> 
>         Active resources:
> 
>          Master/Slave Set: pgsql-ha [pgsqld]
>              Masters: [ test1 ]
>              Slaves: [ test2 ]
>          pgsql-master-ip        (ocf::heartbeat:IPaddr2):       Started
>     test1
> 
> 
>     crm_mon -X:
> 
>         <resources>
>         <clone id="pgsql-ha" multi_state="true" unique="false"
>     managed="true" failed="false" failure_ignored="false" >
>             <resource id="pgsqld" resource_agent="ocf::heartbeat:pgha"
>     role="Master" active="true" orphaned="false" managed="true"
>     failed="false" failure_ignored="false" nodes_running_on="1" >
>                 <node name="test1" id="1" cached="false"/>
>             </resource>
>             <resource id="pgsqld" resource_agent="ocf::heartbeat:pgha"
>     role="Slave" active="true" orphaned="false" managed="true"
>     failed="false" failure_ignored="false" nodes_running_on="1" >
>                 <node name="test2" id="2" cached="false"/>
>             </resource>
>             <resource id="pgsqld" resource_agent="ocf::heartbeat:pgha"
>     role="Stopped" active="false" orphaned="false" managed="true"
>     failed="false" failure_ignored="false" nodes_running_on="0" />
>         </clone>
>         <resource id="pgsql-master-ip"
>     resource_agent="ocf::heartbeat:IPaddr2" role="Started" active="true"
>     orphaned="false" managed="true" failed="false"
>     failure_ignored="false" nodes_running_on="1" >
>             <node name="test1" id="1" cached="false"/>
>         </resource>
>         </resources>
> 
>     On Fri, May 12, 2017 at 12:45 AM, Ken Gaillot
>     <kgaillot at redhat.com> wrote:
> 
>         On 05/11/2017 03:00 PM, Ludovic Vaugeois-Pepin wrote:
>         > Hi
>         > I translated a PostgreSQL multi-state RA
>         > (https://github.com/dalibo/PAF) into Python
>         > (https://github.com/ulodciv/deploy_cluster), and I have been
>         > editing it heavily.
>         >
>         > In parallel I am writing unit tests and functional tests.
>         >
>         > I am having an issue with a functional test that abruptly powers
>         > off a slave named, say, "host3" (a hot standby PG instance). Later
>         > on I power the slave back on. Once it is up, I run "pcs cluster
>         > start host3". And this is where I start having a problem.
>         >
>         > Every second I check the output of "pcs status xml" until host3
>         > is said to be ready as a slave again. In the following, I assume
>         > that test3 is ready as a slave:
>         >
>         >     <nodes>
>         >         <node name="test1" id="1" online="true" standby="false"
>         > standby_onfail="false" maintenance="false" pending="false"
>         > unclean="false" shutdown="false" expected_up="true" is_dc="false"
>         > resources_running="2" type="member" />
>         >         <node name="test2" id="2" online="true" standby="false"
>         > standby_onfail="false" maintenance="false" pending="false"
>         > unclean="false" shutdown="false" expected_up="true" is_dc="true"
>         > resources_running="1" type="member" />
>         >         <node name="test3" id="3" online="true" standby="false"
>         > standby_onfail="false" maintenance="false" pending="false"
>         > unclean="false" shutdown="false" expected_up="true" is_dc="false"
>         > resources_running="1" type="member" />
>         >     </nodes>
> 
>         The <nodes> section says nothing about the current state of the
>         nodes.
>         Look at the <node_state> entries for that. in_ccm means the cluster
>         stack level, and crmd means the pacemaker level -- both need to
>         be up.
> 
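>         A rough polling sketch of that combined check (hypothetical helper
>         names; it reads node_state from "pcs cluster cib" and the resource
>         role from "crm_mon -X"):
>
>             import subprocess
>             import time
>             import xml.etree.ElementTree as ET
>
>             def node_up(uname):
>                 # Both levels must be up: in_ccm (cluster stack) and
>                 # crmd (pacemaker).
>                 out = subprocess.check_output(["pcs", "cluster", "cib"])
>                 for ns in ET.fromstring(out).iter("node_state"):
>                     if ns.get("uname") == uname:
>                         return (ns.get("in_ccm") == "true"
>                                 and ns.get("crmd") == "online")
>                 return False
>
>             def role_on_node(rsc_id, uname):
>                 # Role of the clone instance on one node, per crm_mon -X.
>                 out = subprocess.check_output(["crm_mon", "-X"])
>                 for rsc in ET.fromstring(out).iter("resource"):
>                     if rsc.get("id") == rsc_id:
>                         for n in rsc.findall("node"):
>                             if n.get("name") == uname:
>                                 return rsc.get("role")
>                 return None
>
>             def wait_for_slave(rsc_id="pgsqld", uname="test3", timeout=120):
>                 deadline = time.time() + timeout
>                 while time.time() < deadline:
>                     if (node_up(uname)
>                             and role_on_node(rsc_id, uname) == "Slave"):
>                         return True
>                     time.sleep(1)
>                 return False
>
>         Note that, as the timestamps in the follow-up above show, both views
>         can briefly look good right after the node joins and before the
>         first "monitor" runs, which is why the alerts suggested at the top
>         of this message are the more reliable trigger.
>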
>         >     <resources>
>         >         <clone id="pgsql-ha" multi_state="true" unique="false"
>         > managed="true" failed="false" failure_ignored="false" >
>         >             <resource id="pgsqld" resource_agent="ocf::heartbeat:pgha"
>         > role="Slave" active="true" orphaned="false" managed="true"
>         > failed="false" failure_ignored="false" nodes_running_on="1" >
>         >                 <node name="test3" id="3" cached="false"/>
>         >             </resource>
>         >             <resource id="pgsqld" resource_agent="ocf::heartbeat:pgha"
>         > role="Master" active="true" orphaned="false" managed="true"
>         > failed="false" failure_ignored="false" nodes_running_on="1" >
>         >                 <node name="test1" id="1" cached="false"/>
>         >             </resource>
>         >             <resource id="pgsqld" resource_agent="ocf::heartbeat:pgha"
>         > role="Slave" active="true" orphaned="false" managed="true"
>         > failed="false" failure_ignored="false" nodes_running_on="1" >
>         >                 <node name="test2" id="2" cached="false"/>
>         >             </resource>
>         >         </clone>
>         > By ready to go I mean that upon running "pcs cluster start test3", the
>         > following occurs before test3 appears ready in the XML:
>         >
>         > pcs cluster start test3
>         > monitor         -> RA returns unknown error (1)
>         > notify/pre-stop -> RA returns ok (0)
>         > stop            -> RA returns ok (0)
>         > start           -> RA returns ok (0)
>         >
>         > The problem I have is that between "pcs cluster start test3" and
>         > "monitor", it seems that the XML returned by "pcs status xml" says test3
>         > is ready (the XML extract above is what I get at that moment). Once
>         > "monitor" occurs, the returned XML shows test3 to be offline, and not
>         > until the start is finished do I once again have test3 shown as ready.
>         >
>         > Am I getting anything wrong? Is there a simpler or better way to
>         > check if test3 is fully functional again, i.e. that the OCF start
>         > was successful?
>         >
>         > Thanks
>         >
>         > Ludovic



