<div dir="ltr">I will look into adding alerts, thanks for the info. <div><br></div><div>For now I introduced a 5 seconds sleep after "pcs cluster start ...". It seems enough for <span style="font-size:12.8px">monitor to be run.</span></div><div class="gmail_extra"><br><div class="gmail_quote">On Fri, May 12, 2017 at 9:22 PM, Ken Gaillot <span dir="ltr"><<a href="mailto:kgaillot@redhat.com" target="_blank">kgaillot@redhat.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Another possibility you might want to look into is alerts. Pacemaker can<br>
On Fri, May 12, 2017 at 9:22 PM, Ken Gaillot <kgaillot@redhat.com> wrote:

Another possibility you might want to look into is alerts. Pacemaker can
call a script of your choosing whenever a resource is started or
stopped. See:

http://clusterlabs.org/doc/en-US/Pacemaker/1.1-pcs/html-single/Pacemaker_Explained/index.html#idm139683940283296

for the concepts, and the pcs man page for the "pcs alert" interface.
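As a rough, untested illustration: an alert agent is just an executable that Pacemaker runs with the event details exported as CRM_alert_* environment variables (variable names per the documentation above; double-check them for your version). Registered with something like "pcs alert create path=/usr/local/bin/log_alerts.py" (an example path), a minimal Python agent could be:

    #!/usr/bin/env python
    # Minimal Pacemaker alert agent (illustrative sketch only).
    import os
    import time

    def main():
        kind = os.environ.get("CRM_alert_kind", "")   # "node", "resource", "fencing", ...
        if kind != "resource":
            return
        node = os.environ.get("CRM_alert_node", "")
        rsc = os.environ.get("CRM_alert_rsc", "")
        task = os.environ.get("CRM_alert_task", "")   # "start", "stop", "monitor", ...
        rc = os.environ.get("CRM_alert_rc", "")
        # Example log path; a test harness could watch this file to learn
        # when a start or stop has actually happened.
        with open("/var/log/cluster_alerts.log", "a") as f:
            f.write("%s %s %s on %s rc=%s\n" % (time.ctime(), task, rsc, node, rc))

    if __name__ == "__main__":
        main()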

On 05/12/2017 06:17 AM, Ludovic Vaugeois-Pepin wrote:
> I checked the node_state of the node that is killed and brought back
> (test3). in_ccm == true and crmd == online for a second or two between
> "pcs cluster start test3" and "monitor":
>
> <node_state id="3" uname="test3" in_ccm="true" crmd="online"
> crm-debug-origin="peer_update_callback" join="member" expected="member">
>
>
> On Fri, May 12, 2017 at 11:27 AM, Ludovic Vaugeois-Pepin
> <ludovicvp@gmail.com> wrote:
>
> Yes, I haven't been using the "nodes" element in the XML, only the
> "resources" element. I couldn't find "node_state" elements or
> attributes in the XML, so after some searching I found that it is in
> the CIB, which can be gotten with "pcs cluster cib foo.xml". I will
> start exploring this as an alternative to crm_mon/"pcs status".
>
>
> However, I still find what happens confusing, so below I try to
> better explain what I see:
>
>
> Before "pcs cluster start test3" at 10:45:36.362 (test3 was HW
> shut down a minute earlier):
>
> crm_mon -1:
>
> Stack: corosync
> Current DC: test1 (version 1.1.15-11.el7_3.4-e174ec8) - partition with quorum
> Last updated: Fri May 12 10:45:36 2017    Last change: Fri May 12 09:18:13 2017 by root via crm_attribute on test1
>
> 3 nodes and 4 resources configured
>
> Online: [ test1 test2 ]
> OFFLINE: [ test3 ]
>
> Active resources:
>
> Master/Slave Set: pgsql-ha [pgsqld]
> Masters: [ test1 ]
> Slaves: [ test2 ]
> pgsql-master-ip (ocf::heartbeat:IPaddr2): Started test1
>
>
> crm_mon -X:
>
> <resources>
> <clone id="pgsql-ha" multi_state="true" unique="false"
> managed="true" failed="false" failure_ignored="false" >
> <resource id="pgsqld" resource_agent="ocf::heartbeat:pgha"
> role="Master" active="true" orphaned="false" managed="true"
> failed="false" failure_ignored="false" nodes_running_on="1" >
> <node name="test1" id="1" cached="false"/>
> </resource>
> <resource id="pgsqld" resource_agent="ocf::heartbeat:pgha"
> role="Slave" active="true" orphaned="false" managed="true"
> failed="false" failure_ignored="false" nodes_running_on="1" >
> <node name="test2" id="2" cached="false"/>
> </resource>
> <resource id="pgsqld" resource_agent="ocf::heartbeat:pgha"
> role="Stopped" active="false" orphaned="false" managed="true"
> failed="false" failure_ignored="false" nodes_running_on="0" />
> </clone>
> <resource id="pgsql-master-ip"
> resource_agent="ocf::heartbeat:IPaddr2" role="Started" active="true"
> orphaned="false" managed="true" failed="false"
> failure_ignored="false" nodes_running_on="1" >
> <node name="test1" id="1" cached="false"/>
> </resource>
> </resources>
>
>
> At 10:45:39.440, after "pcs cluster start test3" and before the first
> "monitor" on test3 (this is where I can't seem to tell that the
> resources on test3 are down):
>
> crm_mon -1:
>
> Stack: corosync
> Current DC: test1 (version 1.1.15-11.el7_3.4-e174ec8) - partition with quorum
> Last updated: Fri May 12 10:45:39 2017    Last change: Fri May 12 10:45:39 2017 by root via crm_attribute on test1
>
> 3 nodes and 4 resources configured
>
> Online: [ test1 test2 test3 ]
>
> Active resources:
>
> Master/Slave Set: pgsql-ha [pgsqld]
> Masters: [ test1 ]
> Slaves: [ test2 test3 ]
> pgsql-master-ip (ocf::heartbeat:IPaddr2): Started test1
>
>
> crm_mon -X:
>
> <resources>
> <clone id="pgsql-ha" multi_state="true" unique="false"
> managed="true" failed="false" failure_ignored="false" >
> <resource id="pgsqld" resource_agent="ocf::heartbeat:pgha"
> role="Master" active="true" orphaned="false" managed="true"
> failed="false" failure_ignored="false" nodes_running_on="1" >
> <node name="test1" id="1" cached="false"/>
> </resource>
> <resource id="pgsqld" resource_agent="ocf::heartbeat:pgha"
> role="Slave" active="true" orphaned="false" managed="true"
> failed="false" failure_ignored="false" nodes_running_on="1" >
> <node name="test2" id="2" cached="false"/>
> </resource>
> <resource id="pgsqld" resource_agent="ocf::heartbeat:pgha"
> role="Slave" active="true" orphaned="false" managed="true"
> failed="false" failure_ignored="false" nodes_running_on="1" >
> <node name="test3" id="3" cached="false"/>
> </resource>
> </clone>
> <resource id="pgsql-master-ip"
> resource_agent="ocf::heartbeat:IPaddr2" role="Started" active="true"
> orphaned="false" managed="true" failed="false"
> failure_ignored="false" nodes_running_on="1" >
> <node name="test1" id="1" cached="false"/>
> </resource>
> </resources>
>
>
> At 10:45:41.606, after first "monitor" on test3 (I can now tell the
> resources on test3 are not ready):
>
> crm_mon -1:
>
> Stack: corosync
> Current DC: test1 (version 1.1.15-11.el7_3.4-e174ec8) - partition with quorum
> Last updated: Fri May 12 10:45:41 2017    Last change: Fri May 12 10:45:39 2017 by root via crm_attribute on test1
>
> 3 nodes and 4 resources configured
>
> Online: [ test1 test2 test3 ]
>
> Active resources:
>
> Master/Slave Set: pgsql-ha [pgsqld]
> Masters: [ test1 ]
> Slaves: [ test2 ]
> pgsql-master-ip (ocf::heartbeat:IPaddr2): Started test1
>
>
> crm_mon -X:
>
> <resources>
> <clone id="pgsql-ha" multi_state="true" unique="false"
> managed="true" failed="false" failure_ignored="false" >
> <resource id="pgsqld" resource_agent="ocf::heartbeat:pgha"
> role="Master" active="true" orphaned="false" managed="true"
> failed="false" failure_ignored="false" nodes_running_on="1" >
> <node name="test1" id="1" cached="false"/>
> </resource>
> <resource id="pgsqld" resource_agent="ocf::heartbeat:pgha"
> role="Slave" active="true" orphaned="false" managed="true"
> failed="false" failure_ignored="false" nodes_running_on="1" >
> <node name="test2" id="2" cached="false"/>
> </resource>
> <resource id="pgsqld" resource_agent="ocf::heartbeat:pgha"
> role="Stopped" active="false" orphaned="false" managed="true"
> failed="false" failure_ignored="false" nodes_running_on="0" />
> </clone>
> <resource id="pgsql-master-ip"
> resource_agent="ocf::heartbeat:IPaddr2" role="Started" active="true"
> orphaned="false" managed="true" failed="false"
> failure_ignored="false" nodes_running_on="1" >
> <node name="test1" id="1" cached="false"/>
> </resource>
> </resources>
>
> On Fri, May 12, 2017 at 12:45 AM, Ken Gaillot <kgaillot@redhat.com> wrote:
>
> On 05/11/2017 03:00 PM, Ludovic Vaugeois-Pepin wrote:
> > Hi
> > I translated a Postgresql multi-state RA
> > (https://github.com/dalibo/PAF) into Python
> > (https://github.com/ulodciv/deploy_cluster), and I have been editing it
> > heavily.
> >
> > In parallel I am writing unit tests and functional tests.
> >
> > I am having an issue with a functional test that abruptly powers off a
> > slave named, say, "test3" (a hot standby PG instance). Later on I start
> > the slave back up. Once it is started, I run "pcs cluster start test3".
> > And this is where I start having a problem.
> >
> > I check every second the output of "pcs status xml" until test3 is said
> > to be ready as a slave again. In the following I assume that test3 is
> > ready as a slave:
> ><br>
> > <nodes><br>
> > <node name="test1" id="1" online="true" standby="false"<br>
> > standby_onfail="false" maintenance="false" pending="false"<br>
> > unclean="false" shutdown="false" expected_up="true" is_dc="false"<br>
> > resources_running="2" type="member" /><br>
> > <node name="test2" id="2" online="true" standby="false"<br>
> > standby_onfail="false" maintenance="false" pending="false"<br>
> > unclean="false" shutdown="false" expected_up="true" is_dc="true"<br>
> > resources_running="1" type="member" /><br>
> > <node name="test3" id="3" online="true" standby="false"<br>
> > standby_onfail="false" maintenance="false" pending="false"<br>
> > unclean="false" shutdown="false" expected_up="true" is_dc="false"<br>
> > resources_running="1" type="member" /><br>
> > </nodes><br>
>
> The <nodes> section says nothing about the current state of the nodes.
> Look at the <node_state> entries for that. in_ccm means the cluster
> stack level, and crmd means the pacemaker level -- both need to be up.
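> A minimal, untested sketch of reading those two attributes from the CIB
> (node_state, uname, in_ccm and crmd as they appear in the CIB status
> section):
>
>     import subprocess
>     import xml.etree.ElementTree as ET
>
>     def node_is_up(node_name):
>         # Both the cluster stack level (in_ccm) and the pacemaker level
>         # (crmd) have to report the node as up.
>         cib = ET.fromstring(subprocess.check_output(["pcs", "cluster", "cib"]))
>         state = cib.find(".//node_state[@uname='%s']" % node_name)
>         return (state is not None
>                 and state.get("in_ccm") == "true"
>                 and state.get("crmd") == "online")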
>
> > <resources>
> > <clone id="pgsql-ha" multi_state="true" unique="false"
> > managed="true" failed="false" failure_ignored="false" >
> > <resource id="pgsqld" resource_agent="ocf::heartbeat:pgha"
> > role="Slave" active="true" orphaned="false" managed="true"
> > failed="false" failure_ignored="false" nodes_running_on="1" >
> > <node name="test3" id="3" cached="false"/>
> > </resource>
> > <resource id="pgsqld" resource_agent="ocf::heartbeat:pgha"
> > role="Master" active="true" orphaned="false" managed="true"
> > failed="false" failure_ignored="false" nodes_running_on="1" >
> > <node name="test1" id="1" cached="false"/>
> > </resource>
> > <resource id="pgsqld" resource_agent="ocf::heartbeat:pgha"
> > role="Slave" active="true" orphaned="false" managed="true"
> > failed="false" failure_ignored="false" nodes_running_on="1" >
> > <node name="test2" id="2" cached="false"/>
> > </resource>
> > </clone>
> >
> > By ready to go I mean that upon running "pcs cluster start test3", the
> > following occurs before test3 appears ready in the XML:
> >
> > pcs cluster start test3
> > monitor -> RA returns unknown error (1)
> > notify/pre-stop -> RA returns ok (0)
> > stop -> RA returns ok (0)
> > start -> RA returns ok (0)
> >
> > The problem I have is that between "pcs cluster start test3" and
> > "monitor", it seems that the XML returned by "pcs status xml" says test3
> > is ready (the XML extract above is what I get at that moment). Once
> > "monitor" occurs, the returned XML shows test3 to be offline, and not
> > until the start is finished do I once again have test3 shown as ready.
> >
> > Am I getting anything wrong? Is there a simpler or better way to check
> > if test3 is fully functional again, i.e. that the OCF start was successful?
> >
> > Thanks
> >
> > Ludovic

--
Ludovic Vaugeois-Pepin