[ClusterLabs] How to check if a resource on a cluster node is really back on after a crash
Ludovic Vaugeois-Pepin
ludovicvp at gmail.com
Mon May 15 05:26:54 EDT 2017
I will look into adding alerts, thanks for the info.
For now I introduced a 5-second sleep after "pcs cluster start ...". It
seems to be enough time for the monitor operation to run.
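
For reference, the workaround amounts to something like the following in a
Python test harness (a minimal sketch with my own naming; only the
"pcs cluster start" call and the 5-second pause come from the message above):

    import subprocess
    import time

    def start_node_with_grace(node: str, grace: float = 5.0) -> None:
        """Start the cluster on a node, then pause long enough for the first
        monitor operation to run before trusting "pcs status xml" again."""
        subprocess.run(["pcs", "cluster", "start", node], check=True)
        # Without this pause, the status XML briefly reports the resource on
        # the rejoining node as a running Slave before the first probe/monitor
        # has actually been executed.
        time.sleep(grace)

    start_node_with_grace("test3")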
On Fri, May 12, 2017 at 9:22 PM, Ken Gaillot <kgaillot at redhat.com> wrote:
> Another possibility you might want to look into is alerts. Pacemaker can
> call a script of your choosing whenever a resource is started or
> stopped. See:
>
> http://clusterlabs.org/doc/en-US/Pacemaker/1.1-pcs/html-single/Pacemaker_Explained/index.html#idm139683940283296
>
> for the concepts, and the pcs man page for the "pcs alert" interface.
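
As a rough illustration of the alert mechanism (a hedged sketch, not taken
from this thread): an alert agent is just an executable that Pacemaker invokes
with the event details in CRM_alert_* environment variables. A minimal Python
agent that logs resource start/stop events could look like the following; the
file paths are illustrative, and it would be registered with something like
"pcs alert create path=/usr/local/bin/pcmk_alert.py" followed by
"pcs alert recipient add <alert-id> value=/var/log/pcmk_alerts.log".

    #!/usr/bin/env python3
    """Minimal Pacemaker alert agent (illustrative sketch).

    Pacemaker passes event details in CRM_alert_* environment variables; the
    recipient configured with "pcs alert recipient add" arrives as
    CRM_alert_recipient and is used here as a log file path.
    """
    import os

    def main() -> None:
        # Only react to resource events; other kinds are "node" and "fencing".
        if os.environ.get("CRM_alert_kind") != "resource":
            return
        line = "{ts} {node}: {rsc} {task} rc={rc}\n".format(
            ts=os.environ.get("CRM_alert_timestamp", ""),
            node=os.environ.get("CRM_alert_node", ""),
            rsc=os.environ.get("CRM_alert_rsc", ""),
            task=os.environ.get("CRM_alert_task", ""),
            rc=os.environ.get("CRM_alert_rc", ""),
        )
        logfile = os.environ.get("CRM_alert_recipient", "/tmp/pcmk_alerts.log")
        with open(logfile, "a") as f:
            f.write(line)

    if __name__ == "__main__":
        main()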
>
> On 05/12/2017 06:17 AM, Ludovic Vaugeois-Pepin wrote:
> > I checked the node_state of the node that is killed and brought back
> > (test3): in_ccm == true and crmd == online for a second or two between
> > "pcs cluster start test3" and the first "monitor":
> >
> > <node_state id="3" uname="test3" in_ccm="true" crmd="online"
> > crm-debug-origin="peer_update_callback" join="member" expected="member">
> >
> >
> >
> > On Fri, May 12, 2017 at 11:27 AM, Ludovic Vaugeois-Pepin
> > <ludovicvp at gmail.com> wrote:
> >
> > Yes, I haven't been using the "nodes" element in the XML, only the
> > "resources" element. I couldn't find "node_state" elements or
> > attributes in that XML, so after some searching I found that they are in
> > the CIB, which can be obtained with "pcs cluster cib foo.xml". I will
> > start exploring this as an alternative to crm_mon/"pcs status".
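
A minimal sketch of what that check could look like (assuming "pcs cluster cib"
with no filename prints the CIB XML to stdout, and the node_state attributes
quoted earlier in the thread). As noted above, in_ccm/crmd can already report
the node as up a second or two before the first monitor runs, so this would be
combined with the resource status check rather than replace it:

    import subprocess
    import xml.etree.ElementTree as ET

    def node_is_up(node_name: str) -> bool:
        """True once both the cluster stack (in_ccm) and the Pacemaker level
        (crmd) report the node as joined, per its node_state entry."""
        cib_xml = subprocess.run(
            ["pcs", "cluster", "cib"], check=True, capture_output=True, text=True
        ).stdout
        for state in ET.fromstring(cib_xml).iter("node_state"):
            if state.get("uname") == node_name:
                return (state.get("in_ccm") == "true"
                        and state.get("crmd") == "online")
        return False

    print(node_is_up("test3"))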
> >
> >
> > However, I still find what happens confusing, so below I try to
> > explain more precisely what I see:
> >
> >
> > Before "pcs cluster start test3" at 10:45:36.362 (test3 has been HW
> > shutdown a minute ago):
> >
> > crm_mon -1:
> >
> > Stack: corosync
> > Current DC: test1 (version 1.1.15-11.el7_3.4-e174ec8) - partition with quorum
> > Last updated: Fri May 12 10:45:36 2017
> > Last change: Fri May 12 09:18:13 2017 by root via crm_attribute on test1
> >
> > 3 nodes and 4 resources configured
> >
> > Online: [ test1 test2 ]
> > OFFLINE: [ test3 ]
> >
> > Active resources:
> >
> > Master/Slave Set: pgsql-ha [pgsqld]
> > Masters: [ test1 ]
> > Slaves: [ test2 ]
> > pgsql-master-ip (ocf::heartbeat:IPaddr2): Started test1
> >
> >
> > crm_mon -X:
> >
> > <resources>
> > <clone id="pgsql-ha" multi_state="true" unique="false"
> > managed="true" failed="false" failure_ignored="false" >
> > <resource id="pgsqld" resource_agent="ocf::heartbeat:pgha"
> > role="Master" active="true" orphaned="false" managed="true" f
> > ailed="false" failure_ignored="false" nodes_running_on="1" >
> > <node name="test1" id="1" cached="false"/>
> > </resource>
> > <resource id="pgsqld" resource_agent="ocf::heartbeat:pgha"
> > role="Slave" active="true" orphaned="false" managed="true" fa
> > iled="false" failure_ignored="false" nodes_running_on="1" >
> > <node name="test2" id="2" cached="false"/>
> > </resource>
> > <resource id="pgsqld" resource_agent="ocf::heartbeat:pgha"
> > role="Stopped" active="false" orphaned="false" managed="true"
> > failed="false" failure_ignored="false" nodes_running_on="0" />
> > </clone>
> > <resource id="pgsql-master-ip"
> > resource_agent="ocf::heartbeat:IPaddr2" role="Started" active="true"
> > orphaned="false" managed
> > ="true" failed="false" failure_ignored="false"
> > nodes_running_on="1" >
> > <node name="test1" id="1" cached="false"/>
> > </resource>
> > </resources>
> >
> >
> >
> > At 10:45:39.440, after "pcs cluster start test3" and before the first
> > "monitor" on test3 (this is where I cannot seem to tell that the
> > resources on test3 are down):
> >
> > crm_mon -1:
> >
> > Stack: corosync
> > Current DC: test1 (version 1.1.15-11.el7_3.4-e174ec8) - partition with quorum
> > Last updated: Fri May 12 10:45:39 2017
> > Last change: Fri May 12 10:45:39 2017 by root via crm_attribute on test1
> >
> > 3 nodes and 4 resources configured
> >
> > Online: [ test1 test2 test3 ]
> >
> > Active resources:
> >
> > Master/Slave Set: pgsql-ha [pgsqld]
> > Masters: [ test1 ]
> > Slaves: [ test2 test3 ]
> > pgsql-master-ip (ocf::heartbeat:IPaddr2): Started test1
> >
> >
> > crm_mon -X:
> >
> > <resources>
> > <clone id="pgsql-ha" multi_state="true" unique="false"
> > managed="true" failed="false" failure_ignored="false" >
> > <resource id="pgsqld" resource_agent="ocf::heartbeat:pgha"
> > role="Master" active="true" orphaned="false" managed="true"
> > failed="false" failure_ignored="false" nodes_running_on="1" >
> > <node name="test1" id="1" cached="false"/>
> > </resource>
> > <resource id="pgsqld" resource_agent="ocf::heartbeat:pgha"
> > role="Slave" active="true" orphaned="false" managed="true"
> > failed="false" failure_ignored="false" nodes_running_on="1" >
> > <node name="test2" id="2" cached="false"/>
> > </resource>
> > <resource id="pgsqld" resource_agent="ocf::heartbeat:pgha"
> > role="Slave" active="true" orphaned="false" managed="true"
> > failed="false" failure_ignored="false" nodes_running_on="1" >
> > <node name="test3" id="3" cached="false"/>
> > </resource>
> > </clone>
> > <resource id="pgsql-master-ip"
> > resource_agent="ocf::heartbeat:IPaddr2" role="Started" active="true"
> > orphaned="false" managed="true" failed="false"
> > failure_ignored="false" nodes_running_on="1" >
> > <node name="test1" id="1" cached="false"/>
> > </resource>
> > </resources>
> >
> >
> >
> > At 10:45:41.606, after the first "monitor" on test3 (I can now tell that
> > the resources on test3 are not ready):
> >
> > crm_mon -1:
> >
> > Stack: corosync
> > Current DC: test1 (version 1.1.15-11.el7_3.4-e174ec8) - partition with quorum
> > Last updated: Fri May 12 10:45:41 2017
> > Last change: Fri May 12 10:45:39 2017 by root via crm_attribute on test1
> >
> > 3 nodes and 4 resources configured
> >
> > Online: [ test1 test2 test3 ]
> >
> > Active resources:
> >
> > Master/Slave Set: pgsql-ha [pgsqld]
> > Masters: [ test1 ]
> > Slaves: [ test2 ]
> > pgsql-master-ip (ocf::heartbeat:IPaddr2): Started test1
> >
> >
> > crm_mon -X:
> >
> > <resources>
> > <clone id="pgsql-ha" multi_state="true" unique="false"
> > managed="true" failed="false" failure_ignored="false" >
> > <resource id="pgsqld" resource_agent="ocf::heartbeat:pgha"
> > role="Master" active="true" orphaned="false" managed="true"
> > failed="false" failure_ignored="false" nodes_running_on="1" >
> > <node name="test1" id="1" cached="false"/>
> > </resource>
> > <resource id="pgsqld" resource_agent="ocf::heartbeat:pgha"
> > role="Slave" active="true" orphaned="false" managed="true"
> > failed="false" failure_ignored="false" nodes_running_on="1" >
> > <node name="test2" id="2" cached="false"/>
> > </resource>
> > <resource id="pgsqld" resource_agent="ocf::heartbeat:pgha"
> > role="Stopped" active="false" orphaned="false" managed="true"
> > failed="false" failure_ignored="false" nodes_running_on="0" />
> > </clone>
> > <resource id="pgsql-master-ip"
> > resource_agent="ocf::heartbeat:IPaddr2" role="Started" active="true"
> > orphaned="false" managed="true" failed="false"
> > failure_ignored="false" nodes_running_on="1" >
> > <node name="test1" id="1" cached="false"/>
> > </resource>
> > </resources>
> >
> > On Fri, May 12, 2017 at 12:45 AM, Ken Gaillot <kgaillot at redhat.com> wrote:
> >
> > On 05/11/2017 03:00 PM, Ludovic Vaugeois-Pepin wrote:
> > > Hi
> > > I translated a PostgreSQL multi-state RA
> > > (https://github.com/dalibo/PAF) into Python
> > > (https://github.com/ulodciv/deploy_cluster), and I have been editing it
> > > heavily.
> > >
> > > In parallel I am writing unit tests and functional tests.
> > >
> > > I am having an issue with a functional test that abruptly powers off a
> > > slave named "host3" (a hot standby PG instance). Later on I power the
> > > slave back on. Once it is up, I run "pcs cluster start host3". And
> > > this is where I start having a problem.
> > >
> > > I check the output of "pcs status xml" every second until host3 is said
> > > to be ready as a slave again. In the following I assume that test3 is
> > > ready as a slave:
> > >
> > > <nodes>
> > > <node name="test1" id="1" online="true" standby="false"
> > > standby_onfail="false" maintenance="false" pending="false"
> > > unclean="false" shutdown="false" expected_up="true" is_dc="false"
> > > resources_running="2" type="member" />
> > > <node name="test2" id="2" online="true" standby="false"
> > > standby_onfail="false" maintenance="false" pending="false"
> > > unclean="false" shutdown="false" expected_up="true" is_dc="true"
> > > resources_running="1" type="member" />
> > > <node name="test3" id="3" online="true" standby="false"
> > > standby_onfail="false" maintenance="false" pending="false"
> > > unclean="false" shutdown="false" expected_up="true" is_dc="false"
> > > resources_running="1" type="member" />
> > > </nodes>
> >
> > The <nodes> section says nothing about the current state of the nodes.
> > Look at the <node_state> entries for that. in_ccm means the cluster
> > stack level, and crmd means the pacemaker level -- both need to be up.
> >
> > > <resources>
> > > <clone id="pgsql-ha" multi_state="true" unique="false"
> > > managed="true" failed="false" failure_ignored="false" >
> > > <resource id="pgsqld" resource_agent="ocf::heartbeat:pgha"
> > > role="Slave" active="true" orphaned="false" managed="true"
> > > failed="false" failure_ignored="false" nodes_running_on="1" >
> > > <node name="test3" id="3" cached="false"/>
> > > </resource>
> > > <resource id="pgsqld" resource_agent="ocf::heartbeat:pgha"
> > > role="Master" active="true" orphaned="false" managed="true"
> > > failed="false" failure_ignored="false" nodes_running_on="1" >
> > > <node name="test1" id="1" cached="false"/>
> > > </resource>
> > > <resource id="pgsqld" resource_agent="ocf::heartbeat:pgha"
> > > role="Slave" active="true" orphaned="false" managed="true"
> > > failed="false" failure_ignored="false" nodes_running_on="1" >
> > > <node name="test2" id="2" cached="false"/>
> > > </resource>
> > > </clone>
> > > By "ready" I mean that upon running "pcs cluster start test3", the
> > > following occurs before test3 appears ready in the XML:
> > >
> > > pcs cluster start test3
> > > monitor -> RA returns unknown error (1)
> > > notify/pre-stop -> RA returns ok (0)
> > > stop -> RA returns ok (0)
> > > start -> RA returns ok (0)
> > >
> > > The problem I have is that between "pcs cluster start test3" and
> > > "monitor", it seems that the XML returned by "pcs status xml" says test3
> > > is ready (the XML extract above is what I get at that moment). Once
> > > "monitor" occurs, the returned XML shows test3 to be offline, and not
> > > until the start is finished do I once again see test3 shown as ready.
> > >
> > > Am I getting anything wrong? Is there a simpler or better way to check
> > > if test3 is fully functional again, i.e. that the OCF start was
> > > successful?
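
One possible way to tell (a hedged sketch based on the usual layout of the CIB
status section, not something confirmed in this thread): each node_state entry
also records the resource operation history under <lrm>, so a test could wait
until a successful start of pgsqld (rc-code 0) has been recorded for test3:

    import subprocess
    import xml.etree.ElementTree as ET

    def resource_started_on(node_name: str, rsc_id: str) -> bool:
        """True once a successful start (rc-code 0) of rsc_id is recorded in
        node_name's operation history (node_state > lrm > lrm_resource >
        lrm_rsc_op in the CIB status section)."""
        cib_xml = subprocess.run(
            ["pcs", "cluster", "cib"], check=True, capture_output=True, text=True
        ).stdout
        for state in ET.fromstring(cib_xml).iter("node_state"):
            if state.get("uname") != node_name:
                continue
            for rsc in state.iter("lrm_resource"):
                # Clone instances may carry a suffix such as "pgsqld:1".
                if not rsc.get("id", "").startswith(rsc_id):
                    continue
                for op in rsc.iter("lrm_rsc_op"):
                    if op.get("operation") == "start" and op.get("rc-code") == "0":
                        return True
        return False

    print(resource_started_on("test3", "pgsqld"))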
> > >
> > > Thanks
> > >
> > > Ludovic
>
--
Ludovic Vaugeois-Pepin