[Pacemaker] Question on ILO stonith resource config and restarting

Wed Nov 5 05:46:06 EST 2008

Hi,

On Tue, Nov 04, 2008 at 02:20:42PM -0500, Aaron Bush wrote:
> Thanks for taking a look into this more.
> 
> I have pulled down the 'tip' version of Linux-HA and copied over the new
> ./lib/plugins/stonith/external/riloe into the system install path (did a
> diff and there are significant changes).
> Rebooted both nodes in this cluster.
> Started same test again... Node 1 loses primary network connection to
> LAN, thereby not able to get status or connect to the Stonith device
> (ILO) for Node 2.
> 
> The monitor process for the riloe appears to timeout and it is still
> downhill from there (here are log entries from Node1 who lost the
> network connection):
> 
> 
> Nov  4 13:25:28 wwwlb01 kernel: bnx2: eth0 NIC Copper Link is Down
> Nov  4 13:25:58 wwwlb01 lrmd: [8224]: WARN: cl_stonith_lb02:0:monitor
> process (PID 9213) timed out (try 1).  Killing with signal SIGTERM (15).
> Nov  4 13:25:58 wwwlb01 lrmd: [9213]: ERROR: stonithd_receive_ops_result
> failed.

This has been fixed; the fix is included in pacemaker 1.0, though
it makes no difference here.

> Nov  4 13:25:58 wwwlb01 lrmd: [8224]: WARN: mapped the invalid return
> code 254.
> Nov  4 13:25:58 wwwlb01 crmd: [8227]: info: process_lrm_event: LRM
> operation cl_stonith_lb02:0_monitor_30000 (call=10, rc=1) complete
> ...
> Nov  4 13:25:59 wwwlb01 crmd: [8227]: info: do_lrm_rsc_op: Performing
> op=cl_stonith_lb02:0_stop_0
> key=5:3:0:1eb0bdb2-c828-4b6d-b712-cf7049c775df)
> Nov  4 13:25:59 wwwlb01 lrmd: [8224]: info: rsc:cl_stonith_lb02:0: stop
> ...
> Nov  4 13:25:59 wwwlb01 lrmd: [9898]: info: Try to stop STONITH resource
> <rsc_id=cl_stonith_lb02:0> : Device=external/riloe
> ...
> Nov  4 13:26:00 wwwlb01 crmd: [8227]: info: process_lrm_event: LRM
> operation cl_stonith_lb02:0_monitor_30000 (call=10, rc=-2) Cancelled
> Nov  4 13:26:00 wwwlb01 crmd: [8227]: info: process_lrm_event: LRM
> operation cl_stonith_lb02:0_stop_0 (call=12, rc=0) complete
> Nov  4 13:26:01 wwwlb01 crmd: [8227]: info: do_lrm_rsc_op: Performing
> op=cl_stonith_lb02:0_start_0
> key=19:3:0:1eb0bdb2-c828-4b6d-b712-cf7049c775df)
> Nov  4 13:26:01 wwwlb01 lrmd: [8224]: info: rsc:cl_stonith_lb02:0: start
> Nov  4 13:26:01 wwwlb01 lrmd: [9902]: info: Try to start STONITH
> resource <rsc_id=cl_stonith_lb02:0> : Device=external/riloe
> Nov  4 13:26:01 wwwlb01 stonithd: [8225]: info: Cannot get parameter
> ilo_can_reset from StonithNVpair
> Nov  4 13:26:01 wwwlb01 stonithd: [8225]: info: Cannot get parameter
> ilo_protocol from StonithNVpair
> Nov  4 13:26:01 wwwlb01 stonithd: [8225]: info: Cannot get parameter
> ilo_powerdown_method from StonithNVpair
> ...
> Nov  4 13:26:13 wwwlb01 stonithd: [9904]: info: external_run_cmd:
> Calling '/usr/lib64/stonith/plugins/external/riloe status' returned 256
> Nov  4 13:26:13 wwwlb01 stonithd: [8225]: WARN: start cl_stonith_lb02:0
> failed, because its hostlist is empty
> Nov  4 13:26:13 wwwlb01 crmd: [8227]: info: process_lrm_event: LRM
> operation cl_stonith_lb02:0_start_0 (call=13, rc=1) complete
> Nov  4 13:26:14 wwwlb01 crmd: [8227]: info: do_lrm_rsc_op: Performing
> op=cl_stonith_lb02:0_stop_0
> key=4:4:0:1eb0bdb2-c828-4b6d-b712-cf7049c775df)
> Nov  4 13:26:14 wwwlb01 lrmd: [8224]: info: rsc:cl_stonith_lb02:0: stop
> Nov  4 13:26:14 wwwlb01 lrmd: [9917]: info: Try to stop STONITH resource
> <rsc_id=cl_stonith_lb02:0> : Device=external/riloe
> Nov  4 13:26:14 wwwlb01 stonithd: [8225]: notice: try to stop a resource
> cl_stonith_lb02:0 who is not in started resource queue.
> Nov  4 13:26:14 wwwlb01 crmd: [8227]: info: process_lrm_event: LRM
> operation cl_stonith_lb02:0_stop_0 (call=14, rc=0) complete
> Nov  4 13:26:19 wwwlb01 cib: [8223]: info: cib_stats: Processed 44
> operations (3409.00us average, 0% utilization) in the last 10min
> Nov  4 13:27:34 wwwlb01 kernel: bnx2: eth0 NIC Copper Link is Up, 100
> Mbps full duplex
> Nov  4 13:27:35 wwwlb01 heartbeat: [5969]: info: Link
> wwwlb02.microcenter.com:eth0 up.
> 
> In playing with the riloe python script I assume that the call to
> HTTPSConnection is hanging and then being later killed by lrmd.

BTW, did you try to test your ILO device with the stonith
program? Use -d to get debugging output.

> It
> looks like Python 2.6 added a timeout argument to the HTTPSConnection
> call.  The system is running 2.4.3 so I couldn't test it.  I do see that
> the socket timeout can be set like this:
> 	socket.setdefaulttimeout(1)
> I will follow this up by saying that my Python skills are very rusty.

I'd prefer to have the upper layer (stonithd) handle the timeout.
Why do you think that this would help?
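
For illustration, here is a minimal sketch of what a socket-level default
timeout does to a connection whose peer never answers. It assumes Python 3
(http.client; on the 2.4 system mentioned above the module would be httplib)
and uses plain HTTP against a local listener as a stand-in for the ILO's
HTTPS endpoint. The names probe and silent_listener are hypothetical and
are not part of riloe.

```python
import http.client
import socket


def silent_listener():
    """Open a local TCP socket that queues connections but never replies."""
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    srv.listen(1)
    return srv


def probe(host, port, timeout_s):
    """Return True if a response arrives, False if the socket timeout fires."""
    previous = socket.getdefaulttimeout()
    socket.setdefaulttimeout(timeout_s)  # applies to sockets created from now on
    conn = http.client.HTTPConnection(host, port)
    try:
        conn.request("GET", "/")
        conn.getresponse()  # blocks until data arrives or the timeout expires
        return True
    except socket.timeout:
        return False
    finally:
        conn.close()
        socket.setdefaulttimeout(previous)  # restore the process-wide default
```

Without the timeout the same call would simply block until something above
it gives up, which is consistent with lrmd killing the monitor in the log.
Whether failing early inside the plugin is better than letting stonithd
time out is the open question here.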

> I am trying to find out what the expected behavior should be for a
> timeout on a start or monitor command.

A timeout on start is actually a timeout on monitor. Every
stonith start includes a monitor operation. Otherwise, start
should've been named "enable" for stonith resources.
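
As a rough model of that statement (not stonithd's actual code): start can
be thought of as "enable, then run one status check", so a failing or hung
status check makes start fail too. The sketch below also assumes that the
"returned 256" in the log is a raw wait()-style status word, which would
decode to exit code 1. The function names are hypothetical.

```python
import os


def exit_code_from_wait_status(raw_status):
    """Decode a wait()-style status word into an exit code (POSIX only)."""
    if os.WIFEXITED(raw_status):
        return os.WEXITSTATUS(raw_status)
    return None  # the child was killed by a signal instead of exiting


def start_stonith(run_status_check):
    """Hypothetical model: start succeeds only if the status check exits 0."""
    raw = run_status_check()  # stands in for running "<plugin> status"
    return exit_code_from_wait_status(raw) == 0
```

Under these assumptions, start_stonith(lambda: 256) fails in the same way
the start operation does in the log above.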

> Should Stonith agents follow the
> OCF resource agent specs?

OCF class != stonith class.

If your stonith device is ok and you can use it with the stonith
program successfully, then please file a bugzilla and attach a
hb_report generated report.

Thanks,

Dejan

> Thanks,
> -ab
> 
> -----Original Message-----
> From: pacemaker-bounces at clusterlabs.org
> [mailto:pacemaker-bounces at clusterlabs.org] On Behalf Of Dejan
> Muhamedagic
> Sent: Tuesday, November 04, 2008 11:26 AM
> To: The Pacemaker cluster resource manager
> Subject: Re: [Pacemaker] Question on ILO stonith resource config and
> restarting
> 
> On Thu, Oct 30, 2008 at 03:07:24PM -0400, Aaron Bush wrote:
> > Just realized that I only included the log entries from the node that
> > was not experiencing a network disconnect.  Attached are the log entries
> > from the node (01) that had the stonith resource running before the
> > cable disconnect and looks like they provide some more useful
> > information.  Also included up through when the network cable was
> > reconnected.
> 
> The monitor operation on riloe failed. You should definitely
> upgrade.
> 
> Thanks,
> 
> Dejan
> 
> > 
> > -ab
> > 
> > >> I have a 0.6 pacemaker/heartbeat cluster setup in a lab with resources
> > >> as follows:
> > >> 
> > >> Group-lvs(ordered): two primitives -> ocf/IPaddr2 and ocf/ldirectord.
> > >> Clone-pingd: set to monitor a couple of IPs and used to set a weight for
> > >> where to run the LVS group.
> > >> 
> > >> -- This is the area that I have a question on --
> > >> Clone-stonith-node1: HP ILO to shoot node1
> > >> Clone-stonith-node2: HP ILO to shoot node2
> > >> 
> > >> I read on the old linux-ha site that using a clone for ILO/stonith was
> > >> the way to go.  I'm not sure I see how this would work correctly and be
> > >> preferred over a standard resource.  What I am confused about is this:
> > >> the external/riloe stonith plugin only knows how to shoot one node so
> > >
> > >Please make sure that you use the latest edition of
> > >external/riloe. The previous one didn't work under all
> > >circumstances.
> > 
> > I am using the version that came with heartbeat-common-2.99.0-3.1
> > (according to rpm -qf).
> > 
> > To clear my current issue where the stonith resource was not started
> > (and since this is still in the lab) I have rebooted both nodes to start
> > with a somewhat clean slate.  I have attempted to grab some more useful
> > information from the logs on why the resource is not restarting.
> > Again I disconnect the LAN cable connecting a node to the rest of the
> > network (a private HB channel is still available and the ILO is still
> > up).  I noticed these entries in the log:
> > 
> > Oct 30 13:33:07 wwwlb02 crmd: [6415]: info: do_lrm_rsc_op: Performing
> > op=cl_stonith_lb02:0_start_0
> > key=18:7:0:efbdb124-d51a-4228-80bc-7a9464d7971a)
> > Oct 30 13:33:07 wwwlb02 lrmd: [6412]: info: rsc:cl_stonith_lb02:0:
> start
> > Oct 30 13:33:07 wwwlb02 lrmd: [30788]: info: Try to start STONITH
> > resource <rsc_id=cl_stonith_lb02:0> : Device=external/riloe
> > Oct 30 13:33:07 wwwlb02 stonithd: [6413]: info: Cannot get parameter
> > ilo_can_reset from StonithNVpair
> > Oct 30 13:33:07 wwwlb02 stonithd: [6413]: info: Cannot get parameter
> > ilo_protocol from StonithNVpair
> > Oct 30 13:33:07 wwwlb02 stonithd: [6413]: info: Cannot get parameter
> > ilo_powerdown_method from StonithNVpair
> > Oct 30 13:33:08 wwwlb02 heartbeat: [6202]: info: Link
> > wwwlb01.microcenter.com:eth0 dead.
> > Oct 30 13:33:08 wwwlb02 pingd: [8475]: notice: pingd_lstatus_callback:
> > Status update: Ping node wwwlb01.microcenter.com now has status [dead]
> > Oct 30 13:33:08 wwwlb02 pingd: [8475]: notice: pingd_nstatus_callback:
> > Status update: Ping node wwwlb01.microcenter.com now has status [dead]
> > Oct 30 13:33:12 wwwlb02 stonithd: [30790]: WARN: host list for
> > cl_stonith_lb02:0 is empty, please fix your constraints
> > Oct 30 13:33:12 wwwlb02 stonithd: [6413]: WARN: start
> cl_stonith_lb02:0
> > failed, because its hostlist is empty
> > Oct 30 13:33:12 wwwlb02 crmd: [6415]: info: process_lrm_event: LRM
> > operation cl_stonith_lb02:0_start_0 (call=12, rc=2) complete
> > Oct 30 13:33:13 wwwlb02 lrmd: [6412]: info: rsc:cl_stonith_lb02:0:
> stop
> > Oct 30 13:33:13 wwwlb02 stonithd: [6413]: notice: try to stop a
> resource
> > cl_stonith_lb02:0 who is not in started resource queue.
> > Oct 30 13:33:13 wwwlb02 crmd: [6415]: info: do_lrm_rsc_op: Performing
> > op=cl_stonith_lb02:0_stop_0
> > key=1:8:0:efbdb124-d51a-4228-80bc-7a9464d7971a)
> > Oct 30 13:33:13 wwwlb02 lrmd: [30842]: info: Try to stop STONITH
> > resource <rsc_id=cl_stonith_lb02:0> : Device=external/riloe
> > Oct 30 13:33:13 wwwlb02 crmd: [6415]: info: process_lrm_event: LRM
> > operation cl_stonith_lb02:0_stop_0 (call=13, rc=0) complete
> > 
> > 
> > 
> > Looks like I should specify some additional nvpairs for the ILOs.  The
> > WARN host list empty message is what looks bad to me.  Here is the cib
> > section for the clone resource and the cib constraint for this resource.
> > Please let me know if there are any obvious errors in this
> > configuration.  This is the stonith resource that is to shoot the 02
> > node, intended to run on the 01 node (the 01 node was the node that had
> > a network cable disconnect).
> > 
> > 
> >      <clone id="cl_stonithset_lb02">
> >        <meta_attributes id="cl_stonithset_lb02_meta_attrs">
> >          <attributes>
> >            <nvpair id="cl_stonithset_lb02_metaattr_target_role" name="target_role" value="started"/>
> >            <nvpair id="cl_stonithset_lb02_metaattr_clone_max" name="clone_max" value="1"/>
> >            <nvpair id="cl_stonithset_lb02_metaattr_clone_node_max" name="clone_node_max" value="1"/>
> >          </attributes>
> >        </meta_attributes>
> >        <primitive id="cl_stonith_lb02" class="stonith" type="external/riloe" provider="heartbeat">
> >          <instance_attributes id="cl_stonith_lb02_instance_attrs">
> >            <attributes>
> >              <nvpair id="76163fb5-05ea-4cff-9786-a817774d8224" name="hostlist" value="wwwlb02.microcenter.com"/>
> >              <nvpair id="238e0158-81d3-48fd-879a-494c76d96b80" name="ilo_hostname" value="10.100.254.162"/>
> >              <nvpair id="82de3d5d-6f96-44f0-b98f-6eea75704b33" name="ilo_user" value="Administrator"/>
> >              <nvpair id="0fdef60a-fe62-4a0d-8f8f-d8da1d42082a" name="ilo_password" value="PASSWORD"/>
> >            </attributes>
> >          </instance_attributes>
> >          <operations>
> >            <op id="2a33ffe8-371f-4d08-a1ea-373135e85aeb" name="monitor" interval="30" timeout="20" start_delay="15" disabled="false" role="Started" on_fail="restart"/>
> >            <op id="4694393c-e89b-4371-af1c-a60d7f305e2f" name="start" timeout="20" start_delay="0" disabled="false" role="Started" on_fail="restart"/>
> >          </operations>
> >          <meta_attributes id="cl_stonith_lb02:0_meta_attrs">
> >            <attributes>
> >              <nvpair id="cl_stonith_lb02:0_metaattr_target_role" name="target_role" value="started"/>
> >            </attributes>
> >          </meta_attributes>
> >        </primitive>
> >      </clone>
> > 
> >      <constraints>
> >        <rsc_location id="location_on_lb01" rsc="cl_stonithset_lb02">
> >          <rule id="prefered_location_on_lb01" score="INFINITY">
> >            <expression attribute="#uname" id="c9e30917-97e2-4c35-86e7-9df6c7abc497" operation="eq" value="wwwlb01.microcenter.com"/>
> >          </rule>
> >        </rsc_location>
> >      </constraints>
> > 
> > Thanks,
> > -ab
> > 
> > _______________________________________________
> > Pacemaker mailing list
> > Pacemaker at clusterlabs.org
> > http://list.clusterlabs.org/mailman/listinfo/pacemaker
> > 