[Pacemaker] Question on ILO stonith resource config and restarting

Tue Nov 4 14:20:42 EST 2008

Thanks for looking into this further.

I have pulled down the 'tip' version of Linux-HA and copied the new
./lib/plugins/stonith/external/riloe into the system install path (a
diff shows significant changes), then rebooted both nodes in this
cluster.
I started the same test again: Node 1 loses its primary network
connection to the LAN, so it cannot get status from or connect to the
Stonith device (ILO) for Node 2.

The monitor process for riloe appears to time out, and it goes
downhill from there (here are the log entries from Node 1, which lost
the network connection):

Nov  4 13:25:28 wwwlb01 kernel: bnx2: eth0 NIC Copper Link is Down
Nov  4 13:25:58 wwwlb01 lrmd: [8224]: WARN: cl_stonith_lb02:0:monitor
process (PID 9213) timed out (try 1).  Killing with signal SIGTERM (15).
Nov  4 13:25:58 wwwlb01 lrmd: [9213]: ERROR: stonithd_receive_ops_result
failed.
Nov  4 13:25:58 wwwlb01 lrmd: [8224]: WARN: mapped the invalid return
code 254.
Nov  4 13:25:58 wwwlb01 crmd: [8227]: info: process_lrm_event: LRM
operation cl_stonith_lb02:0_monitor_30000 (call=10, rc=1) complete
...
Nov  4 13:25:59 wwwlb01 crmd: [8227]: info: do_lrm_rsc_op: Performing
op=cl_stonith_lb02:0_stop_0
key=5:3:0:1eb0bdb2-c828-4b6d-b712-cf7049c775df)
Nov  4 13:25:59 wwwlb01 lrmd: [8224]: info: rsc:cl_stonith_lb02:0: stop
...
Nov  4 13:25:59 wwwlb01 lrmd: [9898]: info: Try to stop STONITH resource
<rsc_id=cl_stonith_lb02:0> : Device=external/riloe
...
Nov  4 13:26:00 wwwlb01 crmd: [8227]: info: process_lrm_event: LRM
operation cl_stonith_lb02:0_monitor_30000 (call=10, rc=-2) Cancelled
Nov  4 13:26:00 wwwlb01 crmd: [8227]: info: process_lrm_event: LRM
operation cl_stonith_lb02:0_stop_0 (call=12, rc=0) complete
Nov  4 13:26:01 wwwlb01 crmd: [8227]: info: do_lrm_rsc_op: Performing
op=cl_stonith_lb02:0_start_0
key=19:3:0:1eb0bdb2-c828-4b6d-b712-cf7049c775df)
Nov  4 13:26:01 wwwlb01 lrmd: [8224]: info: rsc:cl_stonith_lb02:0: start
Nov  4 13:26:01 wwwlb01 lrmd: [9902]: info: Try to start STONITH
resource <rsc_id=cl_stonith_lb02:0> : Device=external/riloe
Nov  4 13:26:01 wwwlb01 stonithd: [8225]: info: Cannot get parameter
ilo_can_reset from StonithNVpair
Nov  4 13:26:01 wwwlb01 stonithd: [8225]: info: Cannot get parameter
ilo_protocol from StonithNVpair
Nov  4 13:26:01 wwwlb01 stonithd: [8225]: info: Cannot get parameter
ilo_powerdown_method from StonithNVpair
...
Nov  4 13:26:13 wwwlb01 stonithd: [9904]: info: external_run_cmd:
Calling '/usr/lib64/stonith/plugins/external/riloe status' returned 256
Nov  4 13:26:13 wwwlb01 stonithd: [8225]: WARN: start cl_stonith_lb02:0
failed, because its hostlist is empty
Nov  4 13:26:13 wwwlb01 crmd: [8227]: info: process_lrm_event: LRM
operation cl_stonith_lb02:0_start_0 (call=13, rc=1) complete
Nov  4 13:26:14 wwwlb01 crmd: [8227]: info: do_lrm_rsc_op: Performing
op=cl_stonith_lb02:0_stop_0
key=4:4:0:1eb0bdb2-c828-4b6d-b712-cf7049c775df)
Nov  4 13:26:14 wwwlb01 lrmd: [8224]: info: rsc:cl_stonith_lb02:0: stop
Nov  4 13:26:14 wwwlb01 lrmd: [9917]: info: Try to stop STONITH resource
<rsc_id=cl_stonith_lb02:0> : Device=external/riloe
Nov  4 13:26:14 wwwlb01 stonithd: [8225]: notice: try to stop a resource
cl_stonith_lb02:0 who is not in started resource queue.
Nov  4 13:26:14 wwwlb01 crmd: [8227]: info: process_lrm_event: LRM
operation cl_stonith_lb02:0_stop_0 (call=14, rc=0) complete
Nov  4 13:26:19 wwwlb01 cib: [8223]: info: cib_stats: Processed 44
operations (3409.00us average, 0% utilization) in the last 10min
Nov  4 13:27:34 wwwlb01 kernel: bnx2: eth0 NIC Copper Link is Up, 100
Mbps full duplex
Nov  4 13:27:35 wwwlb01 heartbeat: [5969]: info: Link
wwwlb02.microcenter.com:eth0 up.
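
A side note on the "returned 256" entry above: assuming that number is
a raw wait()-style status (the kind pclose() or os.system() hands back)
rather than the script's own exit code, it decodes to exit code 1.  That
is my assumption about how external_run_cmd reports the value; the log
itself does not confirm it.  A minimal Python sketch of the decoding:

```python
import os

# Value taken from the "returned 256" log line. Treating it as a raw
# wait status is an assumption, not something the log states.
raw_status = 256

if os.WIFEXITED(raw_status):
    # On Linux the exit code sits in bits 8-15 of the status word.
    exit_code = os.WEXITSTATUS(raw_status)
    print("exit code: %d" % exit_code)
elif os.WIFSIGNALED(raw_status):
    # The process was terminated by a signal instead of exiting.
    print("killed by signal: %d" % os.WTERMSIG(raw_status))
```

If that reading is right, the riloe status call simply failed with a
generic error, which would fit with the ILO being unreachable.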

In playing with the riloe Python script, my guess is that the call to
HTTPSConnection hangs and is later killed by lrmd.  It
looks like Python 2.6 added a timeout argument to the HTTPSConnection
call.  The system is running 2.4.3 so I couldn't test it.  I do see that
the socket timeout can be set like this:
	socket.setdefaulttimeout(1)
I will follow this up by saying that my Python skills are very rusty.
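
To make the idea concrete, here is a minimal sketch of a reachability
check that sets the default socket timeout and restores it afterwards.
The function name probe and the 5 second default are my own
placeholders, not anything taken from riloe, and it is written for a
current Python interpreter (2.4 would need the combined
try/except/finally split into nested try blocks).

```python
import socket

def probe(host, port, timeout=5.0):
    """Return True if a TCP connect to host:port succeeds within timeout.

    Hypothetical helper for illustration only; not part of riloe.
    """
    previous = socket.getdefaulttimeout()
    # Applies to every socket created from here on, which is why the
    # old value is restored in the outer finally block.
    socket.setdefaulttimeout(timeout)
    try:
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        try:
            sock.connect((host, port))
            return True
        except socket.error:
            # Covers timeouts as well as refused or unreachable hosts.
            return False
        finally:
            sock.close()
    finally:
        socket.setdefaulttimeout(previous)
```

Something like probe("10.100.254.162", 443) before the HTTPS call would
let the agent fail quickly rather than hang until lrmd kills it (port
443 is an assumption on my part).  Whether failing fast is the right
behavior for a stonith agent is exactly what I am asking below.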

I am trying to find out what the expected behavior should be for a
timeout on a start or monitor command.  Should Stonith agents follow the
OCF resource agent specs?

Thanks,
-ab

-----Original Message-----
From: pacemaker-bounces at clusterlabs.org
[mailto:pacemaker-bounces at clusterlabs.org] On Behalf Of Dejan
Muhamedagic
Sent: Tuesday, November 04, 2008 11:26 AM
To: The Pacemaker cluster resource manager
Subject: Re: [Pacemaker] Question on ILO stonith resource config and
restarting

On Thu, Oct 30, 2008 at 03:07:24PM -0400, Aaron Bush wrote:
> Just realized that I only included the log entries from the node that
> was not experiencing a network disconnect.  Attached are the log
> entries from the node (01) that had the stonith resource running
> before the cable disconnect and looks like they provide some more
> useful information.  Also included up through when the network cable
> was reconnected.

The monitor operation on riloe failed. You should definitely
upgrade.

Thanks,

Dejan

> 
> -ab
> 
> >> I have a 0.6 pacemaker/heartbeat cluster setup in a lab with
> >> resources as follows:
> >> 
> >> Group-lvs(ordered): two primitives -> ocf/IPaddr2 and
> >> ocf/ldirectord.
> >> Clone-pingd: set to monitor a couple of IPs and used to set a
> >> weight for where to run the LVS group.
> >> 
> >> -- This is the area that I have a question on --
> >> Clone-stonith-node1: HP ILO to shoot node1
> >> Clone-stonith-node2: HP ILO to shoot node2
> >> 
> >> I read on the old linux-ha site that using a clone for ILO/stonith
> >> was the way to go.  I'm not sure I see how this would work
> >> correctly and be preferred over a standard resource.  What I am
> >> confused about is this: the external/riloe stonith plugin only
> >> knows how to shoot one node so
> >
> >Please make sure that you use the latest edition of
> >external/riloe. The previous one didn't work under all
> >circumstances.
> 
> I am using the version that came with heartbeat-common-2.99.0-3.1
> (according to rpm -qf)
> 
> To clear my current issue where the stonith resource was not started
> (and since this is still in the lab) I have rebooted both nodes to
> start with a somewhat clean slate.  I have attempted to grab some more
> useful information from the logs on why the resource is not
> restarting.  Again I disconnected the LAN cable connecting a node to
> the rest of the network (a private HB channel is still available and
> the ILO is still up).  I noticed these entries in the log:
> 
> Oct 30 13:33:07 wwwlb02 crmd: [6415]: info: do_lrm_rsc_op: Performing
> op=cl_stonith_lb02:0_start_0
> key=18:7:0:efbdb124-d51a-4228-80bc-7a9464d7971a)
> Oct 30 13:33:07 wwwlb02 lrmd: [6412]: info: rsc:cl_stonith_lb02:0: start
> Oct 30 13:33:07 wwwlb02 lrmd: [30788]: info: Try to start STONITH
> resource <rsc_id=cl_stonith_lb02:0> : Device=external/riloe
> Oct 30 13:33:07 wwwlb02 stonithd: [6413]: info: Cannot get parameter
> ilo_can_reset from StonithNVpair
> Oct 30 13:33:07 wwwlb02 stonithd: [6413]: info: Cannot get parameter
> ilo_protocol from StonithNVpair
> Oct 30 13:33:07 wwwlb02 stonithd: [6413]: info: Cannot get parameter
> ilo_powerdown_method from StonithNVpair
> Oct 30 13:33:08 wwwlb02 heartbeat: [6202]: info: Link
> wwwlb01.microcenter.com:eth0 dead.
> Oct 30 13:33:08 wwwlb02 pingd: [8475]: notice: pingd_lstatus_callback:
> Status update: Ping node wwwlb01.microcenter.com now has status [dead]
> Oct 30 13:33:08 wwwlb02 pingd: [8475]: notice: pingd_nstatus_callback:
> Status update: Ping node wwwlb01.microcenter.com now has status [dead]
> Oct 30 13:33:12 wwwlb02 stonithd: [30790]: WARN: host list for
> cl_stonith_lb02:0 is empty, please fix your constraints
> Oct 30 13:33:12 wwwlb02 stonithd: [6413]: WARN: start
> cl_stonith_lb02:0 failed, because its hostlist is empty
> Oct 30 13:33:12 wwwlb02 crmd: [6415]: info: process_lrm_event: LRM
> operation cl_stonith_lb02:0_start_0 (call=12, rc=2) complete
> Oct 30 13:33:13 wwwlb02 lrmd: [6412]: info: rsc:cl_stonith_lb02:0: stop
> Oct 30 13:33:13 wwwlb02 stonithd: [6413]: notice: try to stop a
> resource cl_stonith_lb02:0 who is not in started resource queue.
> Oct 30 13:33:13 wwwlb02 crmd: [6415]: info: do_lrm_rsc_op: Performing
> op=cl_stonith_lb02:0_stop_0
> key=1:8:0:efbdb124-d51a-4228-80bc-7a9464d7971a)
> Oct 30 13:33:13 wwwlb02 lrmd: [30842]: info: Try to stop STONITH
> resource <rsc_id=cl_stonith_lb02:0> : Device=external/riloe
> Oct 30 13:33:13 wwwlb02 crmd: [6415]: info: process_lrm_event: LRM
> operation cl_stonith_lb02:0_stop_0 (call=13, rc=0) complete
> 
> 
> 
> Looks like I should specify some additional nvpairs for the ILOs.  The
> WARN host list empty message is what looks bad to me.  Here is the cib
> section for the clone resource and the cib constraint for this
> resource.  Please let me know if there are any obvious errors in this
> configuration.  This is the stonith resource that is to shoot the 02
> node, intended to run on the 01 node (the 01 node was the node that
> had a network cable disconnect).
> 
> 
> 	<clone id="cl_stonithset_lb02">
>          <meta_attributes id="cl_stonithset_lb02_meta_attrs">
>            <attributes>
>              <nvpair id="cl_stonithset_lb02_metaattr_target_role"
> name="target_role" value="started"/>
>              <nvpair id="cl_stonithset_lb02_metaattr_clone_max"
> name="clone_max" value="1"/>
>              <nvpair id="cl_stonithset_lb02_metaattr_clone_node_max"
> name="clone_node_max" value="1"/>
>            </attributes>
>          </meta_attributes>
>          <primitive id="cl_stonith_lb02" class="stonith"
> type="external/riloe" provider="heartbeat">
>            <instance_attributes id="cl_stonith_lb02_instance_attrs">
>              <attributes>
>                <nvpair id="76163fb5-05ea-4cff-9786-a817774d8224"
> name="hostlist" value="wwwlb02.microcenter.com"/>
>                <nvpair id="238e0158-81d3-48fd-879a-494c76d96b80"
> name="ilo_hostname" value="10.100.254.162"/>
>                <nvpair id="82de3d5d-6f96-44f0-b98f-6eea75704b33"
> name="ilo_user" value="Administrator"/>
>                <nvpair id="0fdef60a-fe62-4a0d-8f8f-d8da1d42082a"
> name="ilo_password" value="PASSWORD"/>
>              </attributes>
>            </instance_attributes>
>            <operations>
>              <op id="2a33ffe8-371f-4d08-a1ea-373135e85aeb"
> name="monitor" interval="30" timeout="20" start_delay="15"
> disabled="false" role="Started" on_fail="restart"/>
>              <op id="4694393c-e89b-4371-af1c-a60d7f305e2f"
> name="start" timeout="20" start_delay="0" disabled="false"
> role="Started" on_fail="restart"/>
>            </operations>
>            <meta_attributes id="cl_stonith_lb02:0_meta_attrs">
>              <attributes>
>                <nvpair id="cl_stonith_lb02:0_metaattr_target_role"
> name="target_role" value="started"/>
>              </attributes>
>            </meta_attributes>
>          </primitive>
>        </clone>
> 
>      <constraints>
>        <rsc_location id="location_on_lb01" rsc="cl_stonithset_lb02">
>          <rule id="prefered_location_on_lb01" score="INFINITY">
>            <expression attribute="#uname"
> id="c9e30917-97e2-4c35-86e7-9df6c7abc497" operation="eq"
> value="wwwlb01.microcenter.com"/>
>          </rule>
>        </rsc_location>
>      </constraints>
> 
> Thanks,
> -ab
> 
> _______________________________________________
> Pacemaker mailing list
> Pacemaker at clusterlabs.org
> http://list.clusterlabs.org/mailman/listinfo/pacemaker
> 
