[Pacemaker] Question on ILO stonith resource config and restarting

Thu Oct 30 15:07:24 EDT 2008

I just realized that I only included the log entries from the node that
was not experiencing a network disconnect.  Attached are the log entries
from the node (01) that had the stonith resource running before the
cable disconnect; they look like they provide some more useful
information.  They run up through the point when the network cable was
reconnected.

-ab

>> I have a 0.6 pacemaker/heartbeat cluster setup in a lab with resources
>> as follows:
>> 
>> Group-lvs(ordered): two primitives -> ocf/IPaddr2 and ocf/ldirectord.
>> Clone-pingd: set to monitor a couple of IPs and used to set a weight for
>> where to run the LVS group.
>> 
>> -- This is the area that I have a question on --
>> Clone-stonith-node1: HP ILO to shoot node1
>> Clone-stonith-node2: HP ILO to shoot node2
>> 
>> I read on the old linux-ha site that using a clone for ILO/stonith was
>> the way to go.  I'm not sure I see how this would work correctly and be
>> preferred over a standard resource.  What I am confused about is this:
>> the external/riloe stonith plugin only knows how to shoot one node so
>
>Please make sure that you use the latest edition of
>external/riloe. The previous one didn't work under all
>circumstances.

I am using the version that came with heartbeat-common-2.99.0-3.1
(according to rpm -qf).

To clear my current issue where the stonith resource was not started
(and since this is still in the lab) I have rebooted both nodes to start
with a somewhat clean slate.  I have attempted to grab some more useful
information from the logs on why the resource is not restarting.
Again I disconnected the LAN cable connecting a node to the rest of the
network (a private HB channel is still available and the ILO is still
up).  I noticed these entries in the log:

Oct 30 13:33:07 wwwlb02 crmd: [6415]: info: do_lrm_rsc_op: Performing op=cl_stonith_lb02:0_start_0 key=18:7:0:efbdb124-d51a-4228-80bc-7a9464d7971a)
Oct 30 13:33:07 wwwlb02 lrmd: [6412]: info: rsc:cl_stonith_lb02:0: start
Oct 30 13:33:07 wwwlb02 lrmd: [30788]: info: Try to start STONITH resource <rsc_id=cl_stonith_lb02:0> : Device=external/riloe
Oct 30 13:33:07 wwwlb02 stonithd: [6413]: info: Cannot get parameter ilo_can_reset from StonithNVpair
Oct 30 13:33:07 wwwlb02 stonithd: [6413]: info: Cannot get parameter ilo_protocol from StonithNVpair
Oct 30 13:33:07 wwwlb02 stonithd: [6413]: info: Cannot get parameter ilo_powerdown_method from StonithNVpair
Oct 30 13:33:08 wwwlb02 heartbeat: [6202]: info: Link wwwlb01.microcenter.com:eth0 dead.
Oct 30 13:33:08 wwwlb02 pingd: [8475]: notice: pingd_lstatus_callback: Status update: Ping node wwwlb01.microcenter.com now has status [dead]
Oct 30 13:33:08 wwwlb02 pingd: [8475]: notice: pingd_nstatus_callback: Status update: Ping node wwwlb01.microcenter.com now has status [dead]
Oct 30 13:33:12 wwwlb02 stonithd: [30790]: WARN: host list for cl_stonith_lb02:0 is empty, please fix your constraints
Oct 30 13:33:12 wwwlb02 stonithd: [6413]: WARN: start cl_stonith_lb02:0 failed, because its hostlist is empty
Oct 30 13:33:12 wwwlb02 crmd: [6415]: info: process_lrm_event: LRM operation cl_stonith_lb02:0_start_0 (call=12, rc=2) complete
Oct 30 13:33:13 wwwlb02 lrmd: [6412]: info: rsc:cl_stonith_lb02:0: stop
Oct 30 13:33:13 wwwlb02 stonithd: [6413]: notice: try to stop a resource cl_stonith_lb02:0 who is not in started resource queue.
Oct 30 13:33:13 wwwlb02 crmd: [6415]: info: do_lrm_rsc_op: Performing op=cl_stonith_lb02:0_stop_0 key=1:8:0:efbdb124-d51a-4228-80bc-7a9464d7971a)
Oct 30 13:33:13 wwwlb02 lrmd: [30842]: info: Try to stop STONITH resource <rsc_id=cl_stonith_lb02:0> : Device=external/riloe
Oct 30 13:33:13 wwwlb02 crmd: [6415]: info: process_lrm_event: LRM operation cl_stonith_lb02:0_stop_0 (call=13, rc=0) complete

It looks like I should specify some additional nvpairs for the ILOs.  The
WARN "host list ... is empty" message is what looks bad to me.  Here is
the cib section for the clone resource and the cib constraint for this
resource.  Please let me know if there are any obvious errors in this
configuration.  This is the stonith resource that is to shoot the 02
node, intended to run on the 01 node (the 01 node was the node that had
the network cable disconnected).

    <clone id="cl_stonithset_lb02">
      <meta_attributes id="cl_stonithset_lb02_meta_attrs">
        <attributes>
          <nvpair id="cl_stonithset_lb02_metaattr_target_role" name="target_role" value="started"/>
          <nvpair id="cl_stonithset_lb02_metaattr_clone_max" name="clone_max" value="1"/>
          <nvpair id="cl_stonithset_lb02_metaattr_clone_node_max" name="clone_node_max" value="1"/>
        </attributes>
      </meta_attributes>
      <primitive id="cl_stonith_lb02" class="stonith" type="external/riloe" provider="heartbeat">
        <instance_attributes id="cl_stonith_lb02_instance_attrs">
          <attributes>
            <nvpair id="76163fb5-05ea-4cff-9786-a817774d8224" name="hostlist" value="wwwlb02.microcenter.com"/>
            <nvpair id="238e0158-81d3-48fd-879a-494c76d96b80" name="ilo_hostname" value="10.100.254.162"/>
            <nvpair id="82de3d5d-6f96-44f0-b98f-6eea75704b33" name="ilo_user" value="Administrator"/>
            <nvpair id="0fdef60a-fe62-4a0d-8f8f-d8da1d42082a" name="ilo_password" value="PASSWORD"/>
          </attributes>
        </instance_attributes>
        <operations>
          <op id="2a33ffe8-371f-4d08-a1ea-373135e85aeb" name="monitor" interval="30" timeout="20" start_delay="15" disabled="false" role="Started" on_fail="restart"/>
          <op id="4694393c-e89b-4371-af1c-a60d7f305e2f" name="start" timeout="20" start_delay="0" disabled="false" role="Started" on_fail="restart"/>
        </operations>
        <meta_attributes id="cl_stonith_lb02:0_meta_attrs">
          <attributes>
            <nvpair id="cl_stonith_lb02:0_metaattr_target_role" name="target_role" value="started"/>
          </attributes>
        </meta_attributes>
      </primitive>
    </clone>

    <constraints>
      <rsc_location id="location_on_lb01" rsc="cl_stonithset_lb02">
        <rule id="prefered_location_on_lb01" score="INFINITY">
          <expression attribute="#uname" id="c9e30917-97e2-4c35-86e7-9df6c7abc497" operation="eq" value="wwwlb01.microcenter.com"/>
        </rule>
      </rsc_location>
    </constraints>
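
For anyone comparing the log against the configuration, here is a minimal
Python sketch (not part of heartbeat or pacemaker) that reads the instance
attributes of the primitive and checks them against the three parameters
named in the "Cannot get parameter" info messages.  The nvpair ids are
shortened placeholders, and the function names are made up for this
example.  It only shows what is and is not set in the CIB; it does not
explain why stonithd reports an empty host list.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the primitive from the CIB section above.
# The nvpair ids are shortened placeholders, not the real UUIDs.
CIB_SNIPPET = """
<primitive id="cl_stonith_lb02" class="stonith" type="external/riloe" provider="heartbeat">
  <instance_attributes id="cl_stonith_lb02_instance_attrs">
    <attributes>
      <nvpair id="a" name="hostlist" value="wwwlb02.microcenter.com"/>
      <nvpair id="b" name="ilo_hostname" value="10.100.254.162"/>
      <nvpair id="c" name="ilo_user" value="Administrator"/>
      <nvpair id="d" name="ilo_password" value="PASSWORD"/>
    </attributes>
  </instance_attributes>
</primitive>
"""

# Parameters that stonithd logged as "Cannot get parameter ... from StonithNVpair".
LOGGED_AS_MISSING = ["ilo_can_reset", "ilo_protocol", "ilo_powerdown_method"]


def instance_params(primitive_xml):
    """Return {name: value} for every nvpair under instance_attributes."""
    root = ET.fromstring(primitive_xml)
    return {
        nv.get("name"): nv.get("value")
        for nv in root.findall("./instance_attributes/attributes/nvpair")
    }


def report(primitive_xml):
    """Return the configured hostlist and the logged parameters absent from the CIB."""
    params = instance_params(primitive_xml)
    absent = [name for name in LOGGED_AS_MISSING if name not in params]
    return params.get("hostlist"), absent


hostlist, absent = report(CIB_SNIPPET)
print("hostlist in CIB:", hostlist)
print("not set in CIB:", ", ".join(absent))
```

In this configuration the hostlist nvpair is present in the CIB, while
none of the three ilo_* parameters from the info messages is set.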

Thanks,
-ab


-------------- next part --------------
A non-text attachment was scrubbed...
Name: log.gz
Type: application/x-gzip
Size: 4685 bytes
Desc: log.gz
URL: <https://lists.clusterlabs.org/pipermail/pacemaker/attachments/20081030/e9cb22c3/attachment-0001.bin>