[Pacemaker] Question on ILO stonith resource config and restarting

Wed Oct 29 12:51:44 EDT 2008

I have a Pacemaker 0.6/Heartbeat cluster set up in a lab, with
resources as follows:

Group-lvs (ordered): two primitives -> ocf/IPaddr2 and ocf/ldirectord.
Clone-pingd: set to monitor a couple of IPs and used to set a weight for
where to run the LVS group.

-- This is the area that I have a question on --
Clone-stonith-node1: HP ILO to shoot node1
Clone-stonith-node2: HP ILO to shoot node2

I read on the old linux-ha site that using a clone for ILO/stonith was
the way to go.  I'm not sure I see how this would work correctly, or why
it would be preferred over a standard resource.  What confuses me is
this: the external/riloe stonith plugin only knows how to shoot one
node, so why would you run it as a clone when each external/riloe
instance is configured differently?  I went ahead and configured the
riloe resources as clones, trusting that the docs are correct and that
the reason would become obvious to me later.  (I also saw a similar post
with no response:
http://www.gossamer-threads.com/lists/linuxha/users/35685?nohighlight=1#35685)

I then noticed that my ILO clones were starting on the 'wrong' nodes:
the stonith resource to kill node2 was actually running on node2, which
is pointless if node2 locks up.  So I added resource constraints to
force each stonith clone to stay on a node other than the one it is
meant to shoot.  This seemed to work well.

The next issue is that when I disconnect the LAN cable that connects a
single node to the rest of the network, the clone stonith monitor fails
because it can't reach the other node's ILO for status.  After some
time (minutes, let's say) I reconnect the LAN cable, but the clone
stonith never comes back to life; it just stays failed.  What should I
be looking at to make sure that the clone stonith restarts properly?

Any advice on how to set up HP ILO stonith more properly in this
scenario would be greatly appreciated.  (I can see where a clone stonith
would be useful in a larger cluster of n>2 nodes, since every surviving
node would have a chance to shoot a failed node; maybe this is the
reason for cloned stonith with ILO?  Basically, in a cluster of N nodes,
each node would be running N-1 stonith resources, ready to shoot a dead
node.)
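
To make the N-1 idea concrete, here is a rough sketch of the placement
I have in mind.  It is plain Python, not Pacemaker configuration; the
function name and the "stonith-<node>" resource names are made up for
illustration and do not come from my CIB.  It only encodes the rule
that the resource which shoots a node may run anywhere except on that
node.

```python
def stonith_placement(nodes):
    """Map each host to the stonith resources it is allowed to run.

    Hypothetical naming: "stonith-X" is the resource that shoots node X.
    A host may run every stonith resource except the one targeting itself.
    """
    return {
        host: [f"stonith-{target}" for target in nodes if target != host]
        for host in nodes
    }


if __name__ == "__main__":
    # Three-node example: each host ends up with N-1 = 2 stonith resources.
    for host, resources in stonith_placement(["node1", "node2", "node3"]).items():
        print(host, resources)
```

With only two nodes this collapses to what I have now: node1 runs the
resource that shoots node2, and node2 runs the one that shoots node1.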

Thanks in advance,
-ab