[Pacemaker] trying to stabilize sbd stonith

Sander van Vugt mail at sandervanvugt.nl
Thu Feb 25 14:49:02 EST 2010


Some additions.
On Thu, 2010-02-25 at 20:30 +0100, Sander van Vugt wrote:
> Following up on the message that I sent yesterday. In my 2-node test
> cluster, sbd stonith works in an excellent way. In my customer's 3-node
> cluster it almost works in an excellent way. That is: I've got one node
> that is stuck in a continuous STONITH loop. It comes up with status
> online, then online becomes online(clean), after which it gets fenced
> and restarts. I think it's kind of cool to see that it works, but
> I would like to get out of this loop. I've got the feeling that I'm
> missing something very obvious. What I know is that it does see the
> stonith device. But: the softdog watchdog module doesn't want to load,
> and I have no clue what the watchdog module for this server (Dell
> PowerEdge 2950) might be. Or am I looking in the wrong direction?
> (Since I suspect the cause is something obvious, I have not attached
> log files or other information yet.)

So I decided to have a look at the logs anyway, after verifying that
I had applied the complete procedure that Lars sent me yesterday. It now
appears that the meatware stonith resource that I used for testing
purposes is doing something nasty. Here's what happens:

1.	I start openais (rcopenais start) on node3 (for some reason it
doesn't come up automatically).
2.	It comes up; node1 sees that and says "hey, I've got a meatware
stonith waiting for you; admin, please run meatclient -c node3 to make
sure it's gone", and then the third node reboots.

Now the interesting part is that I've already removed the meatware
stonith agent from the cluster, so it looks like it is still lingering
around like a zombie. Is there any way to get this meatclient zombie out
of the system without actually restarting the entire cluster?
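
One check that might help narrow this down is to search the cluster
configuration for any leftover mention of meatware. What follows is only
a minimal sketch: count_meatware_refs is a hypothetical helper name, and
it assumes the CIB XML is piped in (for example from cibadmin -Q on a
cluster node). A count of 0 would only show that the configuration is
clean; it says nothing about a fencing request that is still pending.

```shell
# Hypothetical helper: count the lines on stdin that mention "meatware"
# (case-insensitive). grep exits non-zero when nothing matches, so
# "|| true" keeps the function's exit status at 0; the count "0" is
# still printed in that case.
count_meatware_refs() {
    grep -ci 'meatware' || true
}

# Assumed usage on a cluster node (not run here):
#   cibadmin -Q | count_meatware_refs
```

If the count is non-zero, the matching lines themselves can be shown with
cibadmin -Q | grep -i meatware to see which definition is still present.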


