[Pacemaker] STONITH Deathmatch Explained

Tim Serong tim at wirejunkie.com
Mon Aug 10 10:41:42 EDT 2009

I wrote:
>>> I've written up a brief document entitled "STONITH Deathmatch Explained
>>> (and Some Hints for Resource Agent Authors and Systems Engineers)":
>>>   http://ourobengr.com/ha
>>> ...

Then Dejan Muhamedagic wrote:
>> ...
>> - in "Causes ..." you missed to mention split-brain (no
>>  communication channels working) and, at the same time, to
>>  stress how important it is to have redundant communications :)
>> - even though you mention that in the title, I'd still move the
>>  resource agent intricacies into another document; they are all
>>  very valuable advice, but of no concern to cluster
>>  administrators; it's also good to keep the focus on our little
>>  problem; then you'll have to find other "Things You Didn't
>>  Think Of" (or just keep the title and leave the section empty:
>>  it is important; or insert another illustration)
>> - devote more space/thought to the part on how to avoid a
>>  "deathmatch"; there's only a mention on chkconfig within
>>  "Debugging ..." (or one can also use the "poweroff" fencing
>>  operation); also, note that this occurs only in cases reboot
>>  doesn't fix a problem (e.g. split-brain)

And Joe Armstrong wrote:
> ...You might want to also add a possibility
> to avoid the situation.  Don't allow heartbeat to be started by
> the RC scripts.  Once a machine has been STONITH'd you can consider
> that it is untrustworthy until the admin inspects the reason for
> the failure and manually allows the node back into the cluster.
> This same thinking is why I hate auto-failback...

For the record, I've made a couple of minor updates based on the above:

- Split-brain is added as a cause of STONITH.
- There's now a small section "Avoiding STONITH Deathmatch", which
  mentions ensuring redundant comms, not starting the cluster at boot
  time, and trying stonith-action=poweroff.
- There's a mention of the document still being applicable if you're
  using OpenAIS instead of Heartbeat.
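
For what it's worth, the last two suggestions in that new section boil down to
something like the sketch below. It rests on some assumptions that are not in
the document itself: a chkconfig-based distro, an init script named
"heartbeat" (it will be named differently under OpenAIS), and the crm shell
being installed. The run and avoid_deathmatch helpers are made up for
illustration. By default the script only prints what it would do.

```shell
#!/bin/sh
# Sketch only. Prints the commands by default; set APPLY=1 to execute them.
# Assumptions: chkconfig-based distro, init script named "heartbeat",
# crm shell available on this node.

run() {
    # Hypothetical helper: execute the command only when APPLY=1.
    if [ "${APPLY:-0}" = "1" ]; then
        "$@"
    else
        echo "would run: $*"
    fi
}

avoid_deathmatch() {
    # Keep the cluster stack from starting at boot, so a fenced node
    # stays out of the cluster until an admin has looked at it.
    run chkconfig heartbeat off

    # Have STONITH power the node off rather than reboot it.
    run crm configure property stonith-action=poweroff
}

avoid_deathmatch
```

Redundant communication paths are not shown here; they are set up in the
cluster stack's own configuration and depend on the hardware at hand.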

I haven't moved RA specifics into another document yet.  I have a nasty
feeling this might result in something larger that rattles on about the
importance of ensuring correct semantics for all operations (e.g.: the
"start" op shouldn't return success if the resource isn't really, truly,
actually, completely started yet, or you can wind up in one of those
wacky start[ok]->monitor[fail]->stop->start[ok]->monitor[fail]->stop
loops).
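
To make the "start" point concrete, here is a minimal sketch of a start
operation that does not report success until monitor agrees the resource is
up. Everything named demo_* is hypothetical, the "service" is just a marker
file that appears after a short delay, and the START_TRIES limit is an
assumption for the demo (a real cluster enforces its own operation timeout).
The return codes are the usual OCF values.

```shell
#!/bin/sh
# Sketch of a resource agent fragment; not taken from any real RA.
OCF_SUCCESS=0
OCF_ERR_GENERIC=1
OCF_NOT_RUNNING=7

# Stand-in for a real service: "running" means this marker file exists.
STATE_FILE="${STATE_FILE:-${TMPDIR:-/tmp}/demo_ra.$$.state}"

demo_launch() {
    # Simulates a daemon that needs a moment before it is really up.
    ( sleep 1; touch "$STATE_FILE" ) >/dev/null 2>&1 &
}

demo_monitor() {
    if [ -f "$STATE_FILE" ]; then
        return $OCF_SUCCESS
    fi
    return $OCF_NOT_RUNNING
}

demo_start() {
    # Already running: nothing to do.
    if demo_monitor; then
        return $OCF_SUCCESS
    fi

    demo_launch || return $OCF_ERR_GENERIC

    # Do not return success until monitor would also return success.
    tries=0
    while ! demo_monitor; do
        tries=$((tries + 1))
        if [ "$tries" -ge "${START_TRIES:-10}" ]; then
            return $OCF_ERR_GENERIC
        fi
        sleep 1
    done
    return $OCF_SUCCESS
}

demo_stop() {
    rm -f "$STATE_FILE"
    return $OCF_SUCCESS
}
```

If demo_start returned straight after demo_launch, the first monitor could
land inside the startup delay, fail, and trigger exactly the stop/start
cycle described above.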
