[Pacemaker] Re: crm_mon shows nothing about stonith 'reset' failure

Takenaka Kazuhiro takenaka.kazuhiro at oss.ntt.co.jp
Tue Sep 16 21:07:02 EDT 2008


Hi Andrew,

 > The whole status section is periodically reconstructed - so any
 > stonith failures that were recorded there could be lost at any time.
 > So rather than store inconsistent and possibly incorrect data, we
 > don't store anything.

Thanks for the more detailed explanation.

 > STONITH is the single most critical part of the cluster.
 > Without a reliable STONITH mechanism, your cluster will not be able to
 > recover after some failures or, even worse, try to recover when it
 > should not have and corrupt all your data.
 >
 >
 > So if your STONITH mechanism is broken, then very clearly, _that_ is
 > your biggest problem.
 >
 >
 >>
 >> b) The only way to know stonith 'reset' failures is watching
 >>   the logs. Do I understand right?
 >
 > Unless something in stonithd changes. Yes.

Hmm... If STONITH is that important, there should all the more be an
intuitive way to monitor its activity.

I will post my ideas if any come to mind.
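
For now, the only workaround I can think of is to watch the log
directly. A rough sketch (assuming the default /var/log/ha-log
location and that stonithd reports failed 'reset' operations at
ERROR level, as you mentioned; the exact message text may differ
between versions, so the pattern would need adjusting):

  # tail -F /var/log/ha-log | grep -E 'stonithd.*(ERROR|Failed)'

Something like this would at least surface the failure without
digging through the whole log, but it is no substitute for seeing
it in crm_mon.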

> On Tue, Sep 16, 2008 at 11:48, Takenaka Kazuhiro
> <takenaka.kazuhiro at oss.ntt.co.jp> wrote:
>> Hi, Andrew
>>
>>> Nope.
>>> This is not stored anywhere since there is nowhere it can be
>>> reconstructed from (like the lrmd for resource operations) when
>>> rebuilding the status section.
>>
>> Why does the current cib.xml definition have no room for
>> stonith 'reset' failures? Simply not implemented? Or is
>> there any other reason?
> 
> I already gave you the reason.
> 
>>> ... since there is nowhere it can be
>>> reconstructed from (like the lrmd for resource operations) when
>>> rebuilding the status section.
> 
> The whole status section is periodically reconstructed - so any
> stonith failures that were recorded there could be lost at any time.
> So rather than store inconsistent and possibly incorrect data, we
> don't store anything.
> 
>>
>>> And if your stonith resources are failing, a) you have bigger
>>> problems, and b) you'll get nice big ERROR messages in the logs.
>>
>> a) I saw 'dummy' didn't fail over. Is this one of the "bigger problems"?
> 
> Depends what 'dummy' is.
> But assuming it's just a resource, then no, that's the least of your problems.
> 
> 
> STONITH is the single most critical part of the cluster.
> Without a reliable STONITH mechanism, your cluster will not be able to
> recover after some failures or, even worse, try to recover when it
> should not have and corrupt all your data.
> 
> 
> So if your STONITH mechanism is broken, then very clearly, _that_ is
> your biggest problem.
> 
> 
>>
>> b) The only way to know stonith 'reset' failures is watching
>>   the logs. Do I understand right?
> 
> Unless something in stonithd changes. Yes.
> 
>>
>>> On Tue, Sep 16, 2008 at 03:11, Takenaka Kazuhiro
>>> <takenaka.kazuhiro at oss.ntt.co.jp> wrote:
>>>>
>>>> Hi All,
>>>>
>>>> I ran a test to see what would happen when stonith 'reset' failed.
>>>> Before the test, I thought 'crm_mon' would show something about the
>>>> failure.
>>>
>>> Nope.
>>> This is not stored anywhere since there is nowhere it can be
>>> reconstructed from (like the lrmd for resource operations) when
>>> rebuilding the status section.
>>>
>>> And if your stonith resources are failing, a) you have bigger
>>> problems, and b) you'll get nice big ERROR messages in the logs.
>>>
>>>> But 'crm_mon' didn't show anything.
>>>>
>>>> What I did was the following.
>>>>
>>>> 1. I started the stonith-enabled two-node cluster. The names of
>>>>    the nodes were 'node01' and 'node02'. See the configuration
>>>>    files in the attached 'hb_reports.tgz' for more details.
>>>>
>>>>    I made a few modifications to 'ssh' for the test and renamed it
>>>>    to 'sshTEST'. I also attached 'sshTEST'. The differences are
>>>>    described in it.
>>>>
>>>> 2. I ran the following command.
>>>>
>>>>    # iptables -A INPUT -i eth3 -p tcp --dport 22 -j REJECT
>>>>
>>>>    'eth3' is connected to the network used by 'sshTEST'.
>>>>
>>>> 3. I deleted the state file of 'dummy' on 'node01'.
>>>>
>>>>    # rm -f /var/run/heartbeat/rsctmp/Dummy-dummy.state
>>>>
>>>> Soon the failure of 'dummy' was logged to /var/log/ha-log
>>>> and 'crm_mon' also displayed it.
>>>>
>>>> After a while the failure of the 'reset' performed by 'sshTEST'
>>>> was also logged, but 'crm_mon' didn't display it.
>>>>
>>>> Did I make any misconfiguration or mistake in operation that
>>>> made 'crm_mon' work incorrectly?
>>>>
>>>> Or does 'crm_mon' really not show anything about stonith 'reset'
>>>> failures?
>>>>
>>>> I used Heartbeat(e8154a602bf4) + Pacemaker(d4a14f276c28)
>>>> for this test.
>>>>
>>>> Best regards.
>>>> --
>>>> Takenaka Kazuhiro <takenaka.kazuhiro at oss.ntt.co.jp>
>>
>>
>> --
>> Takenaka Kazuhiro <takenaka.kazuhiro at oss.ntt.co.jp>
>>
>> _______________________________________________
>> Pacemaker mailing list
>> Pacemaker at clusterlabs.org
>> http://list.clusterlabs.org/mailman/listinfo/pacemaker
>>
-- 
Takenaka Kazuhiro <takenaka.kazuhiro at oss.ntt.co.jp>



