[Pacemaker] Re: Problems when DC node is STONITH'ed.

Satomi TANIGUCHI taniguchis at intellilink.co.jp
Tue Oct 21 04:35:03 EDT 2008

Hi Dejan,

Dejan Muhamedagic wrote:
> Hi Satomi-san,
> On Thu, Oct 16, 2008 at 03:43:36PM +0900, Satomi TANIGUCHI wrote:
>> Hi Dejan,
>> Dejan Muhamedagic wrote:
>>> Hi Satomi-san,
>>> On Tue, Oct 14, 2008 at 07:07:00PM +0900, Satomi TANIGUCHI wrote:
>>>> Hi,
>>>> I found that there are 2 problems when DC node is STONITH'ed.
>>>> (1) STONITH operation is executed two times.
>>> This has been discussed at length in bugzilla, see
>>> http://developerbugs.linux-foundation.org/show_bug.cgi?id=1904
>>> which was resolved with WONTFIX. In short, it was deemed too risky
>>> to implement a remedy for this problem.  Of course, if you think
>>> you can add more to the discussion, please go ahead.
>> Sorry, I missed it.
> Well, you couldn't have known about it :)
>> Thank you for pointing that out!
>> I understand how it came about.
>> Ideally, when the DC node is going to be STONITH'ed,
>> a new DC would be elected first and it would STONITH the ex-DC;
>> then these problems would not occur.
>> But maybe that is not a good approach from the viewpoint of emergency
>> handling, because the ex-DC should be STONITH'ed as soon as possible.
> Yes, you're right about this.
>> Anyway, I understand this is an expected behavior, thanks!
>> But then, it seems that the tengine has to keep a timeout while waiting
>> for stonithd's result, so a long cluster-delay is still required.
> If I understood Andrew correctly, the tengine will wait forever,
> until stonithd sends a message. Or dies which, let's hope, won't
> happen.
My understanding is the same as yours.

>> Because the second STONITH is requested when that transition times out.
>> I'm afraid that I misunderstood the true meaning of what Andrew said.
> In the bugzilla? If so, please reopen and voice your concerns.
I asked him again in bugzilla, thanks!

>>>> (2) The timeout for which stonithd on the DC node waits for the
>>>>     result of a STONITH op from another node is
>>>>     always set to "stonith-timeout" in <cluster_property_set>.
>>>> [...]
>>>> The case (2):
>>>> When this timeout expires on the DC's stonithd
>>>> while a non-DC node's stonithd is still trying to reset the DC,
>>>> the DC's stonithd will send a request to another node,
>>>> and two or more STONITH plugins are executed in parallel.
>>>> This is a troublesome problem.
>>>> The most suitable value for this timeout might be
>>>> the sum total of the "stonith-timeout" values of the STONITH plugins
>>>> on the node which receives the STONITH request from the DC node, I think.
>>> This would probably be very difficult for the CRM to get.
>> Right, I agree with you.
>> With the following sentence, "But DC node can't know that...",
>> I meant "it is difficult because stonithd on the DC can't know the values
>> of stonith-timeout on other nodes."
>>>> But DC node can't know that...
>>>> I would like to hear your opinions.
>>> Sorry, but I couldn't exactly follow. Could you please describe
>>> it in terms of actions.
>> Sorry, I restate what I meant.
>> By rights, the timeout for which stonithd on the DC waits for the return
>> from another node's stonithd needs to be longer than the sum total of the
>> "stonith-timeout" values of the STONITH plugins on that node.
>> But it is very difficult for the DC's stonithd to get those values.
>> So I would like to hear your opinion about a suitable and practical
>> value for this timeout, which is set in insert_into_executing_queue().
>> I hope I conveyed what I wanted to say.
> OK, I suppose I understand now. You're talking about the timeouts
> for remote fencing operations, right? And the originating
> stonithd hasn't got a clue on how long the remote fencing
> operation may take. Well, that could be a problem. I can't think
> of anything to resolve that completely, not without "rewiring"
> stonithd. stonithd broadcasts the request so there's no way for
> it to know who's doing what and when and how long it can take.
> The only workaround I can think of is to use the global (cluster
> property) stonith-timeout which should be set to the maximum sum
> of stonith timeouts for a node.
All right.
I misunderstood the role of the global stonith-timeout.
I took it to be just the default value for each plugin's stonith-timeout,
the way default-action-timeout is for each operation.
To use stonith-timeout correctly (without troublesome timeouts),
we should keep to the following, right?
  - set stonith-timeout for every STONITH plugin.
  - set the global stonith-timeout to the maximum sum of stonith timeouts
    for a node.
  - (set cluster-delay to a value longer than the global stonith-timeout,
     at least at present.)
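A minimal CIB sketch of those three points (the ids, timeout values, and
instance_attributes nesting here are illustrative, not taken from the
actual cluster):

```xml
<!-- Sketch only: two plugins with 60s and 120s timeouts, so the global
     stonith-timeout is at least their sum (180s), and cluster-delay is
     kept longer still. -->
<cluster_property_set id="cib-bootstrap-options">
  <nvpair id="opt-stonith-timeout" name="stonith-timeout" value="180s"/>
  <nvpair id="opt-cluster-delay" name="cluster-delay" value="240s"/>
</cluster_property_set>
<primitive id="prmStonithN1" class="stonith" type="external/ssh">
  <instance_attributes id="prmStonithN1-ia">
    <nvpair id="prmStonithN1-to" name="stonith-timeout" value="60s"/>
  </instance_attributes>
</primitive>
<primitive id="prmStonithN2" class="stonith" type="external/ssh">
  <instance_attributes id="prmStonithN2-ia">
    <nvpair id="prmStonithN2-to" name="stonith-timeout" value="120s"/>
  </instance_attributes>
</primitive>
```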

> Now, back to reality ;-)  Timeouts are important, of course, but
> one should usually leave a generous margin on top of the expected
> duration. For instance, if the normal timeout for an operation on
> a device is 30 seconds, there's nothing wrong in setting it to
> say one or two minutes. The consequences of an operation ending
> prematurely are much more serious than if one waits a bit longer.
> After all, if there's something really wrong, it is usually
> detected early and the error reported immediately. Of course,
> one shouldn't follow this advice blindly. Know your cluster!
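For instance (a hypothetical resource, not from this cluster), an operation
that normally finishes within seconds can safely carry a much larger timeout:

```xml
<!-- Hypothetical: monitor usually completes in well under 30s,
     but the timeout is left at a generous 120s. -->
<primitive id="prmDummy" class="ocf" provider="heartbeat" type="Dummy">
  <operations>
    <op id="prmDummy-mon" name="monitor" interval="30s" timeout="120s"/>
  </operations>
</primitive>
```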

>> For reference, I attached logs when the aforesaid timeout occurs.
>> The cluster has 3 nodes.
>> When DC was going to be STONITH'ed, DC sent a request all of non-DC nodes,
>> and all of them tried to shut down the DC.
> No, the tengine (running on DC) always talks to the local
> stonithd.
I meant "stonithd on DC broadcast a request"
with the sentence "DC sent a request all of non-DC nodes".
I'm sorry for being ambiguous.

>> And when the timeout occurred on the DC's stonithd, it sent the same request
>> again, so two or more STONITH plugins worked in parallel on every non-DC node.
>> (Please see sysstats.txt.)
>> I want to make clear whether the current behavior is expected or a bug.
> That's actually wrong, but could be considered a configuration
> problem:
>   <cluster_property_set id="cib-bootstrap-options">
>   ...
>         <nvpair id="nvpair.id2000009" name="stonith-timeout" value="260s"/>
>   ...
>   <primitive id="prmStonithN1" class="stonith" type="external/ssh">
>   ...
> 	  <nvpair id="nvpair.id2000602" name="stonith-timeout" value="390s"/>
> The stonithd initiator (the one running on the DC) times out
> before the remote fencing operation. On retry a second remote
> fencing operation is started. That's why you see two of them.
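In other words, the initiator's global 260s expires before the plugin's 390s.
A sketch of a configuration that avoids the retry simply keeps the global value
above the per-plugin one (same nvpairs as quoted above, only the global value
changed):

```xml
<!-- Sketch: global stonith-timeout raised above the 390s plugin timeout,
     so the initiating stonithd does not time out and retry first. -->
<nvpair id="nvpair.id2000009" name="stonith-timeout" value="400s"/>
...
<nvpair id="nvpair.id2000602" name="stonith-timeout" value="390s"/>
```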
I set these values because I wanted to see what would happen
when the timeout for a remote fencing op occurs, and I intended to bring it up
on the mailing list if any curious behavior appeared. ;)

> Anyway, you can open a bugzilla for this, because the stonithd on
> a remote host should know that there's already one operation
> running. Unfortunately, I'm busy with more urgent matters right
> now, so it may take a few weeks until I take a look at it.
> As usual, patches are welcome :)
I posted it to bugzilla.
I'm sorry to bother you.

Best Regards,

> Thanks,
> Dejan
>> But I consider that the root of all these problems is that the node which
>> sends the STONITH request and waits for completion of the op is itself killed.
>> Regards,
>>> Thanks,
>>> Dejan
>>>> Best Regards,
>>>> Satomi TANIGUCHI
>>>> _______________________________________________________
>>>> Linux-HA-Dev: Linux-HA-Dev at lists.linux-ha.org
>>>> http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
>>>> Home Page: http://linux-ha.org/
>> _______________________________________________
>> Pacemaker mailing list
>> Pacemaker at clusterlabs.org
>> http://list.clusterlabs.org/mailman/listinfo/pacemaker
