[Pacemaker] Re: Problems when DC node is STONITH'ed.

Dejan Muhamedagic dejanmm at fastmail.fm
Wed Oct 29 06:46:33 EDT 2008


Hi Satomi-san,

On Tue, Oct 21, 2008 at 05:35:03PM +0900, Satomi TANIGUCHI wrote:
> Hi Dejan,
>
>
> Dejan Muhamedagic wrote:
>> Hi Satomi-san,
>>
>> On Thu, Oct 16, 2008 at 03:43:36PM +0900, Satomi TANIGUCHI wrote:
>>> Hi Dejan,
>>>
>>>
>>> Dejan Muhamedagic wrote:
>>>> Hi Satomi-san,
>>>>
>>>> On Tue, Oct 14, 2008 at 07:07:00PM +0900, Satomi TANIGUCHI wrote:
>>>>> Hi,
>>>>>
>>>>> I found that there are 2 problems when the DC node is STONITH'ed.
>>>>> (1) The STONITH operation is executed twice.
>>>> This has been discussed at length in bugzilla, see
>>>>
>>>> http://developerbugs.linux-foundation.org/show_bug.cgi?id=1904
>>>>
>>>> which was resolved with WONTFIX. In short, it was deemed too risky
>>>> to implement a remedy for this problem.  Of course, if you think
>>>> you can add more to the discussion, please go ahead.
>>> Sorry, I missed it.
>>
>> Well, you couldn't have known about it :)
>>
>>> Thank you for the pointer!
>>> I understand how it came about.
>>>
>>> Ideally, when the DC node is going to be STONITH'ed,
>>> a new DC would be elected first and that new DC would STONITH the ex-DC;
>>> then these problems would not occur.
>>> But maybe that is not a good approach in an emergency,
>>> because the ex-DC should be STONITH'ed as soon as possible.
>>
>> Yes, you're right about this.
>>
>>> Anyway, I understand this is an expected behavior, thanks!
>>> But then, it seems that the tengine still has to keep a timeout while waiting for
>>> stonithd's result, and a long cluster-delay is still required.
>>
>> If I understood Andrew correctly, the tengine will wait forever,
>> until stonithd sends a message. Or dies which, let's hope, won't
>> happen.
> My understanding is the same as yours.
>
>>
>>> Because the second STONITH is requested when that transition times out.
>>> I'm afraid that I misunderstood the true meaning of what Andrew said.
>>
>> In the bugzilla? If so, please reopen and voice your concerns.
> I asked him again in bugzilla, thanks!
>
>>
>>>>> (2) The timeout for which stonithd on the DC node waits for another node
>>>>>     to reply with the result of a STONITH op is
>>>>>     always set to "stonith-timeout" in <cluster_property_set>.
>>>>> [...]
>>>>> Case (2):
>>>>> When this timeout occurs in stonithd on the DC
>>>>> while a non-DC node's stonithd is still trying to reset the DC,
>>>>> the DC's stonithd will send a request to another node,
>>>>> and two or more STONITH plugins are executed in parallel.
>>>>> This is a troublesome problem.
>>>>> The most suitable value for this timeout might be
>>>>> the sum of the "stonith-timeout" values of the STONITH plugins on the node
>>>>> which is going to receive the STONITH request from the DC node, I think.
>>>> This would probably be very difficult for the CRM to get.
>>> Right, I agree with you.
>>> I meant "it is difficult because stonithd on DC can't know the values of
>>> stonith-timeout on other node." with the following sentence
>>> "But DC node can't know that...".
>>>>> But DC node can't know that...
>>>>> I would like to hear your opinions.
>>>> Sorry, but I couldn't exactly follow. Could you please describe
>>>> it in terms of actions.
>>> Sorry, let me restate what I meant.
>>> The timeout for which stonithd on the DC waits for the reply from the other node's
>>> stonithd really needs to be longer than the sum of the "stonith-timeout" values
>>> of the STONITH plugins on that node.
>>> But it is difficult for the DC's stonithd to get those values.
>>> So I would like to hear your opinion about what would be a suitable and practical
>>> value for this timeout, which is set in insert_into_executing_queue().
>>> I hope this conveys what I wanted to say.
>>
>> OK, I suppose I understand now. You're talking about the timeouts
>> for remote fencing operations, right? And the originating
> Exactly!
>
>> stonithd hasn't got a clue on how long the remote fencing
>> operation may take. Well, that could be a problem. I can't think
>> of anything to resolve that completely, not without "rewiring"
>> stonithd. stonithd broadcasts the request so there's no way for
>> it to know who's doing what and when and how long it can take.
>>
>> The only workaround I can think of is to use the global (cluster
>> property) stonith-timeout which should be set to the maximum sum
>> of stonith timeouts for a node.
> All right.
> I misunderstood the role of the global stonith-timeout.
> I had considered it to be just the default value for each plugin's stonith-timeout,
> just as default-action-timeout is for each operation.
> To use stonith-timeout correctly (without troublesome timeouts),
> we should do the following, right?
>  - set stonith-timeout for every STONITH plugin.
>  - set the global stonith-timeout to the maximum sum of stonith timeouts
>    for a node.
>  - (set cluster-delay to longer than global stonith-timeout,
>     at least at present.)

Right.
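
A minimal sketch of that kind of configuration could look like the
following (the ids and values are purely illustrative, and external/ssh is
just an example; say each node has two STONITH plugins with 180s each, so
the global stonith-timeout has to cover at least 360s):

  <cluster_property_set id="cib-bootstrap-options">
  ...
        <!-- at least the sum of the plugins' stonith-timeouts on one node -->
        <nvpair id="nvpair.example1" name="stonith-timeout" value="400s"/>
        <!-- longer than the global stonith-timeout, at least at present -->
        <nvpair id="nvpair.example2" name="cluster-delay" value="420s"/>
  ...
  <primitive id="prmStonithN1" class="stonith" type="external/ssh">
  ...
	  <!-- per-plugin timeout -->
	  <nvpair id="nvpair.example3" name="stonith-timeout" value="180s"/>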

>> Now, back to reality ;-)  Timeouts are important, of course, but
>> one should usually leave a generous margin on top of the expected
>> duration. For instance, if the normal timeout for an operation on
>> a device is 30 seconds, there's nothing wrong in setting it to
>> say one or two minutes. The consequences of an operation ending
>> prematurely are much more serious than if one waits a bit longer.
>> After all, if there's something really wrong, it is usually
>> detected early and the error reported immediately. Of course,
>> one shouldn't follow this advice blindly. Know your cluster!
> Understood!
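
For instance, for a device that normally completes in about 30 seconds,
a per-plugin setting along these lines (the id and value are, again, only
illustrative) leaves a comfortable margin:

	  <nvpair id="nvpair.example4" name="stonith-timeout" value="120s"/>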
>
>>
>>> For reference, I attached logs from when the aforesaid timeout occurred.
>>> The cluster has 3 nodes.
>>> When the DC was going to be STONITH'ed, the DC sent a request to all of the non-DC nodes,
>>> and all of them tried to shut down the DC.
>>
>> No, the tengine (running on DC) always talks to the local
>> stonithd.
> I meant "stontihd on DC broadcast a request"
> with the sentence "DC sent a request all of non-DC nodes".
> I'm sorry for being ambiguous.
>
>>
>>> And when the timeout occurred in the DC's stonithd, it sent the same request again,
>>> and then two or more STONITH plugins worked in parallel on every non-DC node.
>>> (Please see sysstats.txt.)
>>>
>>> I want to make clear whether the current behavior is expected or a bug.
>>
>> That's actually wrong, but could be considered a configuration
>> problem:
>>
>>   <cluster_property_set id="cib-bootstrap-options">
>>   ...
>>         <nvpair id="nvpair.id2000009" name="stonith-timeout" value="260s"/>
>>   ...
>>   <primitive id="prmStonithN1" class="stonith" type="external/ssh">
>>   ...
>> 	  <nvpair id="nvpair.id2000602" name="stonith-timeout" value="390s"/>
>>
>> The stonithd initiator (the one running on the DC) times out
>> before the remote fencing operation. On retry a second remote
>> fencing operation is started. That's why you see two of them.
> I set these values because I wanted to know what would happen
> when the timeout for a remote fencing op occurs, and I intended to talk about it
> on the mailing list if any curious behavior appeared. ;)
>
>>
>> Anyway, you can open a bugzilla for this, because the stonithd on
>> a remote host should know that there's already one operation
>> running. Unfortunately, I'm busy with more urgent matters right
>> now, so it may take a few weeks until I take a look at it.
>> As usual, patches are welcome :)
> I posted it to bugzilla.
> http://developerbugs.linux-foundation.org/show_bug.cgi?id=1983
> I'm sorry to bother you.

Thanks for filing this.

Cheers,

Dejan

> Best Regards,
> Satomi TANIGUCHI
>
>
>>
>> Thanks,
>>
>> Dejan
>>
>>> But I consider that the root of every problem is that the node which sends the STONITH
>>> request and waits for completion of the op is itself killed.
>>>
>>>
>>> Regards,
>>> Satomi TANIGUCHI
>>>
>>>
>>>> Thanks,
>>>>
>>>> Dejan
>>>>
>>>>> Best Regards,
>>>>> Satomi TANIGUCHI
>>>>
>>>>> _______________________________________________________
>>>>> Linux-HA-Dev: Linux-HA-Dev at lists.linux-ha.org
>>>>> http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
>>>>> Home Page: http://linux-ha.org/
>>>
>>>
>>>
>>>
>>
>>
>>> _______________________________________________
>>> Pacemaker mailing list
>>> Pacemaker at clusterlabs.org
>>> http://list.clusterlabs.org/mailman/listinfo/pacemaker
>>
>>
>
>



