[Pacemaker] About behavior in "Action Lost".

Thu Oct 7 06:12:40 EDT 2010

On Thu, Oct 7, 2010 at 11:48 AM, Keisuke MORI <keisuke.mori+ha at gmail.com> wrote:
> Andrew,
>
> 2010/9/23 Andrew Beekhof <andrew at beekhof.net>:
>> Pushed as:
>>   http://hg.clusterlabs.org/pacemaker/1.1/rev/8433015faf18
>>
>> Not sure about applying to 1.0 though, it's a dramatic change in behavior.
>
> I would like to backport this to 1.0.
> Would you agree with this?

I would prefer not to, but if it is important to you then I will agree.

>
> Without this change, the failed node was not fenced when it should have
> been, and the service could not be continued.
> I would also think that it would be good to have the same behavior
> between 1.0 and 1.1 in such a critical condition to support both
> versions better.
>
> Thanks,
> Keisuke MORI
>
>>
>> On Wed, Sep 22, 2010 at 11:18 AM,  <renayama19661014 at ybb.ne.jp> wrote:
>>> Hi Andrew,
>>>
>>> Thank you for your comment.
>>>
>>>> A long time ago in a galaxy far away, some messaging layers used to
>>>> lose quite a few actions, including stops.
>>>> About the same time, we decided that fencing because a stop action was
>>>> lost wasn't a good idea.
>>>>
>>>> The rationale was that if the operation eventually completed, it would
>>>> end up in the CIB anyway.
>>>> And even if it didn't, the PE would continue to try the operation
>>>> again until the whole node fell over at which point it would get shot
>>>> anyway.
>>>
>>> Sorry...
>>> I did not know that there had been such a discussion in the old days.
>>>
>>>
>>>> Now, having said that, things have improved since then and perhaps,
>>>> in the interest of speeding up recovery in these situations, it is time
>>>> to stop treating stop operations differently.
>>>> Would you agree?
>>>
>>> Does that mean you will change it so that STONITH is carried out when a stop hits "Action Lost"?
>>> If my understanding is right, I agree too.
>>>
>>>     if(timer->action->type != action_type_rsc) {
>>>         send_update = FALSE;
>>>     } else if(safe_str_eq(task, "cancel")) {
>>>         /* we dont need to update the CIB with these */
>>>         send_update = FALSE;
>>>     }
>>>     ---> delete "else if(safe_str_eq(task, "stop")){..}" ?
>>>
>>>     if(send_update) {
>>>         /* cib_action_update(timer->action, LRM_OP_PENDING, EXECRA_STATUS_UNKNOWN); */
>>>         cib_action_update(timer->action, LRM_OP_TIMEOUT, EXECRA_UNKNOWN_ERROR);
>>>     }
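
The snippet above can be sketched as a small, self-contained decision function: given a lost action, should it be recorded in the CIB as a timed-out failure? This is only an illustration of the question being asked, not Pacemaker's actual code. The names `lost_action`, `should_record_lost_action` and `stop_is_special` are hypothetical. The thread elides the body of the "stop" branch, so the sketch assumes that branch suppressed the update, which matches the description that a lost stop was not treated as a reason to fence.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical stand-ins for the crmd types; NOT the real Pacemaker definitions. */
enum action_type { ACTION_TYPE_PSEUDO, ACTION_TYPE_RSC, ACTION_TYPE_CRM };

struct lost_action {
    enum action_type type;  /* only resource actions are ever recorded */
    const char *task;       /* e.g. "start", "stop", "monitor", "cancel" */
};

/*
 * Decide whether a lost action should be written to the CIB as a failure.
 *
 * stop_is_special == true  models the old behaviour (ASSUMPTION: the elided
 *                          "stop" branch set send_update to FALSE).
 * stop_is_special == false models the proposal in this thread: a lost stop
 *                          is recorded like any other lost resource action.
 */
static bool should_record_lost_action(const struct lost_action *a,
                                      bool stop_is_special)
{
    if (a->type != ACTION_TYPE_RSC) {
        return false;               /* not a resource action */
    }
    if (strcmp(a->task, "cancel") == 0) {
        return false;               /* cancels never need a CIB update */
    }
    if (stop_is_special && strcmp(a->task, "stop") == 0) {
        return false;               /* old behaviour: lost stop is ignored */
    }
    return true;                    /* record it as a timed-out failure */
}
```

With a resource action whose task is "stop", the function returns false when `stop_is_special` is true and true when it is false. Lost "cancel" actions and non-resource actions are skipped in both cases.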
>>>
>>> Best Regards,
>>> Hideo Yamauchi.
>>>
>>> --- Andrew Beekhof <andrew at beekhof.net> wrote:
>>>
>>>> On Tue, Sep 21, 2010 at 8:59 AM,  <renayama19661014 at ybb.ne.jp> wrote:
>>>> > Hi,
>>>> >
>>>> > A node was under very high load, and we checked how Pacemaker's monitor operation behaved.
>>>> > An "Action Lost" occurred for the stop operation that followed the monitor failure.
>>>> >
>>>> > Sep  8 20:02:22 cgl54 crmd: [3507]: ERROR: print_elem: Aborting transition, action lost: [Action 9]: In-flight (id: prmApPostgreSQLDB1_stop_0, loc: cgl49, priority: 0)
>>>> > Sep  8 20:02:22 cgl54 crmd: [3507]: info: abort_transition_graph: action_timer_callback:486 - Triggered transition abort (complete=0) : Action lost
>>>> >
>>>> >
>>>> > We think the stop operation did not complete properly because of the load on the node.
>>>> > However, STONITH was not executed against the node.
>>>>
>>>> A long time ago in a galaxy far away, some messaging layers used to
>>>> lose quite a few actions, including stops.
>>>> About the same time, we decided that fencing because a stop action was
>>>> lost wasn't a good idea.
>>>>
>>>> The rationale was that if the operation eventually completed, it would
>>>> end up in the CIB anyway.
>>>> And even if it didn't, the PE would continue to try the operation
>>>> again until the whole node fell over at which point it would get shot
>>>> anyway.
>>>>
>>>> Now, having said that, things have improved since then and perhaps,
>>>> in the interest of speeding up recovery in these situations, it is time
>>>> to stop treating stop operations differently.
>>>> Would you agree?
>>>>
>>>> _______________________________________________
>>>> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
>>>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>>>
>>>> Project Home: http://www.clusterlabs.org
>>>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>>>> Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
>>>>
>>>
>>>
>>
>
>
>
> --
> Keisuke MORI
>