[Pacemaker] About behavior in "Action Lost".

Thu Oct 7 05:48:11 EDT 2010

Andrew,

2010/9/23 Andrew Beekhof <andrew at beekhof.net>:
> Pushed as:
>   http://hg.clusterlabs.org/pacemaker/1.1/rev/8433015faf18
>
> Not sure about applying to 1.0 though, it's a dramatic change in behavior.

I would like to backport this to 1.0.
Would you agree with this?

Without this change, the failed node was not fenced when it should have
been, and the service could not be continued.
I also think it would be good for 1.0 and 1.1 to behave the same way in
such a critical condition, so that both versions are easier to support.
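
For reference, a minimal sketch of the decision being discussed, based only
on the snippet quoted further down. The names lost_action_sends_update,
sketch_action_type and skip_stop are hypothetical and do not appear in
Pacemaker; skip_stop = true models the older behavior where a lost stop was
not recorded as a failure, and skip_stop = false models the proposed one.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical stand-ins for the real action types; not Pacemaker's own. */
enum sketch_action_type { SKETCH_ACTION_RSC, SKETCH_ACTION_OTHER };

/*
 * Returns true if a lost (timed-out) action should be written to the CIB
 * as a failed operation.
 *
 * skip_stop is an assumption made for illustration: when true, lost "stop"
 * actions are not recorded (the older behavior described in the thread);
 * when false, they are treated like any other resource action.
 */
static bool lost_action_sends_update(enum sketch_action_type type,
                                     const char *task,
                                     bool skip_stop)
{
    assert(task != NULL);

    if (type != SKETCH_ACTION_RSC) {
        return false;               /* only resource actions are recorded */
    }
    if (strcmp(task, "cancel") == 0) {
        return false;               /* cancels never need a CIB update */
    }
    if (skip_stop && strcmp(task, "stop") == 0) {
        return false;               /* older behavior: ignore lost stops */
    }
    return true;
}
```

With skip_stop set to false, a lost stop is recorded as a timed-out
operation, which is what would allow the usual handling of a failed stop
(fencing, when STONITH is enabled) to apply.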

Thanks,
Keisuke MORI

>
> On Wed, Sep 22, 2010 at 11:18 AM,  <renayama19661014 at ybb.ne.jp> wrote:
>> Hi Andrew,
>>
>> Thank you for your comment.
>>
>>> A long time ago in a galaxy far away, some messaging layers used to
>>> lose quite a few actions, including stops.
>>> About the same time, we decided that fencing because a stop action was
>>> lost wasn't a good idea.
>>>
>>> The rationale was that if the operation eventually completed, it would
>>> end up in the CIB anyway.
>>> And even if it didn't, the PE would continue to try the operation
>>> again until the whole node fell over at which point it would get shot
>>> anyway.
>>
>> Sorry...
>> I did not know that there had been such a discussion in the past.
>>
>>
>>> Now, having said that, things have improved since then and perhaps, in
>>> the interest of speeding up recovery in these situations, it is time
>>> to stop treating stop operations differently.
>>> Would you agree?
>>
>> Does that mean you will change it this time so that STONITH is carried out when "Action Lost" occurs for a stop?
>> If my understanding is right, I agree too.
>>
>> if(timer->action->type != action_type_rsc) {
>>     send_update = FALSE;
>> } else if(safe_str_eq(task, "cancel")) {
>>     /* we dont need to update the CIB with these */
>>     send_update = FALSE;
>> }
>> ---> delete "else if(safe_str_eq(task, "stop")){..}" ?
>>
>> if(send_update) {
>>     /* cib_action_update(timer->action, LRM_OP_PENDING, EXECRA_STATUS_UNKNOWN); */
>>     cib_action_update(timer->action, LRM_OP_TIMEOUT, EXECRA_UNKNOWN_ERROR);
>> }
>>
>> Best Regards,
>> Hideo Yamauchi.
>>
>> --- Andrew Beekhof <andrew at beekhof.net> wrote:
>>
>>> On Tue, Sep 21, 2010 at 8:59 AM,  <renayama19661014 at ybb.ne.jp> wrote:
>>> > Hi,
>>> >
>>> > The node was under very high load, and we checked the monitor behavior of Pacemaker.
>>> > After a monitor error occurred, "Action Lost" occurred during the stop operation.
>>> >
>>> > Sep  8 20:02:22 cgl54 crmd: [3507]: ERROR: print_elem: Aborting transition, action lost: [Action 9]: In-flight (id: prmApPostgreSQLDB1_stop_0, loc: cgl49, priority: 0)
>>> > Sep  8 20:02:22 cgl54 crmd: [3507]: info: abort_transition_graph: action_timer_callback:486 - Triggered transition abort (complete=0) : Action lost
>>> >
>>> >
>>> > We think the stop operation did not go well because of the load on the node.
>>> > But STONITH is not executed for the node.
>>>
>>> A long time ago in a galaxy far away, some messaging layers used to
>>> lose quite a few actions, including stops.
>>> About the same time, we decided that fencing because a stop action was
>>> lost wasn't a good idea.
>>>
>>> The rationale was that if the operation eventually completed, it would
>>> end up in the CIB anyway.
>>> And even if it didn't, the PE would continue to try the operation
>>> again until the whole node fell over at which point it would get shot
>>> anyway.
>>>
>>> Now, having said that, things have improved since then and perhaps, in
>>> the interest of speeding up recovery in these situations, it is time
>>> to stop treating stop operations differently.
>>> Would you agree?
>>>
>>> _______________________________________________
>>> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
>>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>>
>>> Project Home: http://www.clusterlabs.org
>>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>>> Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
>>>

-- 
Keisuke MORI