[Pacemaker] When STONITH is not completed, a resource starts.

Thu Jan 15 07:54:16 EST 2009

On Jan 15, 2009, at 1:55 AM, <renayama19661014 at ybb.ne.jp> <renayama19661014 at ybb.ne.jp> wrote:

> Hi Andrew,
>
>>> This is about the case where STONITH is carried out by a standby
>>> node in a two-node environment.
>>>
>>> When STONITH is issued from a DC node, a resource is started
>>> without waiting for STONITH to complete.
>>> This problem happens if an active node goes down while STONITH
>>> has not completed.
>>
>> So let me see if I understand this correctly...
>>
>> You start with two healthy nodes.
> Yes.
>
>>
>> You cause a resource on A to fail, at which point B tries to shoot it.
> Yes.
>
>>
>> The stonith op never completes and before it times out, you restart B.
> No.
> It is node A that is rebooted.
> - Node A is the one that node B is going to shoot.

Ah!
Can you log a bug for this please?

>
>
>>
>> Resources get started on B.
> Yes.
> The dummy resource is started when node B is the DC.
> When node B is not the DC, it is not started.
>
>>
>> Questions:
>>
>> Is the above accurate?
>> Is only the dummy resource started, or are other ones started too?
> Yes.

There were two alternatives in that question, so the answer can't be
"yes" :)

>
>
>> When B comes up again, does it form a two-node cluster with A?
>> Is A still up or has it become the DC and shot itself?
> I did not check the state after node A rebooted.
>
>> Sorry, parsing error... I can't tell if you're saying the problem also
>> exists for clusters based on OpenAIS.
>> I think you're saying it does not happen if you use OpenAIS instead of
>> Heartbeat.
> Yes.
> The same problem does not occur in OpenAIS.

Excellent

>
>
> In OpenAIS, the transition that starts the dummy resource seems to be
> stopped after the partner node disappears.
>
> -----------------------------------------------------------------
> Jan  9 13:34:30 ais-1 crmd: [16497]: info: ais_status_callback: status: ais-2 is now lost (was member)
> Jan  9 13:34:30 ais-1 crmd: [16497]: info: crm_update_peer: Node ais-2: id=1234 state=lost (new) addr=r(0) ip(192.168.70.60) r(1) ip(192.168.80.61)  votes=1 born=3556 seen=3556 proc=00000000000000000000000000053312
> Jan  9 13:34:30 ais-1 crmd: [16497]: notice: crm_calculate_quorum: Membership 10: quorum lost
> Jan  9 13:34:30 ais-1 crmd: [16497]: info: erase_node_from_join: Removed node ais-2 from join calculations: welcomed=0 itegrated=0 finalized=0 confirmed=1
> Jan  9 13:34:30 ais-1 cib: [16493]: info: ais_dispatch: Processing membership 3560
> Jan  9 13:34:30 ais-1 cib: [16493]: info: crm_update_peer: Node ais-2: id=1234 state=lost (new) addr=r(0) ip(192.168.70.60) r(1) ip(192.168.80.61)  votes=1 born=3556 seen=3556 proc=00000000000000000000000000053312
> Jan  9 13:34:30 ais-1 crmd: [16497]: info: crm_update_quorum: Updating quorum status to false (call=53)
> Jan  9 13:34:30 ais-1 openais[16486]: [TOTEM] entering GATHER state from 0.
> Jan  9 13:34:30 ais-1 cib: [16493]: notice: crm_calculate_quorum: Membership 0: quorum lost
> Jan  9 13:34:30 ais-1 openais[16486]: [TOTEM] Creating commit token because I am the rep.
> Jan  9 13:34:30 ais-1 openais[16486]: [TOTEM] Saving state aru 94 high seq received 94
> Jan  9 13:34:30 ais-1 openais[16486]: [TOTEM] Storing new sequence id for ring de8
> Jan  9 13:34:30 ais-1 cib: [16493]: info: cib_process_request: Operation complete: op cib_modify for section nodes (origin=local/crmd/51): ok (rc=0)
> Jan  9 13:34:30 ais-1 openais[16486]: [TOTEM] entering COMMIT state.
> Jan  9 13:34:30 ais-1 cib: [16493]: info: cib_config_changed: Attr changes
> Jan  9 13:34:30 ais-1 openais[16486]: [TOTEM] entering RECOVERY state.
> Jan  9 13:34:30 ais-1 cib: [16493]: info: log_data_element: cib:diff: - <cib have-quorum="1" admin_epoch="0" epoch="1003" num_updates="10" />
> Jan  9 13:34:30 ais-1 openais[16486]: [TOTEM] position [0] member 192.168.70.50:
> Jan  9 13:34:30 ais-1 cib: [16493]: info: log_data_element: cib:diff: + <cib have-quorum="0" admin_epoch="0" epoch="1004" num_updates="1" />
> Jan  9 13:34:30 ais-1 openais[16486]: [TOTEM] previous ring seq 3556 rep 192.168.70.50
> Jan  9 13:34:30 ais-1 cib: [16493]: info: cib_process_request: Operation complete: op cib_modify for section cib (origin=local/crmd/53): ok (rc=0)
> Jan  9 13:34:30 ais-1 openais[16486]: [TOTEM] aru 94 high delivered 94 received flag 1
> Jan  9 13:34:30 ais-1 cib: [16493]: info: cib_process_request: Operation complete: op cib_modify for section nodes (origin=local/crmd/54): ok (rc=0)
> Jan  9 13:34:30 ais-1 openais[16486]: [TOTEM] Did not need to originate any messages in recovery.
> Jan  9 13:34:30 ais-1 openais[16486]: [TOTEM] Sending initial ORF token
> Jan  9 13:34:30 ais-1 crmd: [16497]: info: abort_transition_graph: need_abort:60 - Triggered transition abort (complete=0) : Non-status change
> -----------------------------------------------------------------
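
For anyone triaging logs like the excerpt above, here is a rough sketch (not part of Pacemaker or hb_report) that flags resource start actions logged after a peer was marked lost. The regular expressions are assumptions based only on the log lines quoted in this thread, and the function name summarize_fencing_log is hypothetical. It expects one log message per line, with mail wrapping and quote markers already removed.

```python
import re

# Patterns are assumptions derived from the log lines quoted in this
# thread; other Pacemaker versions may word these messages differently.
PEER_LOST = re.compile(r"crm_update_peer: Node (\S+): .*state=lost")
ABORT = re.compile(r"abort_transition_graph:.*Triggered transition abort")
START = re.compile(r"send_rsc_command: Initiating action \d+: start (\S+) on (\S+)")


def summarize_fencing_log(lines):
    """Hypothetical helper: report lost peers, aborts, and later starts."""
    lost_peers = []         # nodes reported as state=lost, in order seen
    aborted = False         # True once a transition abort is logged
    starts_after_loss = []  # (action, node) starts logged after a peer loss
    for line in lines:
        match = PEER_LOST.search(line)
        if match:
            if match.group(1) not in lost_peers:
                lost_peers.append(match.group(1))
            continue
        if ABORT.search(line):
            aborted = True
            continue
        match = START.search(line)
        if match and lost_peers:
            starts_after_loss.append((match.group(1), match.group(2)))
    return {
        "lost_peers": lost_peers,
        "aborted": aborted,
        "starts_after_loss": starts_after_loss,
    }
```

A non-empty starts_after_loss is only a hint to look closer: this sketch does not track whether fencing completed in between, so it cannot prove that a start happened before STONITH finished.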
>
> Best Regards,
> Hideo Yamauchi.
>
>
> --- Andrew Beekhof <beekhof at gmail.com> wrote:
>
>> On Wed, Jan 14, 2009 at 09:59,  <renayama19661014 at ybb.ne.jp> wrote:
>>> Hi,
>>>
>>>>> 1) I set things up so that a resource starts on a standby node.
>>>>> 2) I change the dummy resource so that a stop error occurs.
>>>>> 3) I trigger a monitor error of the dummy resource on the standby node.
>>>>> 4) After the stop error, STONITH is carried out by the partner node.
>>>>> 5) I make the STONITH from a standby node wait, so that it does not complete.
>>>>> 6) While STONITH is not completed, I reboot a standby node.
>>>>
>>>> Is this in a two-node cluster?
>>> Yes.
>>>
>>>>> Though STONITH from a DC node does not succeed, a resource is started.
>>>>> When STONITH did not succeed, the resource was not started at a non-DC node.
>>>>
>>>> I don't understand what you're saying here.
>>>> The first statement says a resource was started and the second says it
>>>> wasn't... they can't both be true.
>>>
>>> I'm sorry.
>>> That caused a misunderstanding.
>>>
>>> This is about the case where STONITH is carried out by a standby
>>> node in a two-node environment.
>>>
>>> When STONITH is issued from a DC node, a resource is started
>>> without waiting for STONITH to complete.
>>> This problem happens if an active node goes down while STONITH
>>> has not completed.
>>
>> So let me see if I understand this correctly...
>>
>> You start with two healthy nodes.
>>
>> You cause a resource on A to fail, at which point B tries to shoot it.
>>
>> The stonith op never completes and before it times out, you restart B.
>>
>> Resources get started on B.
>>
>> Questions:
>>
>> Is the above accurate?
>> Is only the dummy resource started, or are other ones started too?
>> When B comes up again, does it form a two-node cluster with A?
>> Is A still up or has it become the DC and shot itself?
>>
>>>
>>> I confirmed the same confirmation based on OpenAIS.
>>> However, in OpenAIS, the same problem did not occur.
>>> In OpenAIS, the start of the resource is evaded well.
>>
>> Sorry, parsing error... I can't tell if you're saying the problem also
>> exists for clusters based on OpenAIS.
>> I think you're saying it does not happen if you use OpenAIS instead of
>> Heartbeat.
>>
>>>
>>> --- Andrew Beekhof <beekhof at gmail.com> wrote:
>>>
>>>>
>>>> On Jan 14, 2009, at 2:52 AM, <renayama19661014 at ybb.ne.jp> <renayama19661014 at ybb.ne.jp> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I tested the behavior of STONITH.
>>>>> (heartbeat 2.99.2 + Pacemaker-1-0-6fd0eebd186e.tar.gz on
>>>>> RHEL5.2 (i386 VM))
>>>>>
>>>>> I checked what happens when STONITH is carried out from a DC node
>>>>> and from a non-DC node.
>>>>>
>>>>> I tested it with the following steps.
>>>>>
>>>>> 1) I set things up so that a resource starts on a standby node.
>>>>> 2) I change the dummy resource so that a stop error occurs.
>>>>> 3) I trigger a monitor error of the dummy resource on the standby node.
>>>>> 4) After the stop error, STONITH is carried out by the partner node.
>>>>> 5) I make the STONITH from a standby node wait, so that it does not complete.
>>>>> 6) While STONITH is not completed, I reboot a standby node.
>>>>
>>>> Is this in a two-node cluster?
>>>>
>>>>> I checked the log.
>>>>
>>>>>
>>>>> Though STONITH from a DC node does not succeed, a resource is started.
>>>>> When STONITH did not succeed, the resource was not started at a non-DC node.
>>>>
>>>> I don't understand what you're saying here.
>>>> The first statement says a resource was started and the second says it
>>>> wasn't... they can't both be true.
>>>>
>>>>>
>>>>>
>>>>> ---------------------------------------------------------------------------
>>>>> Jan 13 16:01:25 ais-1 crmd: [6003]: info: send_rsc_command: Initiating action 7: start prmDummy1_start_0 on ais-1
>>>>> ---------------------------------------------------------------------------
>>>>>
>>>>> I expected that the resource would not start when STONITH did not
>>>>> succeed.
>>>>> Isn't the behavior when STONITH from a DC node fails a problem?
>>>>>
>>>>> I have attached the hb_report results.
>>>>> - stonith_exec_dc.tar.gz (the result when STONITH was carried out by a DC node (ais-1))
>>>>> - stonith_exec_nodc.tar.gz (the result when STONITH was carried out by a non-DC node (ais-1))
>>
>> _______________________________________________
>> Pacemaker mailing list
>> Pacemaker at oss.clusterlabs.org
>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>