[Pacemaker] When STONITH is not completed, a resource starts.

Wed Jan 14 19:55:29 EST 2009

Hi Andrew,

> > It is time when STONITH is carried out in the environment of two nodes by a standby node.
> >
> > A resource is started without waiting for completion of STONITH from a DC node.
> > While STONITH is not completed, this problem happens if an active node fell.
> 
> So let me see if I understand this correctly...
> 
> You start with two healthy nodes.
Yes.

> 
> You cause a resource on A to fail, at which point B tries to shoot it.
Yes.

> 
> The stonith op never completes and before it times out, you restart B.
No.
It is node A that I reboot.
- Node A is the node that node B is trying to shoot.

> 
> Resources get started on B.
Yes.
The dummy resource is started on node B when node B is the DC.
When node B is not the DC, it is not started.

> 
> Questions:
> 
> Is the above accurate?
> Is only the dummy resource started, or are other ones started too?
Yes.

> When B comes up again, does it form a two-node cluster with A?
> Is A still up or has it become the DC and shot itself?
I have not checked the state after node A rebooted.

> Sorry, parsing error... I can't tell if you're saying the problem also
> exists for clusters based on OpenAIS.
> I think you're saying it does not happen if you use OpenAIS instead of
> Heartbeat.
Yes.
The same problem does not occur in OpenAIS.

With OpenAIS, the transition that would start the dummy resource appears to be aborted after the
partner node disappears.

-----------------------------------------------------------------
Jan  9 13:34:30 ais-1 crmd: [16497]: info: ais_status_callback: status: ais-2 is now lost (was member)
Jan  9 13:34:30 ais-1 crmd: [16497]: info: crm_update_peer: Node ais-2: id=1234 state=lost (new) addr=r(0) ip(192.168.70.60) r(1) ip(192.168.80.61)  votes=1 born=3556 seen=3556 proc=00000000000000000000000000053312
Jan  9 13:34:30 ais-1 crmd: [16497]: notice: crm_calculate_quorum: Membership 10: quorum lost
Jan  9 13:34:30 ais-1 crmd: [16497]: info: erase_node_from_join: Removed node ais-2 from join calculations: welcomed=0 itegrated=0 finalized=0 confirmed=1
Jan  9 13:34:30 ais-1 cib: [16493]: info: ais_dispatch: Processing membership 3560
Jan  9 13:34:30 ais-1 cib: [16493]: info: crm_update_peer: Node ais-2: id=1234 state=lost (new) addr=r(0) ip(192.168.70.60) r(1) ip(192.168.80.61)  votes=1 born=3556 seen=3556 proc=00000000000000000000000000053312
Jan  9 13:34:30 ais-1 crmd: [16497]: info: crm_update_quorum: Updating quorum status to false (call=53)
Jan  9 13:34:30 ais-1 openais[16486]: [TOTEM] entering GATHER state from 0.
Jan  9 13:34:30 ais-1 cib: [16493]: notice: crm_calculate_quorum: Membership 0: quorum lost
Jan  9 13:34:30 ais-1 openais[16486]: [TOTEM] Creating commit token because I am the rep.
Jan  9 13:34:30 ais-1 openais[16486]: [TOTEM] Saving state aru 94 high seq received 94
Jan  9 13:34:30 ais-1 openais[16486]: [TOTEM] Storing new sequence id for ring de8
Jan  9 13:34:30 ais-1 cib: [16493]: info: cib_process_request: Operation complete: op cib_modify for section nodes (origin=local/crmd/51): ok (rc=0)
Jan  9 13:34:30 ais-1 openais[16486]: [TOTEM] entering COMMIT state.
Jan  9 13:34:30 ais-1 cib: [16493]: info: cib_config_changed: Attr changes
Jan  9 13:34:30 ais-1 openais[16486]: [TOTEM] entering RECOVERY state.
Jan  9 13:34:30 ais-1 cib: [16493]: info: log_data_element: cib:diff: - <cib have-quorum="1" admin_epoch="0" epoch="1003" num_updates="10" />
Jan  9 13:34:30 ais-1 openais[16486]: [TOTEM] position [0] member 192.168.70.50:
Jan  9 13:34:30 ais-1 cib: [16493]: info: log_data_element: cib:diff: + <cib have-quorum="0" admin_epoch="0" epoch="1004" num_updates="1" />
Jan  9 13:34:30 ais-1 openais[16486]: [TOTEM] previous ring seq 3556 rep 192.168.70.50
Jan  9 13:34:30 ais-1 cib: [16493]: info: cib_process_request: Operation complete: op cib_modify for section cib (origin=local/crmd/53): ok (rc=0)
Jan  9 13:34:30 ais-1 openais[16486]: [TOTEM] aru 94 high delivered 94 received flag 1
Jan  9 13:34:30 ais-1 cib: [16493]: info: cib_process_request: Operation complete: op cib_modify for section nodes (origin=local/crmd/54): ok (rc=0)
Jan  9 13:34:30 ais-1 openais[16486]: [TOTEM] Did not need to originate any messages in recovery.
Jan  9 13:34:30 ais-1 openais[16486]: [TOTEM] Sending initial ORF token
Jan  9 13:34:30 ais-1 crmd: [16497]: info: abort_transition_graph: need_abort:60 - Triggered transition abort (complete=0) : Non-status change
-----------------------------------------------------------------
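
When comparing the Heartbeat and OpenAIS logs, it may help to pull out only the quorum, transition-abort and resource-start messages. The following is a rough sketch, not a Pacemaker tool: the regular expression, the `INTERESTING` set and the `scan` function are assumptions based only on the log lines quoted in this thread, and it expects each log entry on a single line.

```python
import re

# Assumed syslog layout, inferred from the excerpt in this thread:
#   "Mon DD HH:MM:SS host daemon: [pid]: level: function: message"
EVENT = re.compile(
    r"^(?P<ts>\w{3}\s+\d+ \d\d:\d\d:\d\d) (?P<host>\S+) (?P<daemon>\w+): "
    r"\[\d+\]: \w+: (?P<func>\w+): (?P<msg>.*)$"
)

# Function names seen in the quoted logs (a hypothetical selection).
INTERESTING = {
    "crm_calculate_quorum",
    "crm_update_quorum",
    "abort_transition_graph",
    "send_rsc_command",
}

def scan(lines):
    """Return (timestamp, daemon, function, message) for lines of interest."""
    events = []
    for line in lines:
        m = EVENT.match(line.rstrip())
        if m and m.group("func") in INTERESTING:
            events.append(
                (m.group("ts"), m.group("daemon"), m.group("func"), m.group("msg"))
            )
    return events

sample = [
    "Jan  9 13:34:30 ais-1 crmd: [16497]: notice: crm_calculate_quorum: Membership 10: quorum lost",
    "Jan  9 13:34:30 ais-1 openais[16486]: [TOTEM] entering GATHER state from 0.",
    "Jan  9 13:34:30 ais-1 crmd: [16497]: info: abort_transition_graph: need_abort:60 - Triggered transition abort (complete=0) : Non-status change",
]

for event in scan(sample):
    print(event)
```

If a `send_rsc_command` start message appears with no preceding `abort_transition_graph` after the peer is lost, that would match the behavior reported here for Heartbeat.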

Best Regards,
Hideo Yamauchi.

--- Andrew Beekhof <beekhof at gmail.com> wrote:

> On Wed, Jan 14, 2009 at 09:59,  <renayama19661014 at ybb.ne.jp> wrote:
> > Hi,
> >
> >> > 1)I make it the state that a resource starts in a standby node.
> >> > 2)I change it so that a stop error occurs in a dummy resource.
> >> > 3)I generate the monitor error of the dummy resource in a standby
> >> > node.
> >> > 4)After a stop error, STONITH is carried out by a partner node.
> >> > 5)Keep STONITH from a standby node waiting.
> >> > 6)While STONITH is not completed, I reboot a standby node.
> >>
> >> Is this in a two-node cluster?
> > Yes.
> >
> >> > Though STONITH from a DC node does not succeed, a resource is started.
> >> > When STONITH did not succeed, the resource was not started at a non-
> >> > DC node.
> >>
> >> I don't understand what you're saying here.
> >> The first statement says a resource was started and the second says it
> >> wasn't... they can't both be true.
> >
> > I'm sorry.
> > It caused misunderstanding.
> >
> > It is time when STONITH is carried out in the environment of two nodes by a standby node.
> >
> > A resource is started without waiting for completion of STONITH from a DC node.
> > While STONITH is not completed, this problem happens if an active node fell.
> 
> So let me see if I understand this correctly...
> 
> You start with two healthy nodes.
> 
> You cause a resource on A to fail, at which point B tries to shoot it.
> 
> The stonith op never completes and before it times out, you restart B.
> 
> Resources get started on B.
> 
> Questions:
> 
> Is the above accurate?
> Is only the dummy resource started, or are other ones started too?
> When B comes up again, does it form a two-node cluster with A?
> Is A still up or has it become the DC and shot itself?
> 
> >
> > I confirmed the same confirmation based on OpenAIS.
> > However, in OpenAIS, the same problem did not occur.
> > In OpenAIS, the start of the resource is evaded well.
> 
> Sorry, parsing error... I can't tell if you're saying the problem also
> exists for clusters based on OpenAIS.
> I think you're saying it does not happen if you use OpenAIS instead of
> Heartbeat.
> 
> >
> > --- Andrew Beekhof <beekhof at gmail.com> wrote:
> >
> >>
> >> On Jan 14, 2009, at 2:52 AM, <renayama19661014 at ybb.ne.jp> <renayama19661014 at ybb.ne.jp
> >>  > wrote:
> >>
> >> > Hi,
> >> >
> >> > About movement of STONITH, I tested it.
> >> > (heartbeat 2.99.2 + Pacemaker-1-0-6fd0eebd186e.tar.gz on
> >> > RHEL5.2(i386VM))
> >> >
> >> > When what I confirmed carries out STONITH from a DC node and a non-
> >> > DC node.
> >> >
> >> > I confirmed it in the next flow.
> >> >
> >> > 1)I make it the state that a resource starts in a standby node.
> >> > 2)I change it so that a stop error occurs in a dummy resource.
> >> > 3)I generate the monitor error of the dummy resource in a standby
> >> > node.
> >> > 4)After a stop error, STONITH is carried out by a partner node.
> >> > 5)Keep STONITH from a standby node waiting.
> >> > 6)While STONITH is not completed, I reboot a standby node.
> >>
> >> Is this in a two-node cluster?
> >>
> >> > I watched log.
> >>
> >> >
> >> > Though STONITH from a DC node does not succeed, a resource is started.
> >> > When STONITH did not succeed, the resource was not started at a non-
> >> > DC node.
> >>
> >> I don't understand what you're saying here.
> >> The first statement says a resource was started and the second says it
> >> wasn't... they can't both be true.
> >>
> >> >
> >> >
> >> > ---------------------------------------------------------------------------
> >> > Jan 13 16:01:25 ais-1 crmd: [6003]: info: send_rsc_command:
> >> > Initiating action 7: start
> >> > prmDummy1_start_0 on ais-1
> >> > ---------------------------------------------------------------------------
> >> >
> >> > When STONITH did not succeed, I thought that the resource did not
> >> > start.
> >> > Does not the behavior when STONITH failed from a DC node have a
> >> > problem?
> >> >
> >> > I attach a result of hb_report.
> >> > - stonith_exec_dc.tar.gz (A result when STONITH was carried out by a
> >> > DC node(ais-1))
> >> > - stonith_exec_nodc.tar.gz(A result when STONITH was carried out by
> >> > a non-DC node(ais-1))
> 