[Pacemaker] Orphan problem when creating a clone of a group

Fri Dec 3 05:37:13 EST 2010

Hi,

On Mon, Nov 29, 2010 at 06:08:11PM +0100, Uwe Grawert wrote:
> Quoting Dejan Muhamedagic <dejanmm at fastmail.fm>:
> 
> >Hi,
> >
> >On Mon, Nov 29, 2010 at 02:42:42PM +0100, Uwe Grawert wrote:
> >>Was: Re: [Pacemaker] crm resource restart doesn't restart the
> >>correct resource
> >>
> >>Quoting Dejan Muhamedagic <dejanmm at fastmail.fm>:
> >>
> >>>>This happens because, when the clone is created, Pacemaker
> >>>>stops the primitive but does not wait for the stop action to
> >>>>return before starting the primitive again. That, of course,
> >>>>causes problems.
> >>>
> >>>Hmm, I don't quite understand what is going on. Is that primitive
> >>>part of the group? Can you describe in more detail what is going
> >>>on?
> >>
> >>I have a group (grp_fs) consisting of an LVM resource and several
> >>Filesystem resources, in that order. That group is started and all
> >>resources are running. Now I clone this group by issuing:
> >>
> >>crm configure clone clo_fs grp_fs
> >>
> >>That stops all resources and starts them again as a clone. But
> >>Pacemaker does not seem to wait until the stop action has finished. I
> >>have modified the LVM RA to log the action command issued to the agent
> >>and the value returned by the agent:
> >>
> >>14:24:11 [ 14495 ] Action: start
> >>14:24:11 [ 14494 ] Action: stop
> >>14:24:13 [ 14494 ] RC: 1
> >>14:24:14 [ 14495 ] RC: 0
> >>14:24:14 [ 14599 ] Action: monitor
> >>14:24:14 [ 14599 ] RC: 0
> >>
> >>The PID is shown in brackets. As can be seen, Pacemaker first issues
> >>a start command and then immediately a stop, without waiting for the
> >>first command to return. That produces an orphan resource. As a
> >>result, the state of the LVM resource (which is now cloned) is
> >>uncertain: it may start, but it may also fail.
> >
> >I see. The problem here is that as far as the cluster's
> >concerned, the new resources and the old resources are
> >unrelated: they have different names (before it was, say, lvm1 and
> >now it's lvm1:0). I'm not sure if the crmd/pengine can tell if
> >the resources of the group which are running actually belong to
> >the cloned group as well. Andrew? If not, then we'll have to
> >forbid creating a clone of running resources in the shell.
> 
> OK, if cloning a running resource is going to be forbidden, there
> is a problem with groups. A stopped primitive gets its target-role
> property cleared when it is cloned. A group does not! If I stop a
> group, make a clone and try to start the clone, nothing happens
> until target-role="stopped" is cleared manually from the CIB.
> Stopping a primitive in that group (say the first one) has the
> same effect. As long as some resource or group in the clone has the
> target-role property set, nothing will happen.

That bug was fixed yesterday in the 1.1 repository:

changeset:   10433:e99aa3451ce7
user:        Dejan Muhamedagic <dejan at hello-penguin.com>
date:        Thu Dec 02 16:52:37 2010 +0100
summary:     Medium: Shell: repair management of cloned groups
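
For anyone still running a shell without that changeset, the manual
workaround described above amounts to removing every target-role
nvpair from the group and from the primitives inside it. What follows
is only a rough offline sketch of that idea on a CIB fragment, not how
the shell does it. The XML fragment and the nvpair ids are made up for
illustration (grp_fs and lvm1 are the names used in this thread), and
clear_target_role is a hypothetical helper, not part of Pacemaker.

```python
import xml.etree.ElementTree as ET

# Hypothetical CIB fragment: a stopped group with one stopped primitive.
CIB_FRAGMENT = """
<resources>
  <group id="grp_fs">
    <meta_attributes id="grp_fs-meta_attributes">
      <nvpair id="grp_fs-meta_attributes-target-role" name="target-role" value="Stopped"/>
    </meta_attributes>
    <primitive id="lvm1" class="ocf" provider="heartbeat" type="LVM">
      <meta_attributes id="lvm1-meta_attributes">
        <nvpair id="lvm1-meta_attributes-target-role" name="target-role" value="Stopped"/>
      </meta_attributes>
    </primitive>
  </group>
</resources>
"""


def clear_target_role(cib_xml, resource_id):
    """Drop target-role nvpairs from resource_id and everything below it.

    Returns the modified XML as a string and the number of nvpairs removed.
    """
    root = ET.fromstring(cib_xml.strip())
    removed = 0
    # Snapshot the element list first so removals do not disturb iteration.
    for rsc in list(root.iter()):
        if rsc.get("id") != resource_id:
            continue
        for meta in list(rsc.iter("meta_attributes")):
            for nv in list(meta.findall("nvpair")):
                if nv.get("name") == "target-role":
                    meta.remove(nv)
                    removed += 1
    return ET.tostring(root, encoding="unicode"), removed


if __name__ == "__main__":
    new_xml, count = clear_target_role(CIB_FRAGMENT, "grp_fs")
    print("removed", count, "target-role nvpair(s)")
    print(new_xml)
```

On a live cluster the same change would be made through the shell or
the CIB tools rather than by editing XML by hand; the sketch only shows
which attributes have to go.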

Thanks for reporting.
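
In case it helps when checking other agents, the overlap in the LVM RA
log quoted above can be spotted mechanically. This is a minimal sketch
that assumes the log format is exactly the one shown in this thread
(time, PID in brackets, then "Action:" or "RC:"); find_overlaps is a
hypothetical helper name. It reports every action that was issued
while an earlier action had not yet returned.

```python
import re

# The RA log lines exactly as quoted earlier in this thread.
LOG = """\
14:24:11 [ 14495 ] Action: start
14:24:11 [ 14494 ] Action: stop
14:24:13 [ 14494 ] RC: 1
14:24:14 [ 14495 ] RC: 0
14:24:14 [ 14599 ] Action: monitor
14:24:14 [ 14599 ] RC: 0
"""

LINE = re.compile(r"^(\d{2}:\d{2}:\d{2}) \[ (\d+) \] (Action|RC): (\S+)$")


def find_overlaps(log):
    """Return (running_action, new_action) pairs that overlapped in time."""
    running = {}   # pid -> action that has been issued but has not returned
    overlaps = []
    for raw in log.splitlines():
        match = LINE.match(raw.strip())
        if not match:
            continue  # ignore anything that is not an Action/RC line
        _time, pid, kind, value = match.groups()
        if kind == "Action":
            # Any action still in flight overlaps with the new one.
            for other in running.values():
                overlaps.append((other, value))
            running[pid] = value
        else:
            # An RC line means the action with this PID has returned.
            running.pop(pid, None)
    return overlaps


if __name__ == "__main__":
    for first, second in find_overlaps(LOG):
        print(f"'{second}' was issued before '{first}' returned")
```

For the log above this flags the start/stop pair and nothing else; the
later monitor only begins after both earlier actions have returned.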

Cheers,

Dejan
