[Pacemaker] Orphan problem when creating a clone of a group

Mon Nov 29 12:08:11 EST 2010

Quoting Dejan Muhamedagic <dejanmm at fastmail.fm>:

> Hi,
>
> On Mon, Nov 29, 2010 at 02:42:42PM +0100, Uwe Grawert wrote:
>> Was: Re: [Pacemaker] crm resource restart doesn't restart the  
>> correct resource
>>
>> Quoting Dejan Muhamedagic <dejanmm at fastmail.fm>:
>>
>>>> This happens because, when the clone is created, pacemaker
>>>> stops the primitive but does not wait for the stop action to
>>>> return, and just starts the primitive over. And that of course
>>>> causes problems.
>>>
>>> Hmm, I don't quite understand what is going on. Is that primitive
>>> part of the group? Can you describe in more detail what is
>>> happening?
>>
>> I have a group (grp_fs) consisting of an LVM resource and several
>> Filesystem resources, in that order. The group is started and all its
>> resources are running. I then clone this group by issuing:
>>
>> crm configure clone clo_fs grp_fs
>>
>> That stops all resources and starts them again as a clone. But
>> Pacemaker does not seem to wait until the stop action has finished. I
>> have modified the LVM RA to log the action issued to the agent and
>> the value returned by the agent:
>>
>> 14:24:11 [ 14495 ] Action: start
>> 14:24:11 [ 14494 ] Action: stop
>> 14:24:13 [ 14494 ] RC: 1
>> 14:24:14 [ 14495 ] RC: 0
>> 14:24:14 [ 14599 ] Action: monitor
>> 14:24:14 [ 14599 ] RC: 0
>>
>> The PID is shown in brackets. As can be seen, Pacemaker first issues a
>> start command and then immediately a stop, without waiting for the
>> first command to return. That produces an orphan resource. As a
>> result, the state of the LVM resource (which is now cloned) is
>> uncertain: it may start, but it may also fail.
>
> I see. The problem here is that as far as the cluster's
> concerned, the new resources and the old resources are
> unrelated: they have different names (before it was, say, lvm1 and
> now it's lvm1:0). I'm not sure if the crmd/pengine can tell if
> the resources of the group which are running actually belong to
> the cloned group as well. Andrew? If not, then we'll have to
> forbid creating a clone of running resources in the shell.

OK, if cloning a running resource is going to be forbidden, there is
a problem with groups. A stopped primitive gets its target-role
property cleared when it is cloned. A group does not! If I stop a
group, make a clone of it, and try to start the clone, nothing happens
until target-role="stopped" is cleared manually from the CIB.
Stopping a primitive in that group (say, the first one) has the same
effect. As long as some resource or group in the clone has the
target-role property set, nothing will happen.
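
To see which resources inside the clone still carry a target-role, the resources section of the CIB can be scanned for the matching nvpair. The following is a rough sketch only: the XML fragment is a hand-written stand-in for what the CIB might contain after cloning a stopped group, the ids other than clo_fs and grp_fs are invented, and target_roles is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Hypothetical CIB fragment. Only clo_fs and grp_fs come from the setup
# described above; the other ids and the value spelling are assumptions.
CIB_FRAGMENT = """\
<resources>
  <clone id="clo_fs">
    <group id="grp_fs">
      <meta_attributes id="grp_fs-meta_attributes">
        <nvpair id="grp_fs-meta_attributes-target-role"
                name="target-role" value="Stopped"/>
      </meta_attributes>
      <primitive id="lvm1" class="ocf" provider="heartbeat" type="LVM"/>
      <primitive id="fs1" class="ocf" provider="heartbeat" type="Filesystem"/>
    </group>
  </clone>
</resources>
"""

RESOURCE_TAGS = {"clone", "group", "primitive"}


def target_roles(resource):
    """Map resource id -> target-role for `resource` and everything nested in it."""
    found = {}
    for elem in resource.iter():
        if elem.tag not in RESOURCE_TAGS:
            continue
        # Only meta_attributes that are direct children belong to this resource.
        for nvpair in elem.findall("meta_attributes/nvpair"):
            if nvpair.get("name") == "target-role":
                found[elem.get("id")] = nvpair.get("value")
    return found


if __name__ == "__main__":
    root = ET.fromstring(CIB_FRAGMENT)
    clone = root.find("clone[@id='clo_fs']")
    for rsc_id, role in target_roles(clone).items():
        print(f"{rsc_id}: target-role={role}")
```

Any id reported here is a candidate for the manual cleanup described above. An empty result would suggest that the clone is being held back by something else.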