[ClusterLabs] Colocation and ordering with live migration

Mon Oct 10 16:56:39 UTC 2016

On 10/10/2016 10:21 AM, Klaus Wenninger wrote:
> On 10/10/2016 04:54 PM, Ken Gaillot wrote:
>> On 10/10/2016 07:36 AM, Pavel Levshin wrote:
>>> 10.10.2016 15:11, Klaus Wenninger:
>>>> On 10/10/2016 02:00 PM, Pavel Levshin wrote:
>>>>> 10.10.2016 14:32, Klaus Wenninger:
>>>>>> Why are the order constraints between libvirt and the VMs optional?
>>>>> If they were mandatory, then all the virtual machines would be
>>>>> restarted whenever libvirtd restarts. This is neither desired nor
>>>>> needed. When it happens, the node is fenced because it is unable to
>>>>> restart the VMs in the absence of a working libvirtd.
>>>> I was guessing something like that...
>>>> So let me reformulate my question:
>>>>    Why does libvirtd have to be restarted?
>>>> If it is because of config changes, making it reloadable might be a
>>>> solution...
>>>>
>>> Right, config changes come to mind first of all. But sometimes a
>>> service, including libvirtd, may fail unexpectedly. In that case I would
>>> prefer to restart it without disturbing the VirtualDomain resources,
>>> which would otherwise keep failing.
>> I think the mandatory colocation of VMs with libvirtd negates your goal.
>> If libvirtd stops, the VMs will have to stop anyway because they can't
>> be colocated with libvirtd. Making the colocation optional should fix that.
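To make the difference between optional and mandatory constraints concrete, here is a toy Python model. It is not Pacemaker's scheduler; it only computes which resources are forced to stop when one resource stops, under these simplifying assumptions: a Mandatory order drags the "then" resource down with the "first" one, an Optional order does not, and only an INFINITY colocation score forces the dependent resource off the node. The names Order, Colocation and forced_to_stop are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, Set

INFINITY = float("inf")  # stands in for Pacemaker's INFINITY score


@dataclass(frozen=True)
class Order:
    first: str
    then: str
    kind: str  # "Mandatory" or "Optional"


@dataclass(frozen=True)
class Colocation:
    rsc: str       # dependent resource
    with_rsc: str  # resource it wants to share a node with
    score: float


def forced_to_stop(stopped: str, orders: Iterable[Order],
                   colocations: Iterable[Colocation]) -> Set[str]:
    """Resources this simplified model forces to stop when `stopped` stops."""
    orders, colocations = list(orders), list(colocations)
    affected: Set[str] = set()
    pending = [stopped]
    while pending:
        current = pending.pop()
        for o in orders:
            # Mandatory order: `then` must go down before `first` can stop.
            if o.first == current and o.kind == "Mandatory" and o.then not in affected:
                affected.add(o.then)
                pending.append(o.then)
        for c in colocations:
            # INFINITY colocation: `rsc` may not stay on a node without `with_rsc`.
            if c.with_rsc == current and c.score == INFINITY and c.rsc not in affected:
                affected.add(c.rsc)
                pending.append(c.rsc)
    affected.discard(stopped)
    return affected


# Setup described in the thread: optional order plus a mandatory colocation.
optional_order = [Order("libvirtd", "vm1", "Optional")]
print(forced_to_stop("libvirtd", optional_order,
                     [Colocation("vm1", "libvirtd", INFINITY)]))  # {'vm1'}
# Suggested change: make the colocation optional (finite score) as well.
print(forced_to_stop("libvirtd", optional_order,
                     [Colocation("vm1", "libvirtd", 100)]))       # set()
```

In this model the optional order alone never stops vm1; it is the INFINITY colocation that does, which is the point made above.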
>>
>>> The question is: why does the cluster not obey the optional constraint
>>> when both libvirtd and a VM stop in a single transition?
>> If it truly is in the same transition, then it should be honored.
>>
>> You have *mandatory* constraints for DLM -> CLVMd -> cluster-config ->
>> libvirtd, but only an *optional* constraint for libvirtd -> VMs.
>> Therefore, libvirtd will generally have to wait longer than the VMs
>> before it can be started.
>>
>> It might help to add mandatory constraints for cluster-config -> VMs.
>> That way, they have the same requirements as libvirtd, and are more
>> likely to start in the same transition.
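A rough sketch of why the mandatory chain delays libvirtd: the hypothetical start_depth function below gives each resource the length of its longest chain of mandatory predecessors. It assumes, as a simplification, that resources with equal depth become startable at about the same time, and it ignores optional orders, which only order actions that already fall into the same transition.

```python
from typing import Dict, List, Tuple


def start_depth(resources: List[str],
                mandatory: List[Tuple[str, str]]) -> Dict[str, int]:
    """Longest chain of mandatory (first, then) predecessors per resource.

    Assumes the constraint graph is acyclic.
    """
    preds: Dict[str, List[str]] = {r: [] for r in resources}
    for first, then in mandatory:
        preds[then].append(first)
    depth: Dict[str, int] = {}

    def visit(r: str) -> int:
        if r not in depth:
            # A resource with no mandatory predecessors gets depth 0.
            depth[r] = 1 + max((visit(p) for p in preds[r]), default=-1)
        return depth[r]

    for r in resources:
        visit(r)
    return depth


resources = ["dlm", "clvmd", "cluster-config", "libvirtd", "vm1"]
chain = [("dlm", "clvmd"), ("clvmd", "cluster-config"),
         ("cluster-config", "libvirtd")]

before = start_depth(resources, chain)
print(before["libvirtd"], before["vm1"])  # 3 0
after = start_depth(resources, chain + [("cluster-config", "vm1")])
print(after["libvirtd"], after["vm1"])    # 3 3
```

With only the optional order, vm1 sits at depth 0 and becomes startable long before libvirtd; adding cluster-config -> vm1 puts both at the same depth, which is why they are more likely to land in the same transition.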
>>
>> However, I'm sure there are still problematic situations. What you want
>> is a simple idea, but a rather complex specification: "If rsc1 fails,
>> block any instances of this other RA on the same node."
>>
>> It might be possible to come up with some node attribute magic to
>> enforce this. You'd need some custom RAs. I imagine something like one
>> RA that sets a node attribute, and another RA that checks it.
>>
>> The setter would be grouped with libvirtd. Any time libvirtd starts,
>> the setter would set a node attribute on the local node. Any time
>> libvirtd stops or fails, the setter would unset the attribute value.
>>
>> The checker would simply monitor the attribute, and fail if the
>> attribute is unset. The group would have on-fail=block. So any time the
>> attribute was unset, the VM would not be started or stopped. (There
>> would be no constraints between the two groups -- the checker RA would
>> take the place of constraints.)
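The setter/checker idea can be simulated in a few lines of Python. This is a minimal sketch, not a real OCF agent: the attribute name "libvirtd-ok", the Node class and all function names are hypothetical, node attributes are a plain dict, and on-fail=block is reduced to a single "blocked" flag. A real pair of agents would presumably use attrd_updater to set, delete and query the attribute, and the recorded failure would be cleared with crm_resource --cleanup; check their man pages before relying on that.

```python
from typing import Dict

# OCF return codes used by resource agents
OCF_SUCCESS = 0
OCF_ERR_GENERIC = 1

# Hypothetical node attribute name; not defined anywhere in Pacemaker.
ATTR = "libvirtd-ok"


class Node:
    """Toy stand-in for one cluster node and its node attributes."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.attrs: Dict[str, str] = {}
        # True once the checker's monitor has failed and on-fail=block applies.
        self.blocked = False


def setter_start(node: Node) -> int:
    """Setter 'start': runs in libvirtd's group and publishes the attribute."""
    node.attrs[ATTR] = "1"
    return OCF_SUCCESS


def setter_stop(node: Node) -> int:
    """Setter 'stop': libvirtd is going away, so withdraw the attribute."""
    node.attrs.pop(ATTR, None)
    return OCF_SUCCESS


def checker_monitor(node: Node) -> int:
    """Checker 'monitor': healthy only while the attribute is set."""
    return OCF_SUCCESS if node.attrs.get(ATTR) == "1" else OCF_ERR_GENERIC


def run_monitor(node: Node) -> int:
    """Simulate the cluster running the checker's recurring monitor."""
    rc = checker_monitor(node)
    if rc != OCF_SUCCESS:
        node.blocked = True  # on-fail=block: stop managing the VM group
    return rc


def vm_action_allowed(node: Node) -> bool:
    """Whether the cluster may start or stop VMs in the checker's group."""
    return not node.blocked


def cleanup(node: Node) -> None:
    """Administrator clears the recorded failure once libvirtd is back."""
    node.blocked = False


node = Node("node1")
setter_start(node)              # libvirtd group started
run_monitor(node)               # checker healthy
setter_stop(node)               # libvirtd being restarted
run_monitor(node)               # checker "fails"
print(vm_action_allowed(node))  # False: running VMs are left untouched
setter_start(node)              # libvirtd is back
cleanup(node)
print(vm_action_allowed(node))  # True
```

The usage at the bottom walks through the libvirtd restart case: the VMs are never stopped, but the checker failure has to be cleaned up afterwards.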
> 
> How would that behave differently from just putting libvirtd
> into this on-fail=block group (apart, of course, from the
> possibility of splitting the VMs into more than one group)?

You could stop or restart libvirtd without stopping the VMs. It would
cause a "failure" of the checker that would need to be cleaned up later,
but the VMs wouldn't stop.

>>
>> I haven't thought through all possible scenarios, but it seems feasible
>> to me.
>>
>>> In my eyes, these services are bound by an obvious HARD colocation
>>> constraint: VirtualDomain should never, ever be touched in the absence
>>> of a working libvirtd. Unfortunately, I cannot figure out a way to
>>> express this constraint in the cluster.
>>>
>>>
>>> -- 
>>> Pavel Levshin