[ClusterLabs] Antw: Re: Xen Migration/resource cleanup problem in SLES11 SP3

Fri Oct 9 06:08:59 UTC 2015

>>> Dejan Muhamedagic <dejanmm at fastmail.fm> wrote on 08.10.2015 at 16:13 in
message <20151008141357.GB15084 at tuttle.linbit>:
> Hi,
> 
> On Thu, Oct 08, 2015 at 02:29:08PM +0200, Ulrich Windl wrote:
>> Hi!
>> 
>> I'd like to report an "interesting problem" with SLES11 SP3+HAE (latest updates):
>> 
>> When doing "rcopenais stop" on node "h10" with three Xen-VMs running, the cluster tried to migrate those VMs to other nodes (OK).
>> 
>> However, migration failed on the remote nodes, but the cluster thought migration was successful. Later the cluster restarted the VMs (BAD).
>> 
>> Oct  8 13:19:17 h10 Xen(prm_xen_v07)[16537]: INFO: v07: xm migrate to h01 succeeded.
>> Oct  8 13:20:38 h01 Xen(prm_xen_v07)[9027]: ERROR: v07: Not active locally, migration failed!
> 
> xm did report success in migrate_to, but the overall migration
> should've been considered failed, because migrate_from failed. Is
> your timeout too low? The failure msg is logged 81 seconds
> later, provided the clocks are in sync.

First, the timeout is on the order of 5 minutes, and the clocks are "very much in sync" (TM) ;-)
The reason is that Xen failed to unpause the VM. My guess is that the node where the (paravirtualized) VM was started has a somewhat newer CPU than the target node, and that this causes the migration to fail.
In an ideal world Xen wouldn't even attempt a migration if the CPU on the target node cannot run the VM. In a less perfect world the error should at least be detected after the failure.
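
A minimal sketch of such a pre-check, in Python. It assumes that the "flags" line of /proc/cpuinfo on each host is a usable proxy for what the guest may rely on, which may not hold under Xen (dom0 can see a masked flag set). The function names are hypothetical; this is not something Xen or the Xen resource agent does.

```python
def cpu_flags(cpuinfo_text):
    """Return the set of CPU feature flags from /proc/cpuinfo-style text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        # The first "flags" line is enough: all cores normally report the same set.
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


def missing_on_target(source_cpuinfo, target_cpuinfo):
    """Return the flags the source CPU has but the target lacks (sorted)."""
    return sorted(cpu_flags(source_cpuinfo) - cpu_flags(target_cpuinfo))
```

If missing_on_target() returns a non-empty list for the contents of /proc/cpuinfo from source and target, the target would be a questionable migration destination under the above assumption.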

> 
>> Oct  8 13:44:53 h01 pengine[18985]:  warning: unpack_rsc_op_failure: Processing failed op migrate_from for prm_xen_v07 on h01: unknown error (1)
>> 
>> Things got really bad after h10 was eventually rebooted: The cluster restarted the three VMs again, because it thought those VMs were still running on h10! (VERY BAD)
>> During startup, the cluster did not probe the three VMs.
> 
> If a node restarted, how could anything think that there was
> anything still running there? Strange.

Well, basically, starting the cluster node does not mean that the OS on the node has just been rebooted, so resources might have been messed with outside the cluster (they were not, but they could have been). Thus probing on node startup seems like a good idea.

> 
> But anyway, if the migrate_from fails, then the resource
> should still be running at the origin host, right?

No, because it wasn't running where the cluster thought it was running. So after a failed migration the VM is no longer running, although it was running before the migration.

So actually we have two problems:
1: Xen migration failure is not detected in time by the cluster
2: The cluster mixes up nodes and node configurations (this problem has had an SR at SUSE for at least six months, but nobody (me included) seems to know what's wrong). I'd bet that it's a very obscure bug in the cluster communication layer...
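
For problem 1, the rule discussed above (a migration only counts if both halves succeed and the VM is really active on the target) can be sketched in Python as follows. The return codes 0 and 1 are the standard OCF values; the function name and the active_on_target parameter are hypothetical and not taken from the Xen resource agent or Pacemaker.

```python
OCF_SUCCESS = 0      # standard OCF return code for success
OCF_ERR_GENERIC = 1  # standard OCF "unknown error", as in the pengine log line


def migration_succeeded(migrate_to_rc, migrate_from_rc, active_on_target):
    """Return True only if both migration halves succeeded and the VM is active.

    migrate_to_rc:    result of migrate_to on the source node
    migrate_from_rc:  result of migrate_from on the target node
    active_on_target: whether the VM was seen running (unpaused) on the target
    """
    if migrate_to_rc != OCF_SUCCESS:
        return False
    if migrate_from_rc != OCF_SUCCESS:
        return False
    return bool(active_on_target)
```

With the values from the logs (migrate_to succeeded on h10, migrate_from failed on h01 with "Not active locally"), migration_succeeded(OCF_SUCCESS, OCF_ERR_GENERIC, False) is False, so the migration would be treated as failed.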

Regards,
Ulrich