[ClusterLabs] resources do not migrate although node is going to standby

Ken Gaillot kgaillot at redhat.com
Mon Jul 24 21:51:30 UTC 2017


On Mon, 2017-07-24 at 20:52 +0200, Lentes, Bernd wrote:
> Hi,
> 
> just to be sure:
> I have a VirtualDomain resource (called prim_vm_servers_alive) running on one node (ha-idg-2). For reasons I don't remember, I have a location constraint:
> location cli-prefer-prim_vm_servers_alive prim_vm_servers_alive role=Started inf: ha-idg-2
> 
> Now I'm trying to set this node to standby because I need to reboot it.
> As I understand it now, the resource can't migrate to node ha-idg-1 because of this constraint. Right?

Right, the "inf:" makes it mandatory. BTW, the "cli-" at the beginning
indicates that this constraint was created by a command-line tool such
as pcs, the crm shell, or crm_resource. Such tools implement "ban"/"move"
type commands by adding constraints like this, and then offer a separate
manual command to remove them again (e.g. "pcs resource clear").
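
In your case, if you want the VM to be allowed on ha-idg-1 again, you
can simply remove that constraint before (or after) putting the node in
standby. A rough sketch, untested, adjust the resource name to yours:

  # with pcs
  pcs resource clear prim_vm_servers_alive

  # or with the crm shell, delete the constraint by its ID
  crm configure delete cli-prefer-prim_vm_servers_alive

Either way the cli-prefer-* location constraint disappears from the
configuration and the resource is free to run on either node again.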

> 
> That's what the log says:
> Jul 21 18:03:50 ha-idg-2 VirtualDomain(prim_vm_servers_alive)[28565]: ERROR: Server_Monitoring: live migration to qemu+ssh://ha-idg-1/system  failed: 1
> Jul 21 18:03:50 ha-idg-2 lrmd[8573]:   notice: operation_finished: prim_vm_servers_alive_migrate_to_0:28565:stderr [ error: Requested operation is not valid: domain 'Server_Monitoring' is already active ]
> Jul 21 18:03:50 ha-idg-2 crmd[8576]:   notice: process_lrm_event: Operation prim_vm_servers_alive_migrate_to_0: unknown error (node=ha-idg-2, call=114, rc=1, cib-update=572, confirmed=true)
> Jul 21 18:03:50 ha-idg-2 crmd[8576]:   notice: process_lrm_event: ha-idg-2-prim_vm_servers_alive_migrate_to_0:114 [ error: Requested operation is not valid: domain 'Server_Monitoring' is already active\n ]
> Jul 21 18:03:50 ha-idg-2 crmd[8576]:  warning: status_from_rc: Action 64 (prim_vm_servers_alive_migrate_to_0) on ha-idg-2 failed (target: 0 vs. rc: 1): Error
> Jul 21 18:03:50 ha-idg-2 crmd[8576]:   notice: abort_transition_graph: Transition aborted by prim_vm_servers_alive_migrate_to_0 'modify' on ha-idg-2: Event failed (magic=0:1;64:417:0:656ecd4a-f8e8-46c9-b4e6-194616237988, cib=0.879.5, source=match_graph_event:350, 0)
> Jul 21 18:03:50 ha-idg-2 crmd[8576]:  warning: status_from_rc: Action 64 (prim_vm_servers_alive_migrate_to_0) on ha-idg-2 failed (target: 0 vs. rc: 1): Error
> Jul 21 18:03:53 ha-idg-2 VirtualDomain(prim_vm_mausdb)[28564]: ERROR: mausdb_vm: live migration to qemu+ssh://ha-idg-1/system  failed: 1
> 
> That is how I understand "Requested operation is not valid": the migration isn't possible because of the constraint.
> I just wanted to be sure. And because the resource can't be migrated while the host goes to standby, the resource is stopped instead. Right?
> 
> What's strange is that a second resource running on node ha-idg-2, called prim_vm_mausdb, also didn't migrate to the other node. That's something I don't understand completely.
> That resource didn't have any location constraint.
> Both VirtualDomains have a VNC server configured (so that I can monitor the boot procedure if I have startup problems). The VNC port for prim_vm_mausdb is fixed to 5900 in its configuration file.
> The port is set to auto for prim_vm_servers_alive because I forgot to configure a fixed one, so it must be something like 5900+ because both resources were running concurrently on the same node.
> But prim_vm_mausdb can't migrate because the port is occupied on the other node, ha-idg-1:
> 
> Jul 21 18:03:53 ha-idg-2 VirtualDomain(prim_vm_mausdb)[28564]: ERROR: mausdb_vm: live migration to qemu+ssh://ha-idg-1/system  failed: 1
> Jul 21 18:03:53 ha-idg-2 lrmd[8573]:   notice: operation_finished: prim_vm_mausdb_migrate_to_0:28564:stderr [ error: internal error: early end of file from monitor: possible problem: ]
> Jul 21 18:03:53 ha-idg-2 lrmd[8573]:   notice: operation_finished: prim_vm_mausdb_migrate_to_0:28564:stderr [ Failed to start VNC server on `127.0.0.1:0,share=allow-exclusive': Failed to bind socket: Address already in use ]
> Jul 21 18:03:53 ha-idg-2 lrmd[8573]:   notice: operation_finished: prim_vm_mausdb_migrate_to_0:28564:stderr [  ]
> Jul 21 18:03:53 ha-idg-2 crmd[8576]:   notice: process_lrm_event: Operation prim_vm_mausdb_migrate_to_0: unknown error (node=ha-idg-2, call=110, rc=1, cib-update=573, confirmed=true)
> Jul 21 18:03:53 ha-idg-2 crmd[8576]:   notice: process_lrm_event: ha-idg-2-prim_vm_mausdb_migrate_to_0:110 [ error: internal error: early end of file from monitor: possible problem:\nFailed to start VNC server on `127.0.0.1:0,share=allow-exclusive': Failed to bind socket: Address already in use\n\n ]
> Jul 21 18:03:53 ha-idg-2 crmd[8576]:  warning: status_from_rc: Action 51 (prim_vm_mausdb_migrate_to_0) on ha-idg-2 failed (target: 0 vs. rc: 1): Error
> Jul 21 18:03:53 ha-idg-2 crmd[8576]:  warning: status_from_rc: Action 51 (prim_vm_mausdb_migrate_to_0) on ha-idg-2 failed (target: 0 vs. rc: 1): Error
> 
> Do I understand it correctly that the port is occupied on the node it should migrate to (ha-idg-1)?

It looks like it.
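
To see what is actually holding the port on ha-idg-1, something along
these lines might help (untested, assumes the usual 590x VNC port range):

  # which process is listening on the VNC ports?
  ss -tlnp | grep ':59'

  # which VNC display does each running domain use?
  for d in $(virsh list --name); do echo -n "$d: "; virsh vncdisplay "$d"; done

If a leftover qemu process is still bound to 127.0.0.1:5900 there, that
would explain the "Failed to bind socket" error above.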

> But there is no VM running there, and I don't have a standalone VNC server configured. Why is the port occupied?

Can't help there.
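
One thing that might avoid the collision (just a sketch, assuming a
standard libvirt <graphics> definition in the domain XML): don't give
both guests a chance to land on the same port. Either assign each
domain its own fixed port, or let libvirt pick a free one:

  <!-- fixed, unique port per guest -->
  <graphics type='vnc' port='5901' autoport='no' listen='127.0.0.1'/>

  <!-- or let libvirt choose a free port at start/migration time -->
  <graphics type='vnc' autoport='yes' listen='127.0.0.1'/>

With autoport='yes' the destination qemu binds whichever 590x port is
free, so an occupied 5900 on the target shouldn't block the migration.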

> Btw: are the network sockets live-migrated too during a live migration of a VirtualDomain resource?
> They should be.
> 
> Thanks.
> 
> 
> Bernd

My memory is hazy, but I think TCP connections survive the migration as
long as it completes within the TCP timeout. I could be misremembering.
-- 
Ken Gaillot <kgaillot at redhat.com>
