[ClusterLabs] two node cluster: vm starting - shutting down 15min later - starting again 15min later ... and so on
Ken Gaillot
kgaillot at redhat.com
Thu Feb 9 19:10:34 EST 2017
On 02/09/2017 10:48 AM, Lentes, Bernd wrote:
> Hi,
>
> I have a two-node cluster with a VM as a resource. Currently I'm just testing and playing. My VM boots and shuts down again in 15-minute gaps.
> Surely this is related to "PEngine Recheck Timer (I_PE_CALC) just popped (900000ms)" found in the logs. I googled, and it is said that this
> is due to a time-based rule (http://oss.clusterlabs.org/pipermail/pacemaker/2009-May/001647.html). OK.
> But I don't have any time-based rules.
> This is the config for my vm:
>
> primitive prim_vm_mausdb VirtualDomain \
>         params config="/var/lib/libvirt/images/xml/mausdb_vm.xml" \
>         params hypervisor="qemu:///system" \
>         params migration_transport=ssh \
>         op start interval=0 timeout=90 \
>         op stop interval=0 timeout=95 \
>         op monitor interval=30 timeout=30 \
>         op migrate_from interval=0 timeout=100 \
>         op migrate_to interval=0 timeout=120 \
>         meta allow-migrate=true \
>         meta target-role=Started \
>         utilization cpu=2 hv_memory=4099
>
> The only constraint concerning the VM I had was a location constraint (which I didn't create).
What is the constraint? If its ID starts with "cli-", it was created by
a command-line tool (such as crm_resource, the crm shell, or pcs),
generally by a "move" or "ban" command.
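As a sketch of how to track that down (assuming the crm shell that ships
with SLES; the resource name is taken from the config above):

    # list any tool-generated constraints left in the configuration
    crm configure show | grep cli-

    # clear a leftover move/ban constraint for the resource
    # (older crmsh versions spell this "unmigrate" rather than "unmove")
    crm resource unmove prim_vm_mausdb

Deleting the constraint by its ID with "crm configure delete <id>" should
work as well.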
> OK, this timer is available; I can set it to zero to disable it.
The timer is used for multiple purposes; I wouldn't recommend disabling
it. Also, this doesn't fix the problem; the problem will still occur
whenever the cluster recalculates, just not on a regular time schedule.
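For reference (and again, zero is not recommended), the timer in question
is the cluster-recheck-interval cluster property, which defaults to 15
minutes (the 900000ms in your logs). A sketch with the crm shell:

    # change the recheck interval; 0 would disable it (not recommended)
    crm configure property cluster-recheck-interval=15min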
> But why does it influence my VM in such a manner?
>
> Excerpt from the log:
>
> ...
> Feb 9 16:19:38 ha-idg-1 VirtualDomain(prim_vm_mausdb)[13148]: INFO: Domain mausdb_vm already stopped.
> Feb 9 16:19:38 ha-idg-1 crmd[8407]: notice: process_lrm_event: Operation prim_vm_mausdb_stop_0: ok (node=ha-idg-1, call=401, rc=0, cib-update=340, confirmed=true)
> Feb 9 16:19:38 ha-idg-1 kernel: [852506.947196] device vnet0 entered promiscuous mode
> Feb 9 16:19:38 ha-idg-1 kernel: [852507.008770] br0: port 2(vnet0) entering forwarding state
> Feb 9 16:19:38 ha-idg-1 kernel: [852507.008775] br0: port 2(vnet0) entering forwarding state
> Feb 9 16:19:38 ha-idg-1 kernel: [852507.172120] qemu-kvm: sending ioctl 5326 to a partition!
> Feb 9 16:19:38 ha-idg-1 kernel: [852507.172133] qemu-kvm: sending ioctl 80200204 to a partition!
> Feb 9 16:19:41 ha-idg-1 crmd[8407]: notice: process_lrm_event: Operation prim_vm_mausdb_start_0: ok (node=ha-idg-1, call=402, rc=0, cib-update=341, confirmed=true)
> Feb 9 16:19:41 ha-idg-1 crmd[8407]: notice: process_lrm_event: Operation prim_vm_mausdb_monitor_30000: ok (node=ha-idg-1, call=403, rc=0, cib-update=342, confirmed=false)
> Feb 9 16:19:48 ha-idg-1 kernel: [852517.049015] vnet0: no IPv6 routers present
> ...
> Feb 9 16:34:41 ha-idg-1 VirtualDomain(prim_vm_mausdb)[18272]: INFO: Issuing graceful shutdown request for domain mausdb_vm.
> Feb 9 16:35:06 ha-idg-1 kernel: [853434.550089] br0: port 2(vnet0) entering forwarding state
> Feb 9 16:35:06 ha-idg-1 kernel: [853434.550160] device vnet0 left promiscuous mode
> Feb 9 16:35:06 ha-idg-1 kernel: [853434.550165] br0: port 2(vnet0) entering disabled state
> Feb 9 16:35:06 ha-idg-1 ifdown: vnet0
> Feb 9 16:35:06 ha-idg-1 ifdown: Interface not available and no configuration found.
> Feb 9 16:35:07 ha-idg-1 crmd[8407]: notice: process_lrm_event: Operation prim_vm_mausdb_stop_0: ok (node=ha-idg-1, call=405, rc=0, cib-update=343, confirmed=true)
> ...
>
> I deleted the location constraint, and since then the VM has been running fine for 35 minutes already.
The logs don't go back far enough to tell why the VM was stopped. Also,
logs from the other node might be relevant, if it was the DC (controller)
at the time.
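To see which node is the DC and to collect logs from all nodes, something
along these lines should work (the time window here is just an example):

    # show the current DC
    crm_mon -1 | grep "Current DC"

    # gather logs and cluster state from all nodes for a given window
    crm_report -f "2017-02-09 16:00:00" -t "2017-02-09 17:00:00" /tmp/mausdb-report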
> The system is SLES 11 SP4 64-bit; the VM is SLES 10 SP4 64-bit.
>
> Thanks.
>
> Bernd